Parallel FFT program optimization on heterogeneous computers

Chen, Shuo

Files

2015_ChenShuo_PhD.pdf (2.9 MB)

Date

2015

Authors

Chen, Shuo

Publisher

University of Delaware

Abstract

Generating high-performance Fast Fourier Transform (FFT) libraries is an important research topic both for traditional processors (CPUs) and for newer accelerators such as Graphics Processing Units (GPUs). Large scientific and engineering computations, such as physics simulations, signal processing, and data compression, often spend the majority of their execution time on large FFTs. Such FFT implementations require large amounts of computing resources and memory bandwidth. On the system side, in spite of highly influential results in prior FFT work on GPUs, GPU performance is severely restricted by limited memory size and by the low bandwidth of data transfer over the PCI channel. Additionally, current GPU-based FFT implementations use only the GPU to compute and employ the CPU as a mere memory-transfer controller, so the computing power of the CPU is wasted. On the algorithmic side, input signals are frequently sparse. If an input is known to be sparse, the computational complexity of the FFT can be reduced. Many sparse FFT algorithms have been proposed to exploit this. However, existing sparse FFT implementations are confined to serial execution and are input oblivious, in the sense that how the algorithms work is not affected by input characteristics. In this dissertation, we present two high-performance optimization strategies. First, we study the problems of current GPU-based FFT implementations and propose a hybrid approach for 2D and 3D FFTs that concurrently executes on both the multithreaded CPU and the GPU of a heterogeneous computer to accelerate large FFT problems that cannot fit into GPU memory. Within this scheme, an empirical performance model is constructed to determine the optimal load balance between CPU and GPU, and an optimizer is proposed to exploit substantial parallelism on both the GPU and the CPUs and to overlap communication with computation.
Second, we investigate existing sparse FFT algorithms and propose an input-adaptive model for algorithmic parallelization. In particular, the algorithm takes advantage of the similarity between input samples to save computation and to exploit substantial data parallelism. The solution has a runtime sublinear in the input size and eliminates the dependencies in coefficient estimation, both of which improve parallelism and performance.

URI

http://udspace.udel.edu/handle/19716/17524

Collections

Doctoral Dissertations (Winter 2014 to Present)
