Browsing by Author "Chen, Shuo"
Now showing 1 - 2 of 2
Results Per Page
Sort Options
Item A hybrid GPU/CPU FFT library for large FFT problems(University of Delaware, 2013) Chen, ShuoGraphic Processing Units (GPU) has been proved to be a promising platform to accelerate large size Fast Fourier Transform (FFT) computation. However, current GPU-based FFT implementation only uses GPU to compute, but employs CPU as a mere memory-transfer controller. The computation power in today's high-performance CPU is wasted. In this project, a hybrid optimization framework is proposed to use both CPU and GPU in heterogeneous CPU-GPU systems to compute large scale 2D and 3D FFTs that exceed GPU memory. This work introduces a exible partitioning scheme that makes it possible to decompose FFT for two computing devices with hugely different performance characteristics. The partitioning scheme enables concurrent execution of FFT sub-problems on CPU and GPU. Additionally, our approach integrates several FFT decomposition paradigms to tailor the extraction of computation and communication patterns for CPU and GPU, and in the process exploits more hidden parallelism than other heterogeneous methods. In addition, our work automatically adapts to diff erent hardware confi gurations by tuning for architecture features and the work distribution between GPU and CPU. Several empirical profiling techniques are proposed to characterize the communication and computation of FFT problems on GPU and CPU, and we develop effective heuristics to guide the entire empirical tuning process. Our library also overlaps data transfers to achieve higher bandwidth over PCI bus and equally importantly maintains data and layout consistency between CPU and GPU. We evaluate our hybrid FFT library from three aspects, i.e., optimal load distribution ratios, running time, and precision of result. In particular, the library is compared with CPU based libraries FFTW and Intel MKL, as well as a GPU based library on three GPUs, i.e., NVIDIA GeForce GTX480, Tesla C2070 and Tesla C2075. On average, our large FFT library is 121% and 145% faster than the 4-thread SSE-enabled FFTW and the 4-thread SSE-enabled Intel MKL, with max speedups 4.61 and 2.81, respectively.Item Parallel FFT program optimization on heterogeneous computers(University of Delaware, 2015) Chen, ShuoGenerating high performance Fast Fourier Transform (FFT) library is an important research topic for the traditional processors, CPUs, and new accelerators, like Graphics Processing Units (GPUs). It is not rare that large scientific and engineering computation, such as physics simulations, signal processing and data compression, spend majority of execution time on large size FFTs. Such FFT implementations require large amount of computing resources and memory bandwidth. On the system side, in spite of highly influential results in prior FFT work on GPUs, the GPU performance is severely restricted by the limited memory size and the low bandwidth of data transfer through PCI channel. Additionally, current GPU based FFT implementation only uses GPU to compute, but employs CPU as a mere memory-transfer controller. The computing power of CPUs is wasted. On the algorithmic side, input signals are frequently sparse. If we know that an input is sparse, the computational complexity of FFT can be reduced. Many sparse FFT algorithms have been proposed to improve sparse FFT’s efficiency. However, the existing sparse FFT implementations are confined to serial execution and are input oblivious in the sense that how the algorithms work is not affected by input characteristics. In this dissertation, we present two high performance optimization strategies. First, we study the problems of current GPU based FFT implementations, and propose a hybrid approach for 2D and 3D FFT, which concurrently executes both multithreaded CPU and GPU in a heterogeneous computer to accelerate large FFT problems that cannot fit into GPU memory. Within the scheme, an empirical performance modeling is constructed to determine optimal load balancing between CPU and GPU, and an optimizer is proposed to exploit substantial parallelism for both GPU and CPUs and to overlap communication with computation. Second, we investigate the existing sparse FFT algorithms and propose an input adaptive model for algorithmic parallelization. In particular, the algorithm takes advantage of the similarity between input samples to save much computation and to exploit substantial data parallelism. The solution has runtime sub-linear to the input size and gets rid of coefficient estimation’s dependencies, both of which improve parallelism and performance.