Optimized implementation of small DFT problems on GPU
Date
2015
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
University of Delaware
Abstract
Digital Fourier Transform (DFT) and its fast implementations (FFT) are important and widely used in both research and industry. Thus implementations of generating high performance DFTs and FFTs are significant and many efforts has been done in both algorithm field and implementation field. For most FFT algorithms, decomposition is always used and large DFT problems are divided into small DFT problems. Thus small size DFTs are performance foundation of most FFT algorithms, optimized even optimal implementation of small size DFT problems is necessary and important. GPU architecture has large scale of parallel computing resources which is powerful for computation- intensive and parallel applications. DFT/FFT has huge potential for parallel computing so that GPU is a good choice for implementing DFT/FFT algorithms. Our purpose in this paper is to find out an optimized implementation for small size DFTs/FFTs and possible features of such implementations on GPU. Firstly we build GPU codelets by modifying FFTW codelets. Then an implementation of 1k-size FFT using Cooley-Tukey algorithm based on GPU codelets is given to find out the optimal possible codelets on GPU. Besides Cooley-Tukey algorithm, general DFT definition also has computing potential on GPUs. Thus we implement small size (size 128) FFT by optimizing computations of general DFT definition. At last, a code generator which generates different patterns of programs is built to find out if there are hidden potentials in different computing patterns of general DFT on GPU. Experiments under empirical search idea are completed on groups of small size (size 1k) DFTs. Some experiments show good trend and indicate more ideas. Some experiments do not show much difference but give out more ideas in searching possible computing potential.
