A Search Optimization in FFTW

Gu, Liang
Journal Title
Journal ISSN
Volume Title
University of Delaware
Generating high performance fast Fourier transform(FFT) libraries for different computer architectures is an important task. Architecture vendors sometimes have to rely on dedicated experts to tune FFT implementation on each new platform. Fastest Fourier transform in the West(FFTW) replaces this tedious and repeated work with an adaptive FFT library. It automatically generates FFT code that are comparable to libraries provided by vendors. Part of its success is due to its highly e cient straight-line style code for small DFTs, called codelets. The other part of its success is the result of a large and carefully chosen search space of FFT algorithms. FFTW mainly traverses this space by empirical search, otherwise a simple heuristic is used. However, both methods have their downside. The empirical search method spends a lot of search time on large DFT problems and the simple heuristic often delivers implementation that is much worse than optimum. An ideal approach should nd a reasonably good implementation within the FFT search space in a small amount of time. Model-driven optimization is often believed to be inferior to empirical search. It is very hard to capture all the performance features of an adaptive library on many modern architectures. No one has implemented an adaptive performance model to automatically assist the search of FFT algorithms on multiple architectures. This thesis presents an implicit abstract machine model and a codelet performance model that can be used in the current FFTW framework. With the performance prediction given by these models, the empirical search engine of FFTW can be replaced without serious hurt of performance. This technique also helps to break down the runtime in FFTW and explain the performance of complex combination of DFT algorithms. As a good trade-o , this optimization achieves 95% of the performance of FFTW empirical search and uses less than 5% of the search overhead on four test platforms.