An efficient shuffle-light FFT library
Date
2021
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
University of Delaware
Abstract
The Fast Fourier Transform (FFT) is one of the most widely used computation kernels. Many real-world applications rely on it, including radar, audio processing, and medical imaging, to mention just a few. Indeed, since the FFT's inception in 1965 by James Cooley and John Tukey, the world has changed irreversibly in all aspects of life, and the software revolution has accelerated the impact of FFTs even further. Many everyday technologies, from the MP3 format for digital music to the JPEG format for image compression, depend on the FFT algorithm for their existence. The entire field of Digital Signal Processing has been able to deliver lasting changes to society largely thanks to the FFT algorithm, especially through its software implementations. While the FFT can be applied to problems of any finite size, the factors that determine its performance vary. For small sizes, computational components such as the CPU often determine performance, but a critical phenomenon occurs when the FFT is applied to large problem sizes: the overall performance becomes almost entirely memory bound. In these cases, the impact of the algorithm's computational operations becomes minimal, while the specifics of the memory hierarchy, such as cache sizes and speeds, together with the access patterns of the FFT algorithm, become the dominant factors for performance. Consequently, most of today's research in the FFT domain focuses on optimizing large-size FFTs, which in turn means optimizing memory operational efficiency. The widely adopted approach is to adapt parameters external to the FFT algorithm, such as the data layout and system-software memory management, to the needs of the FFT's memory operation pattern. In other words, most techniques work around the FFT algorithm by adapting other components to its intrinsic behavior.
In this paper, we take a completely different approach by looking into the FFT's internals for a way to mitigate its memory "boundness", especially when dealing with large problem sizes. The low efficiency of the FFT's memory operations is incurred, to a large degree, by the many strided data shuffles intrinsic to the algorithm, and is made worse by the fact that many of those stride sizes are powers of two. Strided data shuffling is already one of the hardest workload patterns for memory systems, and power-of-two stride sizes make such a memory access pattern even more challenging. Addressing this power-of-two strided memory access pattern is the key problem of this paper and our main contribution. Our key idea is founded on the principle of completely separating the logical indexes from the physical indexes of the data accessed internally by the FFT. In previous algorithms, the decomposition of a problem into smaller sub-problems, combined with the particular divide-and-conquer approach, inevitably determines all the strided access patterns needed to execute the algorithm. The Shuffle-Light FFT internal structure allows each sub-problem of the DFT decomposition to have a physical stride that is not bound to the decomposition itself or to the sub-problem size. Instead, each stride can be chosen deliberately to avoid detrimental memory access patterns or to serve some other unforeseen scenario. By maintaining an efficient mapping between the logical indexes and the physical indexes through all passes, almost any stride size can be used and further tuned. As a result, the FFT can be carried out correctly without most of those worst-case power-of-two data shuffles, which greatly improves memory performance and in turn raises overall performance.
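The logical-to-physical index separation can be illustrated with a minimal sketch. The padding scheme and the `physical_index` helper below are hypothetical illustrations of the general principle, not the actual SLFFT internals: a power-of-two logical stride is remapped to a padded physical stride, so that successive strided accesses no longer collide in the same cache set.

```python
# Hypothetical sketch of logical-vs-physical index separation.
# Names and the padding scheme are illustrative, not the SLFFT API.

def physical_index(logical: int, stride: int, pad: int = 1) -> int:
    """Map a logical index, accessed with a power-of-two `stride`,
    to a physical index whose effective stride is `stride + pad`.

    Consecutive strided accesses then land `stride + pad` apart in
    memory, avoiding the pathological power-of-two spacing that maps
    them all onto the same cache set.
    """
    block, offset = divmod(logical, stride)
    return block * (stride + pad) + offset

# Logical indexes 0, 8, 16, 24 (stride 8, a power of two) map to
# physical indexes 0, 9, 18, 27 (stride 9, co-prime with cache sets).
logical_stride = 8
addrs = [physical_index(i * logical_stride, logical_stride)
         for i in range(4)]
print(addrs)  # [0, 9, 18, 27]
```

Within each block of `stride` elements the mapping is the identity plus a per-block shift, so it stays injective and cheap to compute; the pass only has to carry the mapping's parameters instead of physically shuffling the data.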
We compare our Shuffle-Light FFT (SLFFT) library with FFTW, a very widely used FFT library, and achieve an average speedup of 24.7% on problem sizes from 2^6 to 2^26, and a maximal speedup of 39.4%.
Description
Keywords
Cache hierarchy, Codelet, Fast Fourier transform, High performance computing, Memory performance