Optimized Strategies for Mapping Three-dimensional FFTs onto CUDA GPUs
Jing Wu - University of Maryland, College Park
Joseph JaJa - University of Maryland, College Park

Abstract: This paper addresses the problem of mapping three-dimensional Fast Fourier Transforms (FFTs) onto recent, highly multithreaded CUDA Graphics Processing Units (GPUs) and presents some of the fastest known algorithms for a wide range of 3-D FFTs on the NVIDIA Tesla and Fermi architectures. We exploit the high degree of multithreading offered by the CUDA environment while carefully managing the multiple levels of the memory hierarchy in such a way that: (i) all global memory accesses are coalesced into 128-byte device memory transactions, issued so as to optimize effects related to partition camping, locality, and associativity; and (ii) all computations are carried out on the registers, with effective data movement involved in shared memory transposition. In particular, the number of global memory accesses to the entire 3-D dataset is minimized, and the FFT computations along the X dimension are almost completely overlapped with the global memory data transfers needed to compute the FFTs along the Y or Z dimensions. We achieve performance between 135 Gflops and 172 Gflops on the Tesla architecture (Tesla C1060 and GTX280) and between 192 Gflops and 290 Gflops on the Fermi architecture (Tesla C2050 and GTX480). The bandwidths achieved by our algorithms reach over 90 GB/s on the GTX280 and around 140 GB/s on the GTX480.
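The abstract refers to FFTs computed separately along the X, Y, and Z dimensions. This relies on the standard separability of the multidimensional DFT: a 3-D transform can be obtained by applying 1-D transforms along each axis in turn. The following is a minimal CPU sketch of that decomposition only, not the authors' GPU algorithm. It uses a naive O(n^2) DFT in place of an FFT, and the `a[z][y][x]` layout (X varying fastest) and all function names are assumptions made for illustration.

```python
import cmath


def dft1d(x):
    """Naive O(n^2) 1-D DFT of a sequence of complex numbers."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft3d_by_axes(a):
    """3-D DFT computed as 1-D DFTs along X, then Y, then Z.

    The volume is indexed a[z][y][x] (an assumed layout). The input is
    not modified; a new nested list is returned.
    """
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    out = [[list(row) for row in plane] for plane in a]
    # X pass: each row is contiguous in this layout.
    for z in range(nz):
        for y in range(ny):
            out[z][y] = dft1d(out[z][y])
    # Y pass: elements of one column are a full row apart (strided access).
    for z in range(nz):
        for x in range(nx):
            col = dft1d([out[z][y][x] for y in range(ny)])
            for y in range(ny):
                out[z][y][x] = col[y]
    # Z pass: elements are a full plane apart (largest stride).
    for y in range(ny):
        for x in range(nx):
            col = dft1d([out[z][y][x] for z in range(nz)])
            for z in range(nz):
                out[z][y][x] = col[z]
    return out


def dft3d_direct(a):
    """Reference 3-D DFT evaluated directly from the triple-sum definition."""
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    out = [[[0j] * nx for _ in range(ny)] for _ in range(nz)]
    for kz in range(nz):
        for ky in range(ny):
            for kx in range(nx):
                total = 0j
                for z in range(nz):
                    for y in range(ny):
                        for x in range(nx):
                            phase = kx * x / nx + ky * y / ny + kz * z / nz
                            total += a[z][y][x] * cmath.exp(-2j * cmath.pi * phase)
                out[kz][ky][kx] = total
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two volumes."""
    return max(
        abs(p - q)
        for pa, pb in zip(a, b)
        for ra, rb in zip(pa, pb)
        for p, q in zip(ra, rb)
    )


def make_volume(nz, ny, nx):
    """Small deterministic complex test volume (values are arbitrary)."""
    return [
        [[complex(x + 2 * y, z - x) for x in range(nx)] for y in range(ny)]
        for z in range(nz)
    ]


if __name__ == "__main__":
    vol = make_volume(2, 3, 4)
    diff = max_abs_diff(dft3d_by_axes(vol), dft3d_direct(vol))
    print("max difference between axis-wise and direct 3-D DFT:", diff)
```

In this sketch only the X pass reads contiguous elements; the Y and Z passes gather elements that are a row or a plane apart. That strided access pattern is the kind of data movement the abstract's coalescing and shared memory transposition strategies are concerned with on the GPU.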