More Web Proxy on the site http://driver.im/

research-article

Accelerating Fourier and Number Theoretic Transforms using Tensor Cores and Warp Shuffles

Authors:

Sultan Durrani,

Muhammad Saad Chughtai,

Mert Hidayetoglu,

Lawrence Rauchwerger,

Wen-mei HwuAuthors Info & Claims

PACT '21: Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques

Pages 345 - 355

https://doi.org/10.1109/PACT52795.2021.00032

Published: 26 November 2024 Publication History

Abstract

The discrete Fourier transform (DFT) and its specialized case, the number theoretic transform (NTT), are two important mathematical tools having applications in several areas of science and engineering. However, despite their usefulness and utility, their adoption continues to be a challenge as computing the DFT of a signal can be a time-consuming and expensive operation. To speed things up, fast Fourier transform (FFT) algorithms, which are reduced-complexity formulations for computing the DFT of a sequence, have been proposed and implemented for traditional processors and their corresponding instruction sets. With the rise of GPUs, NVIDIA introduced its own FFT computation library called cuFFT, which leverages the power of GPUs to compute the DFT. However, as this paper demonstrates, there is a lot of room for improvement to accelerate the FFT and NTT algorithms on modern GPUs by utilizing specialized operations and architectural advancements. In particular, we present four major types of optimizations that leverage tensor cores and the warp-shuffle instruction. Through extensive evaluations, we show that our approach consistently outperforms existing GPU-based implementations with a speedup of up to 4× for NTT and a speed of up to 1.5× for FFT.

References

[1]

J. Goldthwaite. The value of digital signal processing. [Online]. Available: https://www.sensear.com/blog/the-value-of-digital-signal-processing

[2]

Y. Doröz, E. Özturk, and B. Sunar, "A million-bit multiplier architecture for fully homomorphic encryption", Microprocessors and Microsystems, vol. 38, no. 8, pp. 766--775, 2014.

Digital Library

[3]

D. Coldewey. Duality scores $14m darpa contract for hardware-accelerated homomorphic encryption. [Online]. Available: https://techcrunch.com/2021/02/03/duality-scores-14m-darpa-contract-for-hardware-accelerated-homomorphic-encryption/amp/

[4]

A. A. Al Sallab, H. Fahmy, and M. Rashwan, "Optimized hardware implementation of fft processor", in 2009 4th International Design and Test Workshop (IDT). IEEE, 2009, pp. 1--5.

[5]

M. Frigo and S. G. Johnson, "FFTW: An adaptive software architecture for the fft", in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (Cat. No. 98CH36181), vol. 3. IEEE, 1998, pp. 1381--1384.

[6]

C. D. Zone. CUDA toolkit documentation -. [Online]. Available: https://docs.nvidia.com/cuda/cufft/index.html

[7]

T. NVIDIA, "V100 GPU architecture", 2017.

[8]

NVIDIA, "A100 GPU architecture", 2020. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf...

[9]

D. B. Kirk and W. H. Wen-Mei, Programming massively parallel processors: A hands-on approach. Morgan kaufmann, 2016.

[10]

M. Rahman, Applications of Fourier transforms to generalized functions. WIT Press, 2011.

[11]

P. Duhamel and M. Vetterli, "Fast Fourier transforms: A tutorial review and a state of the art", Signal Processing (Elsevier), vol. 19, no. ARTICLE, pp. 259--299, 1990.

[12]

C. Burrus, "Index mappings for multidimensional formulation of the dft and convolution", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, no. 3, pp. 239--242, 1977.

[13]

J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex fourier series", Mathematics of computation, vol. 19, no. 90, pp. 297--301, 1965.

[14]

A. Al Badawi, B. Veeravalli, C. F. Mun, and K. M. M. Aung, "High-performance fv somewhat homomorphic encryption on GPUs: An implementation using CUDA", IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 70--95, 2018.

[15]

H. Chen, K. Han, Z. Huang, A. Jalali, and K. Laine, "Simple encrypted arithmetic library v2. 3.0", Microsoft Research, December, 2017.

[16]

S. Kim, W. Jung, J. Park, and J. H. Ahn, "Accelerating number theoretic transformations for bootstrappable homomorphic encryption on GPUs", in 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2020, pp. 264--275.

[17]

W. T. Cochran, J. W. Cooley, D. L. Favin, H. D. Helms, R. A. Kaenel, W. W. Lang, G. C. Maling, D. E. Nelson, C. M. Rader, and P. D. Welch, "What is the fast fourier transform?" Proceedings of the IEEE, vol. 55, no. 10, pp. 1664--1674, 1967.

[18]

W. Dai and B. Sunar, "cuhe: A homomorphic encryption accelerator library", in International Conference on Cryptography and Information Security in the Balkans. Springer, 2015, pp. 169--186.

[19]

X. Cui, Y. Chen, and H. Mei, "Improving performance of matrix multiplication and fft on GPU", in 2009 15th International Conference on Parallel and Distributed Systems. IEEE, 2009, pp. 42--48.

Digital Library

[20]

O. Fialka and M. Cadik, "FFT and convolution performance in image filtering on GPU", in Tenth International Conference on Information Visualisation (IV'06). IEEE, 2006, pp. 609--614.

Digital Library

[21]

K. Moreland and E. Angel, "The FFT on a GPU", in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, 2003, pp. 112--119.

Digital Library

[22]

J. L. Mitchell, M. Y. Ansari, and E. Hart, "Advanced image processing with directx® 9 pixel shaders", ShaderX2, pp. 439--464, 2004.

[23]

N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli, "High performance discrete fourier transforms on graphics processors", in SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Ieee, 2008, pp. 1--12.

Digital Library

[24]

Y. Chen, X. Cui, and H. Mei, "Large-scale FFT on GPU clusters", in Proceedings of the 24th ACM International Conference on Supercomputing, 2010, pp. 315--324.

Digital Library

[25]

S. Keskin, E. Erdil, and T. Kocak, "An efficient parallel implementation of 3D-FFT on gpu", in Proc. IEEE High Perform. Extreme Comput. Conf., 2017, pp. 1--4.

[26]

A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka, "Bandwidth intensive 3-D FFT kernel for gpus using CUDA", in SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE, 2008, pp. 1--11.

Digital Library

[27]

S. Kim, W. Jung, J. Park, and J. H. Ahn, "Accelerating number theoretic transformations for bootstrappable homomorphic encryption on GPUs", in 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2020, pp. 264--275.

[28]

S. Hatfield, M. Chantry, P. Düben, and T. Palmer, "Accelerating high-resolution weather models with deep-learning hardware", in Proceedings of the Platform for Advanced Scientific Computing Conference, 2019, pp. 1--11.

Digital Library

[29]

C. A. Navarro, R. Carrasco, R. J. Barrientos, J. A. Riquelme, and R. Vega, "Gpu tensor cores for fast arithmetic reductions", arXiv preprint arXiv:2001.05585, 2020.

[30]

A. Dakkak, C. Li, J. Xiong, I. Gelado, and W.-m. Hwu, "Accelerating reduction and scan using tensor core units", in Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 46--57.

Digital Library

[31]

A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers", in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 603--613.

Digital Library

[32]

S. Sioutas, S. Stuijk, T. Basten, L. Somers, and H. Corporaal, "Programming tensor cores from an image processing dsl", in Proceedings of the 23th International Workshop on Software and Compilers for Embedded Systems, 2020, pp. 36--41.

Digital Library

[33]

A. Abdelfattah, S. Tomov, and J. Dongarra, "Matrix multiplication on batches of small matrices in half and half-complex precisions", Journal of Parallel and Distributed Computing, vol. 145, pp. 188--201, 2020.

[34]

N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A gpu performance evaluation", arXiv preprint arXiv:1412.7580, 2014.

[35]

A. Sorna, X. Cheng, E. D'azevedo, K. Won, and S. Tomov, "Optimizing the fast fourier transform using mixed precision on tensor core hardware", in 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW). IEEE, 2018, pp. 3--7.

[36]

B. Li, S. Cheng, and J. Lin, "tcfft: Accelerating half-precision fft through tensor cores", 2021.

[37]

S. Durrani, M. S. Chughtai, A. Dakkak, W.-m. Hwu, and L. Rauchwerger, "Fft blitz: the tensor cores strike back", in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 488--489.

Digital Library

[38]

S. H. K. Durrani, "Utilizing gpu tensor cores for algorithmic acceleration", 2020.

[39]

C. M. Rader, "Discrete fourier transforms when the number of data samples is prime", Proceedings of the IEEE, vol. 56, no. 6, pp. 1107--1108, 1968.

[40]

L. Bluestein, "A linear filtering approach to the computation of discrete fourier transform", IEEE Transactions on Audio and Electroacoustics, vol. 18, no. 4, pp. 451--455, 1970.

Cited By

Ibrahim MAga S(2024)Pimacolaba: Collaborative Acceleration for FFT on Commercial Processing-In-Memory ArchitecturesProceedings of the International Symposium on Memory Systems10.1145/3695794.3695796(13-25)Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3695794.3695796
Ha DZhang YKao CHughes CRo WTseng H(2024)M3XU: Achieving High-Precision and Complex Matrix Multiplication with Low-Precision MXUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00016(1-16)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00016

Index Terms

Index terms have been assigned to the content through auto-classification.

Recommendations

Small Discrete Fourier Transforms on GPUs
CCGRID '11: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Efficient implementations of the Discrete Fourier Transform (DFT) for GPUs provide good performance with large data sizes, but are not competitive with CPU code for small data sizes. On the other hand, several applications perform multiple DFTs on small ...
FFT blitz: the tensor cores strike back
PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

The fast Fourier Transform (FFT), a reduced-complexity formulation of the Discrete Fourier Transform (DFT), is an important tool in many areas of science and engineering. FFTW is a well-known package that follows this approach and is currently one of ...
Memory-accelerated parallel method for multidimensional fast fourier implementation on GPU
Abstract
Fast Fourier transform (FFT) is a well-known algorithm that calculates the discrete Fourier transform (DFT) of discrete data and is an essential tool in scientific and engineering computation. Due to the large amounts of data, parallelly executing ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PACT '21: Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques

September 2021

370 pages

ISBN:9781665442787

Editor:
Jaejin Lee
Seoul National University Albert Cohen, Google

Sponsors

IFIP WG 10.3: IFIP WG 10.3
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS\DATC: IEEE Computer Society

Publisher

IEEE Press

Publication History

Published: 26 November 2024

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

PACT '21

Sponsor:

IFIP WG 10.3
SIGARCH
IEEE-CS\DATC

PACT '21: International Conference on Parallel Architectures and Compilation Techniques

September 26 - 29, 2021

GA, Atlanta, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
5
Total Downloads

Downloads (Last 12 months)5
Downloads (Last 6 weeks)4

Reflects downloads up to 13 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Ibrahim MAga S(2024)Pimacolaba: Collaborative Acceleration for FFT on Commercial Processing-In-Memory ArchitecturesProceedings of the International Symposium on Memory Systems10.1145/3695794.3695796(13-25)Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3695794.3695796
Ha DZhang YKao CHughes CRo WTseng H(2024)M3XU: Achieving High-Precision and Complex Matrix Multiplication with Low-Precision MXUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00016(1-16)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00016

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents