[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1109/PACT52795.2021.00032acmconferencesArticle/Chapter ViewAbstractPublication PagespactConference Proceedingsconference-collections
research-article

Accelerating Fourier and Number Theoretic Transforms using Tensor Cores and Warp Shuffles

Published: 26 November 2024 Publication History

Abstract

The discrete Fourier transform (DFT) and its specialized case, the number theoretic transform (NTT), are two important mathematical tools having applications in several areas of science and engineering. However, despite their usefulness and utility, their adoption continues to be a challenge as computing the DFT of a signal can be a time-consuming and expensive operation. To speed things up, fast Fourier transform (FFT) algorithms, which are reduced-complexity formulations for computing the DFT of a sequence, have been proposed and implemented for traditional processors and their corresponding instruction sets. With the rise of GPUs, NVIDIA introduced its own FFT computation library called cuFFT, which leverages the power of GPUs to compute the DFT. However, as this paper demonstrates, there is a lot of room for improvement to accelerate the FFT and NTT algorithms on modern GPUs by utilizing specialized operations and architectural advancements. In particular, we present four major types of optimizations that leverage tensor cores and the warp-shuffle instruction. Through extensive evaluations, we show that our approach consistently outperforms existing GPU-based implementations with a speedup of up to 4× for NTT and a speed of up to 1.5× for FFT.

References

[1]
J. Goldthwaite. The value of digital signal processing. [Online]. Available: https://www.sensear.com/blog/the-value-of-digital-signal-processing
[2]
Y. Doröz, E. Özturk, and B. Sunar, "A million-bit multiplier architecture for fully homomorphic encryption", Microprocessors and Microsystems, vol. 38, no. 8, pp. 766--775, 2014.
[3]
D. Coldewey. Duality scores $14m darpa contract for hardware-accelerated homomorphic encryption. [Online]. Available: https://techcrunch.com/2021/02/03/duality-scores-14m-darpa-contract-for-hardware-accelerated-homomorphic-encryption/amp/
[4]
A. A. Al Sallab, H. Fahmy, and M. Rashwan, "Optimized hardware implementation of fft processor", in 2009 4th International Design and Test Workshop (IDT). IEEE, 2009, pp. 1--5.
[5]
M. Frigo and S. G. Johnson, "FFTW: An adaptive software architecture for the fft", in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (Cat. No. 98CH36181), vol. 3. IEEE, 1998, pp. 1381--1384.
[6]
C. D. Zone. CUDA toolkit documentation -. [Online]. Available: https://docs.nvidia.com/cuda/cufft/index.html
[7]
T. NVIDIA, "V100 GPU architecture", 2017.
[8]
NVIDIA, "A100 GPU architecture", 2020. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf...
[9]
D. B. Kirk and W. H. Wen-Mei, Programming massively parallel processors: A hands-on approach. Morgan kaufmann, 2016.
[10]
M. Rahman, Applications of Fourier transforms to generalized functions. WIT Press, 2011.
[11]
P. Duhamel and M. Vetterli, "Fast Fourier transforms: A tutorial review and a state of the art", Signal Processing (Elsevier), vol. 19, no. ARTICLE, pp. 259--299, 1990.
[12]
C. Burrus, "Index mappings for multidimensional formulation of the dft and convolution", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, no. 3, pp. 239--242, 1977.
[13]
J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex fourier series", Mathematics of computation, vol. 19, no. 90, pp. 297--301, 1965.
[14]
A. Al Badawi, B. Veeravalli, C. F. Mun, and K. M. M. Aung, "High-performance fv somewhat homomorphic encryption on GPUs: An implementation using CUDA", IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 70--95, 2018.
[15]
H. Chen, K. Han, Z. Huang, A. Jalali, and K. Laine, "Simple encrypted arithmetic library v2. 3.0", Microsoft Research, December, 2017.
[16]
S. Kim, W. Jung, J. Park, and J. H. Ahn, "Accelerating number theoretic transformations for bootstrappable homomorphic encryption on GPUs", in 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2020, pp. 264--275.
[17]
W. T. Cochran, J. W. Cooley, D. L. Favin, H. D. Helms, R. A. Kaenel, W. W. Lang, G. C. Maling, D. E. Nelson, C. M. Rader, and P. D. Welch, "What is the fast fourier transform?" Proceedings of the IEEE, vol. 55, no. 10, pp. 1664--1674, 1967.
[18]
W. Dai and B. Sunar, "cuhe: A homomorphic encryption accelerator library", in International Conference on Cryptography and Information Security in the Balkans. Springer, 2015, pp. 169--186.
[19]
X. Cui, Y. Chen, and H. Mei, "Improving performance of matrix multiplication and fft on GPU", in 2009 15th International Conference on Parallel and Distributed Systems. IEEE, 2009, pp. 42--48.
[20]
O. Fialka and M. Cadik, "FFT and convolution performance in image filtering on GPU", in Tenth International Conference on Information Visualisation (IV'06). IEEE, 2006, pp. 609--614.
[21]
K. Moreland and E. Angel, "The FFT on a GPU", in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, 2003, pp. 112--119.
[22]
J. L. Mitchell, M. Y. Ansari, and E. Hart, "Advanced image processing with directx® 9 pixel shaders", ShaderX2, pp. 439--464, 2004.
[23]
N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli, "High performance discrete fourier transforms on graphics processors", in SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Ieee, 2008, pp. 1--12.
[24]
Y. Chen, X. Cui, and H. Mei, "Large-scale FFT on GPU clusters", in Proceedings of the 24th ACM International Conference on Supercomputing, 2010, pp. 315--324.
[25]
S. Keskin, E. Erdil, and T. Kocak, "An efficient parallel implementation of 3D-FFT on gpu", in Proc. IEEE High Perform. Extreme Comput. Conf., 2017, pp. 1--4.
[26]
A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka, "Bandwidth intensive 3-D FFT kernel for gpus using CUDA", in SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE, 2008, pp. 1--11.
[27]
S. Kim, W. Jung, J. Park, and J. H. Ahn, "Accelerating number theoretic transformations for bootstrappable homomorphic encryption on GPUs", in 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2020, pp. 264--275.
[28]
S. Hatfield, M. Chantry, P. Düben, and T. Palmer, "Accelerating high-resolution weather models with deep-learning hardware", in Proceedings of the Platform for Advanced Scientific Computing Conference, 2019, pp. 1--11.
[29]
C. A. Navarro, R. Carrasco, R. J. Barrientos, J. A. Riquelme, and R. Vega, "Gpu tensor cores for fast arithmetic reductions", arXiv preprint arXiv:2001.05585, 2020.
[30]
A. Dakkak, C. Li, J. Xiong, I. Gelado, and W.-m. Hwu, "Accelerating reduction and scan using tensor core units", in Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 46--57.
[31]
A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers", in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 603--613.
[32]
S. Sioutas, S. Stuijk, T. Basten, L. Somers, and H. Corporaal, "Programming tensor cores from an image processing dsl", in Proceedings of the 23th International Workshop on Software and Compilers for Embedded Systems, 2020, pp. 36--41.
[33]
A. Abdelfattah, S. Tomov, and J. Dongarra, "Matrix multiplication on batches of small matrices in half and half-complex precisions", Journal of Parallel and Distributed Computing, vol. 145, pp. 188--201, 2020.
[34]
N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A gpu performance evaluation", arXiv preprint arXiv:1412.7580, 2014.
[35]
A. Sorna, X. Cheng, E. D'azevedo, K. Won, and S. Tomov, "Optimizing the fast fourier transform using mixed precision on tensor core hardware", in 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW). IEEE, 2018, pp. 3--7.
[36]
B. Li, S. Cheng, and J. Lin, "tcfft: Accelerating half-precision fft through tensor cores", 2021.
[37]
S. Durrani, M. S. Chughtai, A. Dakkak, W.-m. Hwu, and L. Rauchwerger, "Fft blitz: the tensor cores strike back", in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 488--489.
[38]
S. H. K. Durrani, "Utilizing gpu tensor cores for algorithmic acceleration", 2020.
[39]
C. M. Rader, "Discrete fourier transforms when the number of data samples is prime", Proceedings of the IEEE, vol. 56, no. 6, pp. 1107--1108, 1968.
[40]
L. Bluestein, "A linear filtering approach to the computation of discrete fourier transform", IEEE Transactions on Audio and Electroacoustics, vol. 18, no. 4, pp. 451--455, 1970.

Cited By

View all
  • (2024)Pimacolaba: Collaborative Acceleration for FFT on Commercial Processing-In-Memory ArchitecturesProceedings of the International Symposium on Memory Systems10.1145/3695794.3695796(13-25)Online publication date: 30-Sep-2024
  • (2024)M3XU: Achieving High-Precision and Complex Matrix Multiplication with Low-Precision MXUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00016(1-16)Online publication date: 17-Nov-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PACT '21: Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques
September 2021
370 pages
ISBN:9781665442787
  • Editor:
  • Jaejin Lee

Sponsors

Publisher

IEEE Press

Publication History

Published: 26 November 2024

Check for updates

Author Tags

  1. DFT
  2. FFT
  3. GPU
  4. NTT
  5. Tensor Cores
  6. cyphertext
  7. homomorphic encryption

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PACT '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)4
Reflects downloads up to 13 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Pimacolaba: Collaborative Acceleration for FFT on Commercial Processing-In-Memory ArchitecturesProceedings of the International Symposium on Memory Systems10.1145/3695794.3695796(13-25)Online publication date: 30-Sep-2024
  • (2024)M3XU: Achieving High-Precision and Complex Matrix Multiplication with Low-Precision MXUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00016(1-16)Online publication date: 17-Nov-2024

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media