[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors

Published: 01 January 2001 Publication History

Abstract

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on two commercial symmetric multiprocessors (SMP) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and translation-lookaside buffer (TLB) cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.

References

[1]
D. F. Bacon, S. L. Graham, and O. J. Sharp, Compiler transformations for high performance computing, ACM Computing Surveys, 26 (1994), pp. 345–420.
[2]
D. H. Bailey, FFTs in external or hierarchical memory, J. Supercomputing, 4 (1990), pp. 23–35.
[3]
B. Bershad, D. Lee, T. Romer, and B. Chen, Avoiding conflict misses dynamically in large direct‐mapped caches, in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 1994, pp. 158–170.
[4]
M. Cekleov and M. Dubois, Virtual‐address caches, IEEE Micro, 17 (1997), pp. 64–71.
[5]
James Cooley, John Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comp., 19 (1965), 297–301
[6]
A. Edelman, Optimal matrix transpose and bit reversal on hypercube: All‐to‐all personalized communication, J. Parallel Distrib. Comput., 11 (1991), pp. 328–331.
[7]
David Evans, An improved digit‐reversal permutation algorithm for the fast Fourier and Hartley transforms, IEEE Trans. Acoust. Speech Signal Process., 35 (1987), 1120–1125
[8]
K. S. Gatlin and L. Carter, Memory hierarchy considerations for fast transpose and bit‐reversals, in Proceedings of Fifth International Symposium on High‐Performance Computer Architecture, 1999, pp. 33–44.
[9]
S. Johnsson, Ching‐Tien Ho, Algorithms for matrix transposition on Boolean N‐cube configured ensemble architectures, SIAM J. Matrix Anal. Appl., 9 (1988), 419–454
[10]
IEEE, POSIX P1003.4a: Threads Extension for Portable Operating Systems, IEEE Press, Piscataway, NJ, 1994.
[11]
Alan Karp, Bit reversal on uniprocessors, SIAM Rev., 38 (1996), 1–26
[12]
J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan‐Kaufmann, San Francisco, 1996.
[13]
N. P. Jouppi, Improving direct‐mapped cache performance by the addition of a small fully‐associative cache and prefetch buffers, in Proceedings of 17th Annual International Symposium on Computer Architecture, 1990, pp. 364–373.
[14]
L. McVoy and C. Staelin, lmbench: Portable tools for performance analysis, in Proceedings of the 1996 USENIX Technical Conference, 1996, pp. 279–295.
[15]
P. N. Swarztrauber, FFT algorithms for vector computers, Parallel Comput., 1 (1984), pp. 45–63.
[16]
C. Rivera and C.‐W. Tseng, Data transformations for eliminating conflict misses, in Proceedings of the ACM SIGPLAN‘98 Conference on Programming Language Design and Implementation, 1998, pp. 38–49.
[17]
M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod, Using the SimOS machine simulator to study complex computer systems, ACM Transactions on Modeling and Computer Simulation, 7 (1997), pp. 78–103.
[18]
L. Xiao, X. Zhang, and S. A. Kubricht,Improving memory performance of sorting algorithms, 5 (2000), pp. 1–23.
[19]
Y. Yan, X. Zhang, and Z. Zhang, Cacheminer: A runtime approach to exploit cache locality on SMP, IEEE Trans. Parallel Distrib. Systems, 11 (2000), pp. 357–374.
[20]
C. Zhang, X. Zhang, and Y. Yan, Two fast and high‐associativity cache schemes, IEEE Micro, 17 (1997), pp. 40–49.

Cited By

View all
  • (2010)Performance of parallel bit-reversal with cilk and UPC for fast fourier transformProceedings of the 5th international conference on Advances in Grid and Pervasive Computing10.1007/978-3-642-13067-0_26(224-233)Online publication date: 10-May-2010
  • (2007)Optimal bit-reversal using vector permutationsProceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures10.1145/1248377.1248411(198-199)Online publication date: 9-Jun-2007
  • (2006)Combining analytical and empirical approaches in tuning matrix transpositionProceedings of the 15th international conference on Parallel architectures and compilation techniques10.1145/1152154.1152190(233-242)Online publication date: 16-Sep-2006
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image SIAM Journal on Scientific Computing
SIAM Journal on Scientific Computing  Volume 22, Issue 6
2001
377 pages

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 01 January 2001

Author Tags

  1. 68P05
  2. 65Y20
  3. 65Y05

Author Tags

  1. cache optimizations
  2. memory hierarchy
  3. bit-reversals
  4. shared-memory multiprocessors
  5. parallel computing

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2010)Performance of parallel bit-reversal with cilk and UPC for fast fourier transformProceedings of the 5th international conference on Advances in Grid and Pervasive Computing10.1007/978-3-642-13067-0_26(224-233)Online publication date: 10-May-2010
  • (2007)Optimal bit-reversal using vector permutationsProceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures10.1145/1248377.1248411(198-199)Online publication date: 9-Jun-2007
  • (2006)Combining analytical and empirical approaches in tuning matrix transpositionProceedings of the 15th international conference on Parallel architectures and compilation techniques10.1145/1152154.1152190(233-242)Online publication date: 16-Sep-2006
  • (2002)Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory DemandsIEEE Transactions on Parallel and Distributed Systems10.1109/71.99320413:3(223-240)Online publication date: 1-Mar-2002

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media