More Web Proxy on the site http://driver.im/

research-article

Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors

Authors:

Xiaodong ZhangAuthors Info & Claims

SIAM Journal on Scientific Computing, Volume 22, Issue 6

Pages 2113 - 2134

https://doi.org/10.1137/S1064827599359709

Published: 01 January 2001 Publication History

Abstract

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on two commercial symmetric multiprocessors (SMP) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and translation-lookaside buffer (TLB) cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.

References

[1]

D. F. Bacon, S. L. Graham, and O. J. Sharp, Compiler transformations for high performance computing, ACM Computing Surveys, 26 (1994), pp. 345–420.

Digital Library

[2]

D. H. Bailey, FFTs in external or hierarchical memory, J. Supercomputing, 4 (1990), pp. 23–35.

Digital Library

[3]

B. Bershad, D. Lee, T. Romer, and B. Chen, Avoiding conflict misses dynamically in large direct‐mapped caches, in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 1994, pp. 158–170.

[4]

M. Cekleov and M. Dubois, Virtual‐address caches, IEEE Micro, 17 (1997), pp. 64–71.

Digital Library

[5]

James Cooley, John Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comp., 19 (1965), 297–301

[6]

A. Edelman, Optimal matrix transpose and bit reversal on hypercube: All‐to‐all personalized communication, J. Parallel Distrib. Comput., 11 (1991), pp. 328–331.

Digital Library

[7]

David Evans, An improved digit‐reversal permutation algorithm for the fast Fourier and Hartley transforms, IEEE Trans. Acoust. Speech Signal Process., 35 (1987), 1120–1125

[8]

K. S. Gatlin and L. Carter, Memory hierarchy considerations for fast transpose and bit‐reversals, in Proceedings of Fifth International Symposium on High‐Performance Computer Architecture, 1999, pp. 33–44.

[9]

S. Johnsson, Ching‐Tien Ho, Algorithms for matrix transposition on Boolean

N

‐cube configured ensemble architectures, SIAM J. Matrix Anal. Appl., 9 (1988), 419–454

Digital Library

[10]

IEEE, POSIX P1003.4a: Threads Extension for Portable Operating Systems, IEEE Press, Piscataway, NJ, 1994.

[11]

Alan Karp, Bit reversal on uniprocessors, SIAM Rev., 38 (1996), 1–26

Digital Library

[12]

J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan‐Kaufmann, San Francisco, 1996.

[13]

N. P. Jouppi, Improving direct‐mapped cache performance by the addition of a small fully‐associative cache and prefetch buffers, in Proceedings of 17th Annual International Symposium on Computer Architecture, 1990, pp. 364–373.

[14]

L. McVoy and C. Staelin, lmbench: Portable tools for performance analysis, in Proceedings of the 1996 USENIX Technical Conference, 1996, pp. 279–295.

[15]

P. N. Swarztrauber, FFT algorithms for vector computers, Parallel Comput., 1 (1984), pp. 45–63.

Digital Library

[16]

C. Rivera and C.‐W. Tseng, Data transformations for eliminating conflict misses, in Proceedings of the ACM SIGPLAN‘98 Conference on Programming Language Design and Implementation, 1998, pp. 38–49.

[17]

M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod, Using the SimOS machine simulator to study complex computer systems, ACM Transactions on Modeling and Computer Simulation, 7 (1997), pp. 78–103.

Digital Library

[18]

L. Xiao, X. Zhang, and S. A. Kubricht,Improving memory performance of sorting algorithms, 5 (2000), pp. 1–23.

[19]

Y. Yan, X. Zhang, and Z. Zhang, Cacheminer: A runtime approach to exploit cache locality on SMP, IEEE Trans. Parallel Distrib. Systems, 11 (2000), pp. 357–374.

Digital Library

[20]

C. Zhang, X. Zhang, and Y. Yan, Two fast and high‐associativity cache schemes, IEEE Micro, 17 (1997), pp. 40–49.

Digital Library

Cited By

Weng THuang SLiau WLi K(2010)Performance of parallel bit-reversal with cilk and UPC for fast fourier transformProceedings of the 5th international conference on Advances in Grid and Pervasive Computing10.1007/978-3-642-13067-0_26(224-233)Online publication date: 10-May-2010
https://dl.acm.org/doi/10.1007/978-3-642-13067-0_26
Lokhmotov AMycroft AGibbons PScheideler C(2007)Optimal bit-reversal using vector permutationsProceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures10.1145/1248377.1248411(198-199)Online publication date: 9-Jun-2007
https://dl.acm.org/doi/10.1145/1248377.1248411
Lu QKrishnamoorthy SSadayappan PAltman ESkadron KZorn B(2006)Combining analytical and empirical approaches in tuning matrix transpositionProceedings of the 15th international conference on Parallel architectures and compilation techniques10.1145/1152154.1152190(233-242)Online publication date: 16-Sep-2006
https://dl.acm.org/doi/10.1145/1152154.1152190
Show More Cited By

Index Terms

Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors

Index terms have been assigned to the content through auto-classification.

Recommendations

Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors
HPCA '95: Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture

We study the relative efficiency of previously proposed stride and sequential prefetching-two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of ...
Write buffer design for cache-coherent shared-memory multiprocessors
ICCD '95: Proceedings of the 1995 International Conference on Computer Design: VLSI in Computers and Processors

We evaluate the performance impact of two different write-buffer configurations (one word per buffer entry and one block per buffer entry) and two different write policies (write-through and write-back), when using the partial block invalidation ...
Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors
ICPP '97: Proceedings of the international Conference on Parallel Processing

The performance of programs consisting of parallel loops on shared-memory multiprocessors is limited by long memory latencies as processor speeds increase more rapidly than memory speeds. Two complementary techniques for addressing memory latency and ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image SIAM Journal on Scientific Computing

SIAM Journal on Scientific Computing Volume 22, Issue 6

2001

377 pages

ISSN:1064-8275

Issue’s Table of Contents

Copyright © 2001 Society for Industrial and Applied Mathematics.

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 01 January 2001

Author Tags

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 09 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Weng THuang SLiau WLi K(2010)Performance of parallel bit-reversal with cilk and UPC for fast fourier transformProceedings of the 5th international conference on Advances in Grid and Pervasive Computing10.1007/978-3-642-13067-0_26(224-233)Online publication date: 10-May-2010
https://dl.acm.org/doi/10.1007/978-3-642-13067-0_26
Lokhmotov AMycroft AGibbons PScheideler C(2007)Optimal bit-reversal using vector permutationsProceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures10.1145/1248377.1248411(198-199)Online publication date: 9-Jun-2007
https://dl.acm.org/doi/10.1145/1248377.1248411
Lu QKrishnamoorthy SSadayappan PAltman ESkadron KZorn B(2006)Combining analytical and empirical approaches in tuning matrix transpositionProceedings of the 15th international conference on Parallel architectures and compilation techniques10.1145/1152154.1152190(233-242)Online publication date: 16-Sep-2006
https://dl.acm.org/doi/10.1145/1152154.1152190
Xiao LChen SZhang X(2002)Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory DemandsIEEE Transactions on Parallel and Distributed Systems10.1109/71.99320413:3(223-240)Online publication date: 1-Mar-2002
https://dl.acm.org/doi/10.1109/71.993204

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents