More Web Proxy on the site http://driver.im/

Article

Automatic Thread Extraction with Decoupled Software Pipelining

Authors:

Guilherme Ottoni,

David I. AugustAuthors Info & Claims

MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture

Pages 105 - 118

https://doi.org/10.1109/MICRO.2005.13

Published: 12 November 2005 Publication History

Publisher Site Get Access

Abstract

Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have lead microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the finegrained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.

References

[1]

{1} J. N. Amaral, G. Gao, E. D. Kocalar, P. O'Neill, and X. Tang. Design and implementation of an efficient thread partitioning algorithm. In Proceedings of the International Symposium on High Performance Computing, pages 252-259, 2000.

Digital Library

[2]

{2} D. I. August, D. A. Connors, S. A. Mahlke, J. W. Sias, K. M. Crozier, B. Cheng, P. R. Eaton, Q. B. Olaniran, and W. W. Hwu. Integrated predication and speculative execution in the IMPACT EPIC architecture. In Proceedings of the 25th International Symposium on Computer Architecture, pages 227-237, June 1998.

Digital Library

[3]

{3} R. D. Barnes, E. M. Nystrom, J. W. Sias, S. J. Patel, N. Navarro, and W. W. Hwu. Beating in-order stalls with 'flea-flicker' two-pass pipelining. In Proceedings of the 36th International Symposium on Microarchitecture, December 2003.

Digital Library

[4]

{4} M. E. Benitez and J. W. Davidson. Code generation for streaming: An access/execute mechanism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, United States, April 1991.

Digital Library

[5]

{5} A. Bhowmik and M. Franklin. A general compiler framework for speculative multithreading. In Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures, pages 99-108, 2002.

Digital Library

[6]

{6} D.-K. Chen. Compiler Optimizations for Parallel Loops with Fine-grained Synchronization. PhD thesis, Department of Computer Science, University of Illinois, Urbana, IL, 1994.

Digital Library

[7]

{7} B.-C. Cheng and W. W. Hwu. Modular interprocedural pointer analysis using access paths: design, implementation, and evaluation. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 57-69, 2000.

Digital Library

[8]

{8} M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W H Freeman & Co, New York, NY, 1979.

Digital Library

[9]

{9} M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In ASPLOS-X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, pages 291-303, 2002.

Digital Library

[10]

{10} B. Guo, M. J. Bridges, S. Triantafyllis, G. Ottoni, E. Raman, and D. I. August. Practical and accurate low-level pointer analysis. In Proceedings of the 3rd International Symposium on Code Generation and Optimization, March 2005.

Digital Library

[11]

{11} Intel Corporation. Intel Itanium 2 Processor Reference Manual: For Software Development and Optimization. Santa Clara, CA, 2002.

[12]

{12} T. A. Johnson, R. Eigenmann, and T. N. Vijaykumar. Min-cut program decomposition for thread-level speculation. In Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, pages 59-70, 2004.

Digital Library

[13]

{13} K. Kennedy and J. R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., 2002.

Digital Library

[14]

{14} D. Kim and D. Yeung. A study of source-level compiler algorithms for automatic construction of pre-execution code. ACM Trans. Comput. Syst., 22(3):326-379, 2004.

Digital Library

[15]

{15} M. S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pages 318-328, June 1988.

Digital Library

[16]

{16} C. Lee, M. Potkonjak, and W. Mangione-Smith. Mediabench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 330-335, December 1997.

Digital Library

[17]

{17} W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. P. Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw Machine. In The Proceedings of the Eighth International Confrence on Architectural Support for Programming Languages and Operating Systems, pages 46-57, 1998.

Digital Library

[18]

{18} S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, W. W. Hwu, P. P. Chang, and T. Kiyohara. Compiler code transformations for superscalar-based high-performance systems. In Proceedings of Supercomputing '92, pages 808-817, November 1992.

Digital Library

[19]

{19} D. A. Penry,M. Vachharajani, and D. I. August. Rapid development of a flexible validated processor model. In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation (MOBS), June 2005.

[20]

{20} R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 177-188, September 2004.

Digital Library

[21]

{21} K. Rich and M. Farrens. Code partitioning in decoupled compilers. In Proceedings of the 6th European Conference on Parallel Processing, pages 1008-1017, Munich, Germany, September 2000.

Digital Library

[22]

{22} J. W. Sias, S. zee Ueng, G. A. Kent, I. M. Steiner, E. M. Nystrom, and W. mei W. Hwu. Field-testing IMPACT EPIC research results in Itanium 2. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.

Digital Library

[23]

{23} J. E. Smith. Decoupled access/execute computer architectures. In Proceedings of the 9th International Symposium on Computer Architecture, pages 112-119, April 1982.

Digital Library

[24]

{24} G. S. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22th International Symposium on Computer Architecture, June 1995.

Digital Library

[25]

{25} J. G. Steffan and T. C. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In The 4th International Symposium on High-Performance Computer Architecture, pages 2-13, February 1998.

Digital Library

[26]

{26} M. Sung, R. Krashinsky, and K. Asanovic. Multithreading decoupled architectures for complexity-effective general purpose computing. SIGARCH Comput. Archit. News, 29(5):56-61, 2001.

Digital Library

[27]

{27} R. E. Tarjan. Depth first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146-160, 1972.

[28]

{28} M. B. Taylor, W. Lee, S. P. Amarasinghe, and A. Agarwal. Scalar operand networks. IEEE Transactions on Parallel and Distributed Systems, 16(2):145-162, February 2005.

Digital Library

[29]

{29} J.-Y. Tsai, J. Huang, C. Amlo, D. J. Lilja, and P.-C. Yew. The superthreaded processor architecture. IEEE Transactions on Computers, 48(9):881-902, 1999.

Digital Library

[30]

{30} M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome, and D. I. August. Microarchitectural exploration with Liberty. In Proceedings of the 35th International Symposium on Microarchitecture (MICRO), pages 271-282, November 2002.

Digital Library

[31]

{31} W. A. Wulf. Evaluation of the wm architecture. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 382-390, Queensland, Australia, May 1992.

Digital Library

[32]

{32} C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th International Symposium on Microarchitecture, November 2002.

Digital Library

Cited By

Zhang YLiu TSun ZChen ZLi XZuo ZJust RFraser G(2023)Catamaran: Low-Overhead Memory Safety Enforcement via Parallel AccelerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598098(816-828)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598098
Ma JWang WNelson ACuevas MHomerding BLiu CHuang ZCampanoni SHale KDinda Pde Supinski BHall MGamblin T(2021)Paths to OpenMP in the kernelProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476183(1-17)Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476183
Vasiladiotis CLozano RCole MFranke BLee J(2021)Loop parallelization using dynamic commutativity analysisProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370319(150-161)Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370319
Show More Cited By

Index Terms

Automatic Thread Extraction with Decoupled Software Pipelining
1. Computer systems organization
  1. Architectures
    1. Serial architectures
      1. Pipeline computing
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Multiprocessing / multiprogramming / multitasking

Recommendations

Parallel-stage decoupled software pipelining
CGO '08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization

In recent years, the microprocessor industry has embraced chip multiprocessors (CMPs), also known as multi-core architectures, as the dominant design paradigm. For existing and new applications to make effective use of CMPs, it is desirable that ...
Decoupled Software Pipelining with the Synchronization Array
PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques

Despite the success of instruction-level parallelism (ILP) optimizations in increasing the performance of microprocessors, certain codes remain elusive. In particular, codes containing recursive data structure (RDS) traversal loops have been largely ...
Speculative Decoupled Software Pipelining
PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques

In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture

November 2005

350 pages

ISBN:0769524400

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Publication History

Published: 12 November 2005

Check for updates

Qualifiers

Article

Conference

Micro-38

Sponsor:

SIGMICRO

Micro-38: The 38th Annual IEEE/ACM International Symposium on Microarchitecture

November 12 - 16, 2005

Barcelona, Spain

Acceptance Rates

MICRO 38 Paper Acceptance Rate 29 of 147 submissions, 20%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

111
Total Citations
View Citations
67
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)3

Reflects downloads up to 10 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhang YLiu TSun ZChen ZLi XZuo ZJust RFraser G(2023)Catamaran: Low-Overhead Memory Safety Enforcement via Parallel AccelerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598098(816-828)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598098
Ma JWang WNelson ACuevas MHomerding BLiu CHuang ZCampanoni SHale KDinda Pde Supinski BHall MGamblin T(2021)Paths to OpenMP in the kernelProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476183(1-17)Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476183
Vasiladiotis CLozano RCole MFranke BLee J(2021)Loop parallelization using dynamic commutativity analysisProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370319(150-161)Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370319
Apostolakis SXu ZChan GCampanoni SAugust DLarus JCeze LStrauss K(2020)PerspectiveProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3373376.3378458(351-367)Online publication date: 9-Mar-2020
https://dl.acm.org/doi/10.1145/3373376.3378458
Zhou RJones TKandemir MJimborean AMoseley T(2019)Janus: statically-driven and profile-guided automatic dynamic binary parallelisationProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314877(15-25)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314877
Mastoras AGross T(2019)Efficient and Scalable Execution of Fine-Grained Dynamic Linear PipelinesACM Transactions on Architecture and Code Optimization10.1145/330741116:2(1-26)Online publication date: 18-Apr-2019
https://dl.acm.org/doi/10.1145/3307411
Tran KJimborean ACarlson TKoukos KSjälander MKaxiras S(2018)SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order coresACM SIGPLAN Notices10.1145/3296979.319239353:4(328-343)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3296979.3192393
Deiana ESt-Amour VDinda PHardavellas NCampanoni S(2018)Unconventional Parallelization of Nondeterministic ApplicationsACM SIGPLAN Notices10.1145/3296957.317318153:2(432-447)Online publication date: 19-Mar-2018
https://dl.acm.org/doi/10.1145/3296957.3173181
Kiriansky VXu HRinard MAmarasinghe SEvripidou SStenström PO'Boyle M(2018)Cimple: instruction and memory level parallelismProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243185(1-16)Online publication date: 1-Nov-2018
https://dl.acm.org/doi/10.1145/3243176.3243185
Yviquel HCruz LAraujo G(2018)Cluster Programming using the OpenMP Accelerator ModelACM Transactions on Architecture and Code Optimization10.1145/322611215:3(1-23)Online publication date: 28-Aug-2018
https://dl.acm.org/doi/10.1145/3226112
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents