[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1109/MICRO.2005.13acmconferencesArticle/Chapter ViewAbstractPublication PagesmicroConference Proceedingsconference-collections
Article

Automatic Thread Extraction with Decoupled Software Pipelining

Published: 12 November 2005 Publication History

Abstract

Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have lead microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the finegrained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.

References

[1]
{1} J. N. Amaral, G. Gao, E. D. Kocalar, P. O'Neill, and X. Tang. Design and implementation of an efficient thread partitioning algorithm. In Proceedings of the International Symposium on High Performance Computing, pages 252-259, 2000.
[2]
{2} D. I. August, D. A. Connors, S. A. Mahlke, J. W. Sias, K. M. Crozier, B. Cheng, P. R. Eaton, Q. B. Olaniran, and W. W. Hwu. Integrated predication and speculative execution in the IMPACT EPIC architecture. In Proceedings of the 25th International Symposium on Computer Architecture, pages 227-237, June 1998.
[3]
{3} R. D. Barnes, E. M. Nystrom, J. W. Sias, S. J. Patel, N. Navarro, and W. W. Hwu. Beating in-order stalls with 'flea-flicker' two-pass pipelining. In Proceedings of the 36th International Symposium on Microarchitecture, December 2003.
[4]
{4} M. E. Benitez and J. W. Davidson. Code generation for streaming: An access/execute mechanism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, United States, April 1991.
[5]
{5} A. Bhowmik and M. Franklin. A general compiler framework for speculative multithreading. In Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures, pages 99-108, 2002.
[6]
{6} D.-K. Chen. Compiler Optimizations for Parallel Loops with Fine-grained Synchronization. PhD thesis, Department of Computer Science, University of Illinois, Urbana, IL, 1994.
[7]
{7} B.-C. Cheng and W. W. Hwu. Modular interprocedural pointer analysis using access paths: design, implementation, and evaluation. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 57-69, 2000.
[8]
{8} M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W H Freeman & Co, New York, NY, 1979.
[9]
{9} M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In ASPLOS-X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, pages 291-303, 2002.
[10]
{10} B. Guo, M. J. Bridges, S. Triantafyllis, G. Ottoni, E. Raman, and D. I. August. Practical and accurate low-level pointer analysis. In Proceedings of the 3rd International Symposium on Code Generation and Optimization, March 2005.
[11]
{11} Intel Corporation. Intel Itanium 2 Processor Reference Manual: For Software Development and Optimization. Santa Clara, CA, 2002.
[12]
{12} T. A. Johnson, R. Eigenmann, and T. N. Vijaykumar. Min-cut program decomposition for thread-level speculation. In Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, pages 59-70, 2004.
[13]
{13} K. Kennedy and J. R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., 2002.
[14]
{14} D. Kim and D. Yeung. A study of source-level compiler algorithms for automatic construction of pre-execution code. ACM Trans. Comput. Syst., 22(3):326-379, 2004.
[15]
{15} M. S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pages 318-328, June 1988.
[16]
{16} C. Lee, M. Potkonjak, and W. Mangione-Smith. Mediabench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 330-335, December 1997.
[17]
{17} W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. P. Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw Machine. In The Proceedings of the Eighth International Confrence on Architectural Support for Programming Languages and Operating Systems, pages 46-57, 1998.
[18]
{18} S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, W. W. Hwu, P. P. Chang, and T. Kiyohara. Compiler code transformations for superscalar-based high-performance systems. In Proceedings of Supercomputing '92, pages 808-817, November 1992.
[19]
{19} D. A. Penry,M. Vachharajani, and D. I. August. Rapid development of a flexible validated processor model. In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation (MOBS), June 2005.
[20]
{20} R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 177-188, September 2004.
[21]
{21} K. Rich and M. Farrens. Code partitioning in decoupled compilers. In Proceedings of the 6th European Conference on Parallel Processing, pages 1008-1017, Munich, Germany, September 2000.
[22]
{22} J. W. Sias, S. zee Ueng, G. A. Kent, I. M. Steiner, E. M. Nystrom, and W. mei W. Hwu. Field-testing IMPACT EPIC research results in Itanium 2. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.
[23]
{23} J. E. Smith. Decoupled access/execute computer architectures. In Proceedings of the 9th International Symposium on Computer Architecture, pages 112-119, April 1982.
[24]
{24} G. S. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22th International Symposium on Computer Architecture, June 1995.
[25]
{25} J. G. Steffan and T. C. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In The 4th International Symposium on High-Performance Computer Architecture, pages 2-13, February 1998.
[26]
{26} M. Sung, R. Krashinsky, and K. Asanovic. Multithreading decoupled architectures for complexity-effective general purpose computing. SIGARCH Comput. Archit. News, 29(5):56-61, 2001.
[27]
{27} R. E. Tarjan. Depth first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146-160, 1972.
[28]
{28} M. B. Taylor, W. Lee, S. P. Amarasinghe, and A. Agarwal. Scalar operand networks. IEEE Transactions on Parallel and Distributed Systems, 16(2):145-162, February 2005.
[29]
{29} J.-Y. Tsai, J. Huang, C. Amlo, D. J. Lilja, and P.-C. Yew. The superthreaded processor architecture. IEEE Transactions on Computers, 48(9):881-902, 1999.
[30]
{30} M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome, and D. I. August. Microarchitectural exploration with Liberty. In Proceedings of the 35th International Symposium on Microarchitecture (MICRO), pages 271-282, November 2002.
[31]
{31} W. A. Wulf. Evaluation of the wm architecture. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 382-390, Queensland, Australia, May 1992.
[32]
{32} C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th International Symposium on Microarchitecture, November 2002.

Cited By

View all
  • (2023)Catamaran: Low-Overhead Memory Safety Enforcement via Parallel AccelerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598098(816-828)Online publication date: 12-Jul-2023
  • (2021)Paths to OpenMP in the kernelProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476183(1-17)Online publication date: 14-Nov-2021
  • (2021)Loop parallelization using dynamic commutativity analysisProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370319(150-161)Online publication date: 27-Feb-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
November 2005
350 pages
ISBN:0769524400

Sponsors

Publisher

IEEE Computer Society

United States

Publication History

Published: 12 November 2005

Check for updates

Qualifiers

  • Article

Conference

Micro-38
Sponsor:

Acceptance Rates

MICRO 38 Paper Acceptance Rate 29 of 147 submissions, 20%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)12
  • Downloads (Last 6 weeks)3
Reflects downloads up to 10 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Catamaran: Low-Overhead Memory Safety Enforcement via Parallel AccelerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598098(816-828)Online publication date: 12-Jul-2023
  • (2021)Paths to OpenMP in the kernelProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476183(1-17)Online publication date: 14-Nov-2021
  • (2021)Loop parallelization using dynamic commutativity analysisProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370319(150-161)Online publication date: 27-Feb-2021
  • (2020)PerspectiveProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3373376.3378458(351-367)Online publication date: 9-Mar-2020
  • (2019)Janus: statically-driven and profile-guided automatic dynamic binary parallelisationProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314877(15-25)Online publication date: 16-Feb-2019
  • (2019)Efficient and Scalable Execution of Fine-Grained Dynamic Linear PipelinesACM Transactions on Architecture and Code Optimization10.1145/330741116:2(1-26)Online publication date: 18-Apr-2019
  • (2018)SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order coresACM SIGPLAN Notices10.1145/3296979.319239353:4(328-343)Online publication date: 11-Jun-2018
  • (2018)Unconventional Parallelization of Nondeterministic ApplicationsACM SIGPLAN Notices10.1145/3296957.317318153:2(432-447)Online publication date: 19-Mar-2018
  • (2018)Cimple: instruction and memory level parallelismProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243185(1-16)Online publication date: 1-Nov-2018
  • (2018)Cluster Programming using the OpenMP Accelerator ModelACM Transactions on Architecture and Code Optimization10.1145/322611215:3(1-23)Online publication date: 28-Aug-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media