[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

HELIX-RC: an architecture-compiler co-design for automatic parallelization of irregular programs

Published: 14 June 2014 Publication History

Abstract

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks

References

[1]
Randy Allen and Ken Kennedy. Optimizing compilers for modern architectures. Morgan Kaufmann, 2002.
[2]
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing, 1988.
[3]
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas Jablin, and David I. August. Revisiting the sequential programming model for multicore. In MICRO, 2007.
[4]
Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In ISCA, 1996.
[5]
Simone Campanoni, Giovanni Agosta, Stefano Crespi Reghizzi, and Andrea Di Biagio. A Highly Flexible, Parallel Virtual Machine: Design and Experience of ILDJIT. In Software: Practice and Experience, 2010.
[6]
Simone Campanoni, Timothy M. Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In CGO, 2012.
[7]
Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David Brooks. HELIX: Making the Extraction of Thread-Level Parallelism Mainstream. In IEEE Micro, 2012.
[8]
Ramkrishna Chatterjee, Barbara G. Ryder, andWilliam A. Landi. Relevant Context Inference. In POPL, 1999.
[9]
Lynn Choi and Pen-Chung Yew. Compiler and hardware support for cache coherence in large-scale multiprocessors: Design considerations and performance study. In ISCA, 1996.
[10]
Ron Cytron. DOACROSS: Beyond vectorization for multiprocessors. In ICPP, 1986.
[11]
Alain Deutsch. A storeless model of aliasing and its abstractions using finite representations of right-regular equivalence relations. In ICCL, 1992.
[12]
Paul Gratz, Changkyu Kim, Karthikeyan Sankaralingam, Heather Hanson, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. On-Chip Interconnection Networks of the TRIPS Chip. In IEEE Micro, 2007.
[13]
Bolei Guo, Matthew J. Bridges, Spyridon Triantafyllis, Guilherme Ottoni, Easwaran Raman, and David I. August. Practical and accurate low-level pointer analysis. In CGO, 2005.
[14]
Greg Hamerly, Erez Perelman, and Brad Calder. How to use simpoint to pick simulation points. In ACM SIGMETRICS Performance Evaluation Review, 2004.
[15]
Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael K. Chen, and Kunle Olukotun. The Stanford Hydra CMP. In IEEE Micro, 2000.
[16]
Jialu Huang, Arun Raman, Thomas B. Jablin, Yun Zhang, Tzu-Han Hung, and David I. August. Decoupled software pipelining creates parallelization opportunities. In CGO, 2010.
[17]
Natalie Enright Jerger and Li-Shiuan Peh. On-Chip Networks. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[18]
Troy A. Johnson, Rudolf Eigenmann, and T. N. Vijaykumar. Speculative thread decomposition through empirical optimization. In PPoPP, 2007.
[19]
Svilen Kanev, Gu-Yeon Wei, and David Brooks. XIOSim: powerperformance modeling of mobile x86 cores. In ISLPED, 2012.
[20]
Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau, and Josep Torrellas. POSH: A TLS compiler that exploits program structure. In PPoPP, 2006.
[21]
Gabriel H Loh, Samantika Subramaniam, and Yuejian Xie. Zesto: A cycle-level simulator for highly detailed microarchitecture exploration. In ISPASS, 2009.
[22]
Stephen F. Lundstrom and George H. Barnes. A controllable MIMD architecture. In Advanced computer architecture, 1986.
[23]
Milo M. K. Martin. Token coherence. PhD thesis, University of Wisconsin- Madison, 2003.
[24]
Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. CACTI 6.0: A tool to model large caches. Technical Report 85, HP Laboratories, 2009.
[25]
Alexandru Nicolau, Guangqiang Li, and Arun Kejariwal. Techniques for efficient placement of synchronization primitives. In PPoPP, 2009.
[26]
Alexandru Nicolau, Guangqiang Li, Alexander V. Veidenbaum, and Arun Kejariwal. Synchronization optimizations for efficient execution on multicores. In ICS, 2009.
[27]
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. Automatic thread extraction with decoupled software pipelining. In MICRO, 2005.
[28]
David K. Poulsen and Pen-Chung Yew. Data prefetching and data forwarding in shared memory multiprocessors. In ICPP, 1994.
[29]
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August. Speculative parallelization using software multi-threaded transactions. In ASPLOS, 2010.
[30]
Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. Parallel-stage decoupled software pipelining. In CGO, 2008.
[31]
Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August. Performance scalability of decoupled software pipelining. In ACM TACO, 2008.
[32]
Behnam Robatmil, Dong Li, Hadi Esmaeilzadeh, Sibi Govindan, Aaron Smith, Andrew Putnam, Doug Burger, and Stephen W. Keckler. How to Implement Effective Prediction and Forwarding for Fusable Dynamic Multicore Architectures. In HPCA, 2013.
[33]
Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A Cycle Accurate Memory System Simulator. In IEEE Computer Architecture Letters, 2011.
[34]
Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In ASPLOS, 2010.
[35]
Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. In ACM TACO, 2004.
[36]
Steven L. Scott. Synchronization and Communication in the T3E Multiprocessor. In ASPLOS, 1996.
[37]
Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: a many-core x86 architecture for visual computing. In ACM Transactions on Graphics, 2008.
[38]
Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. Multiscalar processors. In ISCA, 1995.
[39]
J. Gregory Steffan, Christopher Colohan, Antonia Zhai, and Todd C. Mowry. The STAMPede approach to thread-level speculation. In ACM Transactions on Computer Systems, 2005.
[40]
J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. Improving value communication for thread-level speculation. In HPCA, 2002.
[41]
Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Ararwal. The RAW microprocessor: A computational fabric for software circuits and general-purpose programs. In IEEE Micro, 2002.
[42]
Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, and Anant Agarwal. Scalar Operand Networks. In IEEE Transactions on Parallel Distributed Systems, 2005.
[43]
Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael F. P. O'Boyle. Towards a holistic approach to auto-parallelization. In PLDI, 2009.
[44]
Rob F. van der Wijngaart, Timothy G. Mattson, and Werner Haas. Lightweight communications on Intel's single-chip cloud computer processor. In SIGOPS Operating Systems Review, 2011.
[45]
Hans Vandierendonck, Sean Rul, and Koen De Bosschere. The paralax infrastructure: Automatic parallelization with a helping hand. In PACT, 2010.
[46]
David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown, III, and Anant Agarwal. On-chip interconnection architecture of the tile processor. In IEEE Micro, 2007.
[47]
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C. Mowry. Compiler optimization of scalar value communication between speculative threads. In ASPLOS, 2002.
[48]
Antonia Zhai, J. Gregory Steffan, Christopher B. Colohan, and Todd C. Mowry. Compiler and hardware support for reducing the synchronization of speculative threads. In ACM TACO, 2008.
[49]
Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Scott Mahlke. Uncovering hidden loop level parallelism in sequential applications. In HPCA, 2008.

Cited By

View all
  • (2024)Compiling Loop-Based Nested Parallelism for Irregular WorkloadsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640405(232-250)Online publication date: 27-Apr-2024
  • (2023)Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous ProcessingACM Transactions on Computer Systems10.1145/363000741:1-4(1-30)Online publication date: 18-Dec-2023
  • (2023)Trireme: Exploration of Hierarchical Multi-level Parallelism for Hardware AccelerationACM Transactions on Embedded Computing Systems10.1145/358039422:3(1-23)Online publication date: 20-Apr-2023
  • Show More Cited By
  1. HELIX-RC: an architecture-compiler co-design for automatic parallelization of irregular programs

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM SIGARCH Computer Architecture News
    ACM SIGARCH Computer Architecture News  Volume 42, Issue 3
    ISCA '14
    June 2014
    552 pages
    ISSN:0163-5964
    DOI:10.1145/2678373
    Issue’s Table of Contents
    • cover image ACM Conferences
      ISCA '14: Proceeding of the 41st annual international symposium on Computer architecuture
      June 2014
      566 pages
      ISBN:9781479943944

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 June 2014
    Published in SIGARCH Volume 42, Issue 3

    Check for updates

    Qualifiers

    • Research-article

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)20
    • Downloads (Last 6 weeks)6
    Reflects downloads up to 03 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Compiling Loop-Based Nested Parallelism for Irregular WorkloadsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640405(232-250)Online publication date: 27-Apr-2024
    • (2023)Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous ProcessingACM Transactions on Computer Systems10.1145/363000741:1-4(1-30)Online publication date: 18-Dec-2023
    • (2023)Trireme: Exploration of Hierarchical Multi-level Parallelism for Hardware AccelerationACM Transactions on Embedded Computing Systems10.1145/358039422:3(1-23)Online publication date: 20-Apr-2023
    • (2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
    • (2022)CARAT CAKE: replacing paging via compiler/kernel cooperationProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3503222.3507771(98-114)Online publication date: 28-Feb-2022
    • (2022)A New Combination Method for Improving Parallelism in Two and Three Level Perfect Nested LoopsIEEE Access10.1109/ACCESS.2022.319048310(74542-74554)Online publication date: 2022
    • (2022)TLP: Towards three‐level loop parallelisationIET Computers & Digital Techniques10.1049/cdt2.1204616:5-6(159-171)Online publication date: 9-Aug-2022
    • (2022)Exploit the data level parallelism and schedule dependent tasks on the multi-core processorsInformation Sciences: an International Journal10.1016/j.ins.2021.10.072585:C(382-394)Online publication date: 1-Mar-2022
    • (2021)Quantifying the Semantic Gap Between Serial and Parallel Programming2021 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC53511.2021.00024(151-162)Online publication date: Nov-2021
    • (2020)CARAT: a case for virtual memory through compiler- and runtime-based address translationProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3385412.3385987(329-345)Online publication date: 11-Jun-2020
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media