[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Value State Flow Graph: A Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware

Published: 04 December 2015 Publication History

Abstract

Although custom (and reconfigurable) computing can provide orders-of-magnitude improvements in energy efficiency and performance for many numeric, data-parallel applications, performance on nonnumeric, sequential code is often worse than conventional superscalar processors. This work attempts to improve sequential performance in custom hardware by (a) switching from a statically scheduled to a dynamically scheduled (dataflow) execution model and (b) developing a new compiler IR for high-level synthesis—the value state flow graph (VSFG)—that enables aggressive exposition of ILP even in the presence of complex control flow. Compared to existing control-data flow graph (CDFG)-based IRs, the VSFG exposes more instruction-level parallelism from control-intensive sequential code by exploiting aggressive speculation, enabling control dependence analysis, as well as execution along multiple flows of control. This new IR is directly implemented as a static-dataflow graph in hardware by our prototype high-level synthesis tool chain and shows an average speedup of 1.13× over equivalent hardware generated using LegUp, an existing CDFG-based HLS tool. Furthermore, the VSFG allows us to further trade area and energy for performance through loop unrolling, increasing the average speedup to 1.55×, with a peak speedup of 4.05×. Our VSFG-based hardware approaches the sequential cycle counts of an Intel Nehalem Core i7 processor while consuming only 0.25× the energy of an in-order Altera Nios IIf processor.

References

[1]
Volker Baumgarte, Gerd Ehlers, Frank May, Armin Nückel, Martin Vorbach, and Markus Weinhardt. 2003. PACT XPP—a self-reconfigurable data processing architecture. Journal of Supercomputing 26, 2, 167--184.
[2]
Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, and Krisztián Flautner. 2010. Evolution of thread-level parallelism in desktop applications. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). 302--313.
[3]
Mihai Budiu. 2003. Spatial Computation. Ph.D. Dissertation. Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. http://www.cs.cmu.edu/∼mihaib/research/thesis.pdf Technical report CMU-CS-03-217
[4]
Mihai Budiu, Pedro V. Artigas, and Seth C. Goldstein. 2005. Dataflow: A complement to superscalar. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’05). 177--186.
[5]
Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. 2004. Spatial computation. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI). 14--26.
[6]
Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, Mike Dahlin, Lizy K. John, Calvin Lin, Charles R. Moore, James Burrill, Robert G. McDonald, William Yoder, and the TRIPS Team. 2004. Scaling to the end of silicon with edge architectures. Computer 37, 7, 44--55.
[7]
Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’11). 33--36.
[8]
Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’11).
[9]
Philippe Coussy and Adam Morawiec. 2008. High-Level Synthesis: From Algorithm to Digital Circuit. Springer.
[10]
Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA’11). 365--376.
[11]
Hagen Gädke and Andreas Koch. 2008. Accelerating speculative execution in high-level synthesis with cancel tokens. In Proceedings of the 4th International Workshop on Reconfigurable Computing: Architectures, Tools, and Applications (ARC’08). Springer-Verlag, Berlin, 185--195.
[12]
Guang R. Gao. 1989. Algorithmic aspects of balancing techniques for pipelined data flow code generation. Journal of Parallel and Distributed Computing 6, 1, 39--61.
[13]
Guang R. Gao. 1991. Algorithmic aspects of pipeline balancing. In A Code Mapping Scheme for Dataflow Software Pipelining. The Kluwer International Series in Engineering and Computer Science, Vol. 125. Springer, 41--59.
[14]
Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the 17th International Conference on High Performance Computer Architecture (HPCA’11).
[15]
Ed Grochowski and Murali Annavaram. 2006. Energy Per Instruction Trends in Intel® Microprocessors. Retrieved November 10, 2015, from http://www.intel.com/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf.
[16]
Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alexandru Nicolau. 2004a. Loop shifting and compaction for the high-level synthesis of designs with complex control flow. In Proceedings of the Conference on Design, Automation, and Test in Europe—Volume 1 (DATE’04). 114--119.
[17]
Sumit Gupta, Rajesh Kumar Gupta, Nikil D. Dutt, and Alexandru Nicolau. 2004b. Coordinated parallelizing compiler optimizations and high-level synthesis. ACM Transactions on Design Automation of Electronic Systems 9, 4, 441--470.
[18]
Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th International Symposium on Computer Architecture (ISCA’10). 37--47.
[19]
Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. 2008. CHStone: A benchmark program suite for practical C-based high-level synthesis. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’08). 1192--1195.
[20]
Chao Huang, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. 2007. Generation of heterogeneous distributed architectures for memory-intensive applications through high-level synthesis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15, 11, 1191--1204.
[21]
Xiangyang Liang, Minh Nguyen, and Hao Che. 2013. Wimpy or brawny cores: A throughput perspective. Journal of Parallel and Distributed Computing 73, 10, 1351--1361.
[22]
Wen-Mei Hwu, Shane Ryoo, Sain-Zee Ueng, John H. Kelm, Isaac Gelado, Sam S. Stone, Robert E. Kidd, Sara S. Baghsorkhi, Aqeel A. Mahesri, Stephanie C. Tsao, Nacho Navarro, Steve S. Lumetta, Matthew I. Frank, and Sanjay J. Patel. 2007. Implicitly parallel programming models for thousand-core microprocessors. In Proceedings of the 44th Annual Design Automation Conference (DAC’07). 754--759.
[23]
Neil Johnson and Alan Mycroft. 2003. Combined code motion and register allocation using the value state dependence graph. In Proceedings of the 12th International Conference on Compiler Construction (CC’03). 1--16.
[24]
Changkyu Kim, Simha Sethumadhavan, Madhu S. Govindan, Nitya Ranganathan, Divya Gulati, Doug Burger, and Stephen W. Keckler. 2007. Composable lightweight processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40). 381--394.
[25]
Srikanth Kurra, Neeraj Kumar Singh, and Preeti Ranjan Panda. 2007. The impact of loop unrolling on controller delay in high level synthesis. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’07). 391--396.
[26]
Monica S. Lam and Robert P. Wilson. 1992. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA’92). 46--57.
[27]
Alan C. Lawrence. 2007. Optimizing Compilation with the Value State Dependence Graph. Technical Report UCAM-CL-TR-705. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-705.pdf.
[28]
Edward A. Lee. 2006. The problem with threads. Computer 39, 5, 33--42.
[29]
Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. 1992. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25). 45--54.
[30]
Jonathan Mak and Alan Mycroft. 2009. Limits of parallelism using dynamic dependence graphs. In Proceedings of the 7th International Workshop on Dynamic Analysis (WODA’09). 42--48.
[31]
Daniel S. McFarlin, Charles Tucker, and Craig Zilles. 2013. Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism? In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). 241--252.
[32]
Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, and Mihai. Budiu. 2006. Tartan: Evaluating spatial computation for whole program execution. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 163--174.
[33]
Rishiyur Nikhil. 2004. Bluespec system verilog: Efficient, correct RTL from high level specifications. In Proceedings of the 2nd ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE’04). 69--70.
[34]
Andrew R. Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. 2008. CHiMPS: A high-level compilation flow for hybrid CPU-FPGA architectures. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA’08). 261.
[35]
Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, and Michael Bedford Taylor. 2011. Efficient complex operators for irregular codes. In Proceedings of the Conference on High Performance Computing Architecture (HPCA’11).
[36]
Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. 2004. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. ACM Transactions on Architecture and Code Optimization 1, 1, 62--93.
[37]
Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 1995. Multiscalar processors. ACM SIGARCH Computer Architecture News 23, 2, 414--425.
[38]
Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson, Mark Oskin, and Susan J. Eggers. 2007. The wavescalar architecture. ACM Transactions on Computer Systems 25, 2, Article No. 4.
[39]
Girish Venkataramani, Tobias Bjerregaard, Tiberiu Chelcea, and Seth C. Goldstein. 2006. Hardware compilation of application-specific memory-access interconnect. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems 25, 5, 756--771.
[40]
Girish Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. ACM SIGARCH Computer Architecture News 38, 1, 205--218.
[41]
Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown, and Mikel Luján. 2013. Optimizing software runtime systems for speculative parallelization. ACM Transactions on Architecture and Code Optimization 9, 4, Article No. 39.
[42]
Ali Mustafa Zaidi. 2015. Accelerating Control-Flow Intensive Code in Spatial Hardware. Technical Report UCAM-CL-TR-870. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-870.pdf.
[43]
Ali Mustafa Zaidi and David J. Greaves. 2014. A new dataflow compiler IR for accelerating control-intensive code in spatial hardware. In Proceedings of the 2014 IEEE International Parallel and Distributed Symposium Workshops (IPDPSW’14). 122--131.

Cited By

View all
  • (2020)RVSDGACM Transactions on Embedded Computing Systems10.1145/339190219:6(1-28)Online publication date: Dec-2020

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Reconfigurable Technology and Systems
ACM Transactions on Reconfigurable Technology and Systems  Volume 9, Issue 2
Special Section on RAW2014
February 2016
146 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/2854101
  • Editor:
  • Steve Wilton
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 December 2015
Accepted: 01 July 2015
Revised: 01 May 2015
Received: 01 November 2014
Published in TRETS Volume 9, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Dark silicon
  2. amdahl’s law
  3. compilers
  4. custom computing
  5. high-level synthesis
  6. instruction level parallelism
  7. reconfigurable computing

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • C3D:Communication Centric Computer Design
  • UK EPSRC

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)14
  • Downloads (Last 6 weeks)2
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2020)RVSDGACM Transactions on Embedded Computing Systems10.1145/339190219:6(1-28)Online publication date: Dec-2020

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media