More Web Proxy on the site http://driver.im/

research-article

Value State Flow Graph: A Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware

Authors:

Ali Mustafa Zaidi,

David GreavesAuthors Info & Claims

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 9, Issue 2

Article No.: 14, Pages 1 - 22

https://doi.org/10.1145/2807702

Published: 04 December 2015 Publication History

Abstract

Although custom (and reconfigurable) computing can provide orders-of-magnitude improvements in energy efficiency and performance for many numeric, data-parallel applications, performance on nonnumeric, sequential code is often worse than conventional superscalar processors. This work attempts to improve sequential performance in custom hardware by (a) switching from a statically scheduled to a dynamically scheduled (dataflow) execution model and (b) developing a new compiler IR for high-level synthesis—the value state flow graph (VSFG)—that enables aggressive exposition of ILP even in the presence of complex control flow. Compared to existing control-data flow graph (CDFG)-based IRs, the VSFG exposes more instruction-level parallelism from control-intensive sequential code by exploiting aggressive speculation, enabling control dependence analysis, as well as execution along multiple flows of control. This new IR is directly implemented as a static-dataflow graph in hardware by our prototype high-level synthesis tool chain and shows an average speedup of 1.13× over equivalent hardware generated using LegUp, an existing CDFG-based HLS tool. Furthermore, the VSFG allows us to further trade area and energy for performance through loop unrolling, increasing the average speedup to 1.55×, with a peak speedup of 4.05×. Our VSFG-based hardware approaches the sequential cycle counts of an Intel Nehalem Core i7 processor while consuming only 0.25× the energy of an in-order Altera Nios IIf processor.

References

[1]

Volker Baumgarte, Gerd Ehlers, Frank May, Armin Nückel, Martin Vorbach, and Markus Weinhardt. 2003. PACT XPP—a self-reconfigurable data processing architecture. Journal of Supercomputing 26, 2, 167--184.

Digital Library

[2]

Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, and Krisztián Flautner. 2010. Evolution of thread-level parallelism in desktop applications. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). 302--313.

Digital Library

[3]

Mihai Budiu. 2003. Spatial Computation. Ph.D. Dissertation. Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. http://www.cs.cmu.edu/&sim;mihaib/research/thesis.pdf Technical report CMU-CS-03-217

Digital Library

[4]

Mihai Budiu, Pedro V. Artigas, and Seth C. Goldstein. 2005. Dataflow: A complement to superscalar. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’05). 177--186.

Digital Library

[5]

Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. 2004. Spatial computation. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI). 14--26.

Digital Library

[6]

Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, Mike Dahlin, Lizy K. John, Calvin Lin, Charles R. Moore, James Burrill, Robert G. McDonald, William Yoder, and the TRIPS Team. 2004. Scaling to the end of silicon with edge architectures. Computer 37, 7, 44--55.

Digital Library

[7]

Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’11). 33--36.

Digital Library

[8]

Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’11).

Digital Library

[9]

Philippe Coussy and Adam Morawiec. 2008. High-Level Synthesis: From Algorithm to Digital Circuit. Springer.

Digital Library

[10]

Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA’11). 365--376.

Digital Library

[11]

Hagen Gädke and Andreas Koch. 2008. Accelerating speculative execution in high-level synthesis with cancel tokens. In Proceedings of the 4th International Workshop on Reconfigurable Computing: Architectures, Tools, and Applications (ARC’08). Springer-Verlag, Berlin, 185--195.

Digital Library

[12]

Guang R. Gao. 1989. Algorithmic aspects of balancing techniques for pipelined data flow code generation. Journal of Parallel and Distributed Computing 6, 1, 39--61.

Digital Library

[13]

Guang R. Gao. 1991. Algorithmic aspects of pipeline balancing. In A Code Mapping Scheme for Dataflow Software Pipelining. The Kluwer International Series in Engineering and Computer Science, Vol. 125. Springer, 41--59.

[14]

Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the 17th International Conference on High Performance Computer Architecture (HPCA’11).

Digital Library

[15]

Ed Grochowski and Murali Annavaram. 2006. Energy Per Instruction Trends in Intel® Microprocessors. Retrieved November 10, 2015, from http://www.intel.com/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf.

[16]

Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alexandru Nicolau. 2004a. Loop shifting and compaction for the high-level synthesis of designs with complex control flow. In Proceedings of the Conference on Design, Automation, and Test in Europe—Volume 1 (DATE’04). 114--119.

Digital Library

[17]

Sumit Gupta, Rajesh Kumar Gupta, Nikil D. Dutt, and Alexandru Nicolau. 2004b. Coordinated parallelizing compiler optimizations and high-level synthesis. ACM Transactions on Design Automation of Electronic Systems 9, 4, 441--470.

Digital Library

[18]

Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th International Symposium on Computer Architecture (ISCA’10). 37--47.

Digital Library

[19]

Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. 2008. CHStone: A benchmark program suite for practical C-based high-level synthesis. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’08). 1192--1195.

[20]

Chao Huang, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. 2007. Generation of heterogeneous distributed architectures for memory-intensive applications through high-level synthesis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15, 11, 1191--1204.

Digital Library

[21]

Xiangyang Liang, Minh Nguyen, and Hao Che. 2013. Wimpy or brawny cores: A throughput perspective. Journal of Parallel and Distributed Computing 73, 10, 1351--1361.

Digital Library

[22]

Wen-Mei Hwu, Shane Ryoo, Sain-Zee Ueng, John H. Kelm, Isaac Gelado, Sam S. Stone, Robert E. Kidd, Sara S. Baghsorkhi, Aqeel A. Mahesri, Stephanie C. Tsao, Nacho Navarro, Steve S. Lumetta, Matthew I. Frank, and Sanjay J. Patel. 2007. Implicitly parallel programming models for thousand-core microprocessors. In Proceedings of the 44th Annual Design Automation Conference (DAC’07). 754--759.

Digital Library

[23]

Neil Johnson and Alan Mycroft. 2003. Combined code motion and register allocation using the value state dependence graph. In Proceedings of the 12th International Conference on Compiler Construction (CC’03). 1--16.

Digital Library

[24]

Changkyu Kim, Simha Sethumadhavan, Madhu S. Govindan, Nitya Ranganathan, Divya Gulati, Doug Burger, and Stephen W. Keckler. 2007. Composable lightweight processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40). 381--394.

Digital Library

[25]

Srikanth Kurra, Neeraj Kumar Singh, and Preeti Ranjan Panda. 2007. The impact of loop unrolling on controller delay in high level synthesis. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’07). 391--396.

Digital Library

[26]

Monica S. Lam and Robert P. Wilson. 1992. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA’92). 46--57.

Digital Library

[27]

Alan C. Lawrence. 2007. Optimizing Compilation with the Value State Dependence Graph. Technical Report UCAM-CL-TR-705. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-705.pdf.

[28]

Edward A. Lee. 2006. The problem with threads. Computer 39, 5, 33--42.

Digital Library

[29]

Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. 1992. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25). 45--54.

Digital Library

[30]

Jonathan Mak and Alan Mycroft. 2009. Limits of parallelism using dynamic dependence graphs. In Proceedings of the 7th International Workshop on Dynamic Analysis (WODA’09). 42--48.

Digital Library

[31]

Daniel S. McFarlin, Charles Tucker, and Craig Zilles. 2013. Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism? In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). 241--252.

Digital Library

[32]

Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, and Mihai. Budiu. 2006. Tartan: Evaluating spatial computation for whole program execution. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 163--174.

Digital Library

[33]

Rishiyur Nikhil. 2004. Bluespec system verilog: Efficient, correct RTL from high level specifications. In Proceedings of the 2nd ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE’04). 69--70.

Digital Library

[34]

Andrew R. Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. 2008. CHiMPS: A high-level compilation flow for hybrid CPU-FPGA architectures. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA’08). 261.

Digital Library

[35]

Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, and Michael Bedford Taylor. 2011. Efficient complex operators for irregular codes. In Proceedings of the Conference on High Performance Computing Architecture (HPCA’11).

Digital Library

[36]

Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. 2004. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. ACM Transactions on Architecture and Code Optimization 1, 1, 62--93.

Digital Library

[37]

Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 1995. Multiscalar processors. ACM SIGARCH Computer Architecture News 23, 2, 414--425.

Digital Library

[38]

Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson, Mark Oskin, and Susan J. Eggers. 2007. The wavescalar architecture. ACM Transactions on Computer Systems 25, 2, Article No. 4.

Digital Library

[39]

Girish Venkataramani, Tobias Bjerregaard, Tiberiu Chelcea, and Seth C. Goldstein. 2006. Hardware compilation of application-specific memory-access interconnect. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems 25, 5, 756--771.

Digital Library

[40]

Girish Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. ACM SIGARCH Computer Architecture News 38, 1, 205--218.

Digital Library

[41]

Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown, and Mikel Luján. 2013. Optimizing software runtime systems for speculative parallelization. ACM Transactions on Architecture and Code Optimization 9, 4, Article No. 39.

Digital Library

[42]

Ali Mustafa Zaidi. 2015. Accelerating Control-Flow Intensive Code in Spatial Hardware. Technical Report UCAM-CL-TR-870. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-870.pdf.

[43]

Ali Mustafa Zaidi and David J. Greaves. 2014. A new dataflow compiler IR for accelerating control-intensive code in spatial hardware. In Proceedings of the 2014 IEEE International Parallel and Distributed Symposium Workshops (IPDPSW’14). 122--131.

Digital Library

Cited By

Reissmann NMeyer JBahmann HSjälander M(2020)RVSDGACM Transactions on Embedded Computing Systems10.1145/339190219:6(1-28)Online publication date: Dec-2020
https://doi.org/10.1145/3391902

Index Terms

Value State Flow Graph: A Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware

Recommendations

Compiler-Assisted Multiple Instruction Word Retry for VLIW Architectures

Very Long Instruction Word (VLIW) architectures can enhance performance by exploiting fine-grained instruction level parallelism. In this paper, we describe a compiler assisted multiple instruction word retry scheme for VLIW architectures. A read buffer ...
Tuning the continual flow pipeline architecture with virtual register renaming

Continual Flow Pipelines (CFPs) allow a processor core to process hundreds of in-flight instructions without increasing cycle-critical pipeline resources. When a load misses the data cache, CFP checkpoints the processor register state and then moves all ...
Exposing ILP in custom hardware with a dataflow compiler IR
PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Reconfigurable Technology and Systems

ACM Transactions on Reconfigurable Technology and Systems Volume 9, Issue 2

Special Section on RAW2014

February 2016

146 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/2854101

Editor:
Steve Wilton
Department of Electrical and Computer Engineering/University of British Columbia/Kaiser

Issue’s Table of Contents

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 December 2015

Accepted: 01 July 2015

Revised: 01 May 2015

Received: 01 November 2014

Published in TRETS Volume 9, Issue 2

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

C3D:Communication Centric Computer Design
UK EPSRC

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
306
Total Downloads

Downloads (Last 12 months)14
Downloads (Last 6 weeks)2

Reflects downloads up to 09 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Reissmann NMeyer JBahmann HSjälander M(2020)RVSDGACM Transactions on Embedded Computing Systems10.1145/339190219:6(1-28)Online publication date: Dec-2020
https://doi.org/10.1145/3391902

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents