Inter-cluster communication in VLIW architectures

Published: 01 June 2007

Abstract

The traditional VLIW (very long instruction word) architecture with a single register file does not scale well to meet the growing performance demands on embedded media processors. However, splitting a VLIW processor into smaller clusters, each composed of function units fully connected to a local register file, can significantly improve the VLSI implementation characteristics of the processor, such as speed, energy consumption, and area. In this paper we show that achieving the best characteristics of a clustered VLIW requires careful selection of an inter-cluster communication (ICC) model, which is the way clustering is exposed in the instruction set architecture. We first define a taxonomy of ICC models comprising copy operations, dedicated issue slots, extended operands, extended results, and multicast. Evaluating the execution time of the models requires both the dynamic cycle count and the clock period. We developed an advanced instruction scheduler for all five ICC models in order to quantify the dynamic cycle counts of our multimedia C benchmarks. To assess the clock period of the ICC models, we designed and laid out VLIW datapaths using RTL hardware descriptions derived from a deeply pipelined commercial TriMedia processor. In contrast to prior art, our research shows that fully distributed register file architectures (with eight clusters in our study) often underperform moderately clustered machines with two or four clusters because the cycle count overhead explodes in the former. Among the evaluated ICC models, the performance of the copy operation model, popular in both academia and industry, is severely limited by copy operations hampering the scheduling of regular operations in high-ILP (instruction-level parallelism) code. The dedicated issue slots model combats this limitation by dedicating extra VLIW issue slots purely to ICC, reaching the highest execution time speedup of 1.74 relative to the unicluster. Furthermore, our VLSI experiments show that the lowest area and energy consumption, 42% and 57% relative to the unicluster, respectively, are achieved by the extended operands model, which nevertheless provides higher performance than the copy operation model.
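
The abstract evaluates execution time as the product of dynamic cycle count and clock period. The trade-off can be illustrated with a minimal Python sketch. The function names and all cycle counts and clock periods below are hypothetical, chosen only to show how a shorter clock period can be outweighed by a growing cycle count; they are not measurements from the paper.

```python
def execution_time(cycle_count: int, clock_period_ns: float) -> float:
    """Execution time in nanoseconds: dynamic cycle count times clock period."""
    return cycle_count * clock_period_ns


def speedup(base_cycles: int, base_period_ns: float,
            new_cycles: int, new_period_ns: float) -> float:
    """Execution time speedup of a clustered design relative to a baseline."""
    base = execution_time(base_cycles, base_period_ns)
    new = execution_time(new_cycles, new_period_ns)
    return base / new


# Hypothetical (cycle count, clock period in ns) pairs, for illustration only.
unicluster = (1_000_000, 2.0)
two_clusters = (1_100_000, 1.25)    # modest cycle overhead, much shorter clock
eight_clusters = (2_000_000, 1.1)   # cycle count overhead dominates

s2 = speedup(*unicluster, *two_clusters)
s8 = speedup(*unicluster, *eight_clusters)

print(f"two clusters:   speedup {s2:.2f}")
print(f"eight clusters: speedup {s8:.2f}")
```

With these made-up inputs the two-cluster design comes out faster than the unicluster, while the eight-cluster design comes out slower despite its shorter clock period. This is the kind of effect the abstract attributes to cycle count overhead in fully distributed register file architectures.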

Published In

ACM Transactions on Architecture and Code Optimization  Volume 4, Issue 2
June 2007
193 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/1250727

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Instruction-level parallelism
  2. VLIW
  3. clock frequency
  4. cluster assignment
  5. instruction scheduler
  6. intercluster communication
  7. optimizing compiler
  8. pipelining
  9. register allocation

Cited By

  • (2018) Enabling energy-proportional computing on instruction-level parallel processors. The Journal of Supercomputing 71:2 (391-447). DOI: 10.5555/2733630.2733654. Online publication date: 31-Dec-2018
  • (2014) Deadline-Constrained Clustered Scheduling for VLIW Architectures using Power-Gated Register Files. ACM Transactions on Architecture and Code Optimization (TACO) 11:2 (1-26). DOI: 10.1145/2632218. Online publication date: 15-Jul-2014
  • (2014) Enabling energy-proportional computing on instruction-level parallel processors. The Journal of Supercomputing 71:2 (391-447). DOI: 10.1007/s11227-014-1301-z. Online publication date: 5-Oct-2014
  • (2013) CAeSaR. Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (1-10). DOI: 10.5555/2555729.2555738. Online publication date: 29-Sep-2013
  • (2013) A constraint programming approach for integrated spatial and temporal scheduling for clustered architectures. ACM Transactions on Embedded Computing Systems (TECS) 13:1 (1-23). DOI: 10.1145/2512470. Online publication date: 5-Sep-2013
  • (2013) LUCAS. ACM SIGPLAN Notices 48:5 (45-54). DOI: 10.1145/2499369.2465565. Online publication date: 20-Jun-2013
  • (2013) LUCAS. Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems (45-54). DOI: 10.1145/2491899.2465565. Online publication date: 20-Jun-2013
  • (2013) LUCAS. Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems (45-54). DOI: 10.1145/2465554.2465565. Online publication date: 20-Jun-2013
  • (2013) CAeSaR: Unified cluster-assignment scheduling and communication reuse for clustered VLIW processors. 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES) (1-10). DOI: 10.1109/CASES.2013.6662513. Online publication date: Sep-2013
  • (2013) UCIFF: Unified Cluster Assignment Instruction Scheduling and Fast Frequency Selection for Heterogeneous Clustered VLIW Cores. Languages and Compilers for Parallel Computing (127-142). DOI: 10.1007/978-3-642-37658-0_9. Online publication date: 2013
