A scalable wide-issue clustered VLIW with a reconfigurable interconnect

Authors:

Osvaldo Colavin,

Davide Rizzo

CASES '03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems

Pages 148 - 158

https://doi.org/10.1145/951710.951731

Published: 30 October 2003

Abstract

Clustered VLIW architectures have been widely adopted in modern embedded multimedia applications because they can exploit high degrees of ILP with a reasonable trade-off in complexity and silicon cost. Studies have shown, however, that performance scales poorly on wide-issue machines. In this paper we describe the architecture of a clustered VLIW with a runtime-reconfigurable inter-cluster bus designed to address this scalability problem. The architecture targets kernel-loop acceleration through a coprocessor approach and allows the interconnect between neighboring register files to be customized before each loop executes. We adopt an inter-cluster communication mechanism based on a constant-complexity interconnect: because its complexity and latency are independent of the number of clusters, scalability in issue width is preserved. To handle the limited connectivity, the interconnection resources of the inter-cluster bus are exposed to the compiler and scheduled like any other resource by an adapted version of modulo scheduling. Other relevant features include the ability to define shifting queues in the register files, for more effective software-pipelining support. Adding a limited amount of reconfigurability to the well-established VLIW programming model yields low-overhead inter-cluster communication and a scalable ILP architecture. Simulation results show that near-linear scalability can be achieved for certain classes of kernel loops.
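
The abstract describes treating inter-cluster links as ordinary resources in a modulo schedule. The small Python sketch below illustrates that idea only. It is not the authors' adapted algorithm, which this page does not describe; it is a generic modulo reservation table in which a link is reserved the same way as a functional unit. The machine description (`alu0`, `alu1`, `link01`), the operation names, and the helpers `res_mii` and `first_free` are all hypothetical, and dependence handling is reduced to hand-picked earliest cycles.

```python
from math import ceil


def res_mii(uses, capacity):
    """Resource-constrained lower bound on the initiation interval (II).

    uses[r] is how many times one loop iteration needs resource r;
    capacity[r] is how many units of r exist.
    """
    return max(ceil(n / capacity[r]) for r, n in uses.items())


class ModuloReservationTable:
    """Tracks resource usage per slot, where slot = cycle mod II."""

    def __init__(self, ii, capacity):
        self.ii = ii
        self.capacity = capacity
        self.used = {r: [0] * ii for r in capacity}

    def try_reserve(self, resource, cycle):
        slot = cycle % self.ii
        if self.used[resource][slot] >= self.capacity[resource]:
            return False
        self.used[resource][slot] += 1
        return True

    def first_free(self, resource, earliest):
        # Only II consecutive cycles need checking: after that the slots repeat.
        for cycle in range(earliest, earliest + self.ii):
            if self.try_reserve(resource, cycle):
                return cycle
        return None


# Hypothetical machine: two clusters with one ALU each, and a single
# link between the two neighboring register files.
capacity = {"alu0": 1, "alu1": 1, "link01": 1}
# Hypothetical loop body: two ops per cluster, one value crossing the link.
uses = {"alu0": 2, "alu1": 2, "link01": 1}

ii = res_mii(uses, capacity)
mrt = ModuloReservationTable(ii, capacity)

# Earliest cycles stand in for data-dependence constraints (assumed values).
schedule = {
    "add_a": mrt.first_free("alu0", 0),
    "add_b": mrt.first_free("alu0", 0),
    "copy": mrt.first_free("link01", 2),   # inter-cluster transfer
    "mul_c": mrt.first_free("alu1", 3),
    "mul_d": mrt.first_free("alu1", 3),
}
print(ii, schedule)
```

Because the link appears in the reservation table like any other unit, a transfer that cannot find a free slot fails in the same way an over-subscribed ALU would, which is the point at which a real scheduler would raise II or pick a different cluster assignment.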

Index Terms

A scalable wide-issue clustered VLIW with a reconfigurable interconnect
1. Computer systems organization
  1. Architectures

Published In

CASES '03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems

October 2003

340 pages

ISBN:1581136765

DOI:10.1145/951710

General Chairs: Jaime Moreno (IBM Research), Praveen Murthy (Fujitsu Labs of America)

Program Chairs: Tom Conte (North Carolina State University), Paolo Faraboschi (HP Labs)

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CASES03

CASES03: 2003 International Conference on Compilers, Architectures and Synthesis for Embedded Systems

October 30 - November 1, 2003

San Jose, California, USA

Acceptance Rates

CASES '03 Paper Acceptance Rate 31 of 162 submissions, 19%;

Overall Acceptance Rate 52 of 230 submissions, 23%

Bibliometrics

Total Citations: 8

Total Downloads: 774

Downloads (Last 12 months): 1

Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Jan 2025

Citations

Cited By

Naresh V, Gope D, Lipasti M (2017). The CURE. ACM Transactions on Embedded Computing Systems 16:5s (1-19). Online publication date: 27-Sep-2017
https://dl.acm.org/doi/10.1145/3126527
Wan L, Dong C, Chen D (2012). A coarse-grained reconfigurable architecture with compilation for high performance. International Journal of Reconfigurable Computing 2012 (3-3). Online publication date: 1-Jan-2012
https://dl.acm.org/doi/10.1155/2012/163542
Catthoor F, Raghavan P, Lambrechts A, Jayapala M, Kritikakou A, Absar J (2010). Clustered L0 (Loop) Buffer Organization and Combination with Data Clusters. Ultra-Low Energy Domain-Specific Instruction-Set Processors (115-141). Online publication date: 3-Jul-2010
https://doi.org/10.1007/978-90-481-9528-2_5
Yazdani S, Cambonie J, Pottier B (2008). Reconfigurable multimedia accelerator for mobile systems. 2008 IEEE International SOC Conference (287-290). Online publication date: Sep-2008
https://doi.org/10.1109/SOCC.2008.4641529
Yan S, Lin B (2007). Stream execution on wide-issue clustered VLIW architectures. ACM SIGPLAN Notices 42:7 (158-160). Online publication date: 13-Jun-2007
https://dl.acm.org/doi/10.1145/1273444.1254797
Yan S, Lin B, Pande S, Li Z (2007). Stream execution on wide-issue clustered VLIW architectures. Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems (158-160). Online publication date: 13-Jun-2007
https://dl.acm.org/doi/10.1145/1254766.1254797
Terechko A, Corporaal H (2007). Inter-cluster communication in VLIW architectures. ACM Transactions on Architecture and Code Optimization 4:2 (11-es). Online publication date: 1-Jun-2007
https://dl.acm.org/doi/10.1145/1250727.1250731
Sykora M, Pavoni D, Cambonie J, Costa R, Reghizzi S (2007). Hierarchical Cluster Assignment for Coarse-Grain Reconfigurable Coprocessors. 2007 IEEE International Parallel and Distributed Processing Symposium (1-8). Online publication date: Mar-2007
https://doi.org/10.1109/IPDPS.2007.370381
