DOI: 10.1145/951710.951731
Article

A scalable wide-issue clustered VLIW with a reconfigurable interconnect

Published: 30 October 2003

Abstract

Clustered VLIW architectures have been widely adopted in modern embedded multimedia applications for their ability to exploit high degrees of ILP with a reasonable trade-off in complexity and silicon cost. Studies have, however, shown limited performance scaling for wide-issue machines. In this paper we describe the architecture of a clustered VLIW with a runtime-reconfigurable inter-cluster bus designed to address this scalability problem. The architecture targets kernel-loop acceleration through a coprocessor approach and allows the interconnect between neighboring register files to be customized before each loop execution. We have adopted an inter-cluster communication mechanism based on a constant-complexity interconnect. Because its complexity and latency are independent of the number of clusters, scalability in issue width is preserved. To handle the limited connectivity, the interconnection resources in the inter-cluster bus are exposed to the compiler and scheduled like other resources with an adapted version of modulo scheduling. Other relevant features include the capability to define shifting queues in the register files for more effective software pipelining support. Adding a limited amount of reconfigurability to the well-established VLIW programming model results in low-overhead inter-cluster communication and a scalable ILP architecture. Simulation results show that we can achieve near-linear scalability for certain classes of kernel loops.
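
The idea of scheduling interconnect resources "like other resources" under modulo scheduling can be illustrated with a generic modulo reservation table. The sketch below is not the authors' scheduler: the names `res_mii` and `ModuloReservationTable`, the resource labels `alu` and `link_0_1`, and all usage counts are hypothetical, chosen only to show how an inter-cluster link might be booked in the same table as a functional unit.

```python
from math import ceil


def res_mii(uses, capacity):
    """Resource-constrained lower bound on the initiation interval (II).

    uses[r]     = how many times one loop iteration needs resource r
    capacity[r] = how many units of resource r exist per cycle
    """
    return max(ceil(uses[r] / capacity[r]) for r in uses)


class ModuloReservationTable:
    """Tracks resource usage per cycle modulo II (generic textbook form)."""

    def __init__(self, ii, capacity):
        self.ii = ii
        self.capacity = capacity
        # One row of II slots for every resource, links included.
        self.table = {r: [0] * ii for r in capacity}

    def try_reserve(self, resource, cycle):
        """Book `resource` at `cycle`; return False if that slot is full."""
        slot = cycle % self.ii
        if self.table[resource][slot] >= self.capacity[resource]:
            return False
        self.table[resource][slot] += 1
        return True


# Hypothetical machine: 4 ALUs and a single link between clusters 0 and 1.
capacity = {"alu": 4, "link_0_1": 1}
# Hypothetical loop body: 6 ALU operations and 3 inter-cluster transfers.
uses = {"alu": 6, "link_0_1": 3}

ii = res_mii(uses, capacity)  # the link, not the ALUs, bounds II here
mrt = ModuloReservationTable(ii, capacity)

first = mrt.try_reserve("link_0_1", 1)  # slot 1 is free
clash = mrt.try_reserve("link_0_1", 4)  # 4 mod II maps to slot 1 again
later = mrt.try_reserve("link_0_1", 5)  # slot 2 is still free
```

In this toy setup the single link needs three slots per iteration, so it sets the lower bound on the initiation interval even though ALU capacity alone would allow a shorter one. A real scheduler would also account for recurrence constraints and operation latencies, which are omitted here.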

Published In

CASES '03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
October 2003
340 pages
ISBN: 1581136765
DOI: 10.1145/951710
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. IDCT
  2. clustered VLIW
  3. modulo scheduling
  4. reconfigurable co-processor (RCP)

Qualifiers

  • Article

Conference

CASES03

Acceptance Rates

CASES '03 Paper Acceptance Rate 31 of 162 submissions, 19%;
Overall Acceptance Rate 52 of 230 submissions, 23%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 0

Reflects downloads up to 13 Jan 2025.

Cited By

  • (2017) The CURE. ACM Transactions on Embedded Computing Systems 16:5s (1-19). DOI: 10.1145/3126527. Online publication date: 27-Sep-2017
  • (2012) A coarse-grained reconfigurable architecture with compilation for high performance. International Journal of Reconfigurable Computing 2012 (3-3). DOI: 10.1155/2012/163542. Online publication date: 1-Jan-2012
  • (2010) Clustered L0 (Loop) Buffer Organization and Combination with Data Clusters. Ultra-Low Energy Domain-Specific Instruction-Set Processors (115-141). DOI: 10.1007/978-90-481-9528-2_5. Online publication date: 3-Jul-2010
  • (2008) Reconfigurable multimedia accelerator for mobile systems. 2008 IEEE International SOC Conference (287-290). DOI: 10.1109/SOCC.2008.4641529. Online publication date: Sep-2008
  • (2007) Stream execution on wide-issue clustered VLIW architectures. ACM SIGPLAN Notices 42:7 (158-160). DOI: 10.1145/1273444.1254797. Online publication date: 13-Jun-2007
  • (2007) Stream execution on wide-issue clustered VLIW architectures. Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems (158-160). DOI: 10.1145/1254766.1254797. Online publication date: 13-Jun-2007
  • (2007) Inter-cluster communication in VLIW architectures. ACM Transactions on Architecture and Code Optimization 4:2 (11-es). DOI: 10.1145/1250727.1250731. Online publication date: 1-Jun-2007
  • (2007) Hierarchical Cluster Assignment for Coarse-Grain Reconfigurable Coprocessors. 2007 IEEE International Parallel and Distributed Processing Symposium (1-8). DOI: 10.1109/IPDPS.2007.370381. Online publication date: Mar-2007
