DOI: 10.1145/1815961.1815968
research-article

Understanding sources of inefficiency in general-purpose chips

Published: 19 June 2010

Abstract

Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost-effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units.
The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (hundreds of fJ in 90 nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing hundreds of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
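
As a rough consistency check on the figures quoted above, the sketch below composes the improvement factors from the abstract. It assumes that the broadly applicable gains and the algorithm-specific gains multiply (the abstract says "an additional 25x", which suggests this but does not state it outright). All constant and function names are illustrative and do not come from the paper.

```python
# Rounded figures quoted in the abstract.
ASIC_ENERGY_GAIN = 500.0   # ASIC vs. original four-processor CMP (energy)
BROAD_PERF_GAIN = 10.0     # broadly applicable optimizations (performance)
BROAD_ENERGY_GAIN = 7.0    # broadly applicable optimizations (energy)
SPECIFIC_GAIN = 25.0       # algorithm-specific units (performance and energy)


def compose(*gains):
    """Multiply improvement factors, assuming the steps are independent."""
    total = 1.0
    for gain in gains:
        total *= gain
    return total


# Cumulative gains of the customized CMP over the original CMP (assumed multiplicative).
energy_gain = compose(BROAD_ENERGY_GAIN, SPECIFIC_GAIN)
perf_gain = compose(BROAD_PERF_GAIN, SPECIFIC_GAIN)

# Energy gap that would remain between the customized CMP and the ASIC.
remaining_gap = ASIC_ENERGY_GAIN / energy_gain

print(f"cumulative energy gain: {energy_gain:.0f}x")
print(f"cumulative performance gain: {perf_gain:.0f}x")
print(f"remaining energy gap to ASIC: {remaining_gap:.2f}x")
```

Under this assumption the customized CMP recovers about 175x of the 500x energy gap, leaving a factor of roughly 2.9, which is consistent with the stated "within 3x of its energy".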

Information

Published In

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
June 2010
520 pages
ISBN:9781450300537
DOI:10.1145/1815961
  • ACM SIGARCH Computer Architecture News, Volume 38, Issue 3
    ISCA '10
    June 2010
    508 pages
    ISSN: 0163-5964
    DOI: 10.1145/1816038
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ASIC
  2. chip multiprocessor
  3. customization
  4. energy efficiency
  5. h.264
  6. high performance
  7. tensilica

Qualifiers

  • Research-article

Conference

ISCA '10
Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Article Metrics

  • Downloads (last 12 months): 411
  • Downloads (last 6 weeks): 84
Reflects downloads up to 11 Dec 2024

Cited By

  • (2024) ExaFlexHH: an exascale-ready, flexible multi-FPGA library for biologically plausible brain simulations. Frontiers in Neuroinformatics 18. DOI: 10.3389/fninf.2024.1330875. Online publication date: 12-Apr-2024
  • (2024) Canalis: A Throughput-Optimized Framework for Real-Time Stream Processing of Wireless Communication. ACM Transactions on Reconfigurable Technology and Systems 17:4 (1-32). DOI: 10.1145/3695880. Online publication date: 18-Sep-2024
  • (2024) Leveraging Difference Recurrence Relations for High-Performance GPU Genome Alignment. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (133-143). DOI: 10.1145/3656019.3676894. Online publication date: 14-Oct-2024
  • (2024) SLIDEX: A Novel Architecture for Sliding Window Processing. Proceedings of the 38th ACM International Conference on Supercomputing (312-323). DOI: 10.1145/3650200.3656613. Online publication date: 30-May-2024
  • (2024) Application-level Validation of Accelerator Designs Using a Formal Software/Hardware Interface. ACM Transactions on Design Automation of Electronic Systems 29:2 (1-25). DOI: 10.1145/3639051. Online publication date: 14-Feb-2024
  • (2024) Impact of Process Variation in Spin–Orbit Torque-Based Magnetic Tunnel Junctions on the Performance of Spiking Neural Networks. IEEE Transactions on Electron Devices 71:11 (6672-6679). DOI: 10.1109/TED.2024.3456075. Online publication date: Nov-2024
  • (2024) A Graph-Based Method for AI Chip Operator Optimization Using Deep Learning. 2024 IEEE 6th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC) (1882-1895). DOI: 10.1109/IMCEC59810.2024.10575202. Online publication date: 24-May-2024
  • (2024) Multifunctional In‐Memory Analog‐to‐Digital Converter for Next‐Gen Compute‐in‐Memory Systems. Advanced Intelligent Systems. DOI: 10.1002/aisy.202400594. Online publication date: 24-Nov-2024
  • (2023) A CMOS Image Readout Circuit with On-Chip Defective Pixel Detection and Correction. Sensors 23:2 (934). DOI: 10.3390/s23020934. Online publication date: 13-Jan-2023
  • (2023) Deep Learning Architecture Improvement Based on Dynamic Pruning and Layer Fusion. Electronics 12:5 (1208). DOI: 10.3390/electronics12051208. Online publication date: 2-Mar-2023
