DOI: 10.1145/1815961.1815976
research-article

Aérgia: exploiting packet latency slack in on-chip networks

Published: 19 June 2010

Abstract

Traditional Networks-on-Chip (NoCs) employ simple arbitration strategies, such as round-robin or oldest-first, to decide which packets should be prioritized in the network. This is counter-intuitive since different packets can have very different effects on system performance due to, e.g., different levels of memory-level parallelism (MLP) of applications. Certain packets may be performance-critical because they cause the processor to stall, whereas others may be delayed for a number of cycles with no effect on application-level performance as their latencies are hidden by other outstanding packets' latencies. In this paper, we define slack as a key measure that characterizes the relative importance of a packet. Specifically, the slack of a packet is the number of cycles the packet can be delayed in the network with no effect on execution time. This paper proposes new router prioritization policies that exploit the available slack of interfering packets in order to accelerate performance-critical packets and thus improve overall system performance. When two packets interfere with each other in a router, the packet with the lower slack value is prioritized. We describe mechanisms to estimate slack, prevent starvation, and combine slack-based prioritization with other recently proposed application-aware prioritization mechanisms.
We evaluate slack-based prioritization policies on a 64-core CMP with an 8x8 mesh NoC using a suite of 35 diverse applications. For a representative set of case studies, our proposed policy increases average system throughput by 21.0% over the commonly used round-robin policy. Averaged over 56 randomly generated multiprogrammed workload mixes, the proposed policy improves system throughput by 10.3%, while also reducing application-level unfairness by 30.8%.
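
The core rule stated in the abstract (when two packets contend in a router, the one with the lower slack wins) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Packet` fields, the `arbitrate` function, the oldest-first tie-break, and the age-based starvation guard with its `STARVATION_AGE` threshold are all hypothetical names and choices made for illustration. The abstract only says that starvation is prevented, not how.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    packet_id: int
    slack: int  # estimated cycles the packet can be delayed with no effect on execution time
    age: int    # cycles the packet has already spent in the network


# Hypothetical threshold for the illustrative starvation guard (not from the paper).
STARVATION_AGE = 1000


def arbitrate(contenders, starvation_age=STARVATION_AGE):
    """Return the packet that wins a contended router port.

    Assumed policy: packets older than `starvation_age` are served first so
    that high-slack packets are not delayed forever; otherwise the packet with
    the lowest slack wins, with older packets winning ties.
    """
    if not contenders:
        raise ValueError("no packets to arbitrate")
    starving = [p for p in contenders if p.age >= starvation_age]
    pool = starving if starving else contenders
    # Lower slack first, then older first, then lower id for determinism.
    return min(pool, key=lambda p: (p.slack, -p.age, p.packet_id))


critical = Packet(packet_id=1, slack=0, age=5)
relaxed = Packet(packet_id=2, slack=40, age=50)
winner = arbitrate([critical, relaxed])
```

In this example `winner` is the packet with zero slack, even though the other packet has waited longer; an oldest-first arbiter would have chosen the opposite.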

Information & Contributors

Information

Published In

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
June 2010
520 pages
ISBN: 9781450300537
DOI: 10.1145/1815961
  • ACM SIGARCH Computer Architecture News, Volume 38, Issue 3
    ISCA '10
    June 2010
    508 pages
    ISSN: 0163-5964
    DOI: 10.1145/1816038
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. arbitration
  2. memory systems
  3. multi-core
  4. on-chip networks
  5. packet scheduling
  6. prioritization

Qualifiers

  • Research-article

Conference

ISCA '10

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 15
  • Downloads (last 6 weeks): 4

Reflects downloads up to 22 Dec 2024

Cited By

  • (2024) TacVar: Tackling Variability in Short-Interval Timing Measurements on X86 Processors. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 496-506. DOI: 10.1109/CCGrid59990.2024.00062. Online publication date: 6-May-2024
  • (2021) Intelligent Architectures for Intelligent Computing Systems. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 318-323. DOI: 10.23919/DATE51398.2021.9474073. Online publication date: 1-Feb-2021
  • (2020) Fifer. Proceedings of the 21st International Middleware Conference, pp. 280-295. DOI: 10.1145/3423211.3425683. Online publication date: 7-Dec-2020
  • (2020) Experiences with ML-Driven Design: A NoC Case Study. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 637-648. DOI: 10.1109/HPCA47549.2020.00058. Online publication date: Feb-2020
  • (2019) Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution. ACM Transactions on Architecture and Code Optimization 16:3, pp. 1-27. DOI: 10.1145/3326124. Online publication date: 17-Jun-2019
  • (2018) Critical packet prioritisation by slack-aware re-routing in on-chip networks. Proceedings of the Twelfth IEEE/ACM International Symposium on Networks-on-Chip, pp. 1-8. DOI: 10.5555/3306619.3306631. Online publication date: 4-Oct-2018
  • (2018) Enhancing computation-to-core assignment with physical location information. ACM SIGPLAN Notices 53:4, pp. 312-327. DOI: 10.1145/3296979.3192386. Online publication date: 11-Jun-2018
  • (2018) Slim NoC. ACM SIGPLAN Notices 53:2, pp. 43-55. DOI: 10.1145/3296957.3177158. Online publication date: 19-Mar-2018
  • (2018) SPECTR. ACM SIGPLAN Notices 53:2, pp. 169-183. DOI: 10.1145/3296957.3173199. Online publication date: 19-Mar-2018
  • (2018) MASK. ACM SIGPLAN Notices 53:2, pp. 503-518. DOI: 10.1145/3296957.3173169. Online publication date: 19-Mar-2018
