research-article

Meeting points: using thread criticality to adapt multicore hardware to parallel regions

Authors:

José González,

Grigorios Magklis,

Pedro Chaparro,

Antonio González

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Pages 240 - 249

https://doi.org/10.1145/1454115.1454149

Published: 25 October 2008

Abstract

We present a novel mechanism, called meeting point thread characterization, to dynamically detect critical threads in a parallel region. We define the critical thread as the one with the longest completion time in the parallel region. Knowing the criticality of each thread has many potential applications. In this work, we propose two: thread delaying for multi-core systems and thread balancing for simultaneous multi-threaded (SMT) cores. Thread delaying saves energy by running the core containing the critical thread at maximum frequency while scaling down the frequency and voltage of the cores containing non-critical threads. Thread balancing improves overall performance by giving higher priority to the critical thread in the issue queue of an SMT core. Our experiments on a detailed microprocessor simulator with the Recognition, Mining, and Synthesis applications from Intel's research laboratory reveal that thread delaying can achieve energy savings exceeding 40% in the best cases with negligible performance loss. Thread balancing can improve performance by 1% to 20%.
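
As a rough illustration of the idea behind thread delaying, the sketch below picks the critical thread from a set of per-thread completion times and computes a reduced frequency for every other core so that it would finish no earlier than the critical one. This is not the paper's meeting point mechanism: the function names, the example times, and the assumption that execution time scales inversely with clock frequency are all hypothetical simplifications introduced here.

```python
from typing import Dict


def critical_thread(times: Dict[int, float]) -> int:
    """Return the id of the thread with the longest completion time."""
    return max(times, key=lambda tid: times[tid])


def delayed_frequencies(times: Dict[int, float], f_max: float) -> Dict[int, float]:
    """Compute a per-thread frequency for a simplified thread-delaying policy.

    Assumption (not from the paper): completion time is inversely
    proportional to frequency, and any frequency in (0, f_max] is available.
    The critical thread keeps f_max; every other thread is slowed just
    enough to finish at the same time as the critical thread.
    """
    t_crit = max(times.values())
    return {tid: f_max * t / t_crit for tid, t in times.items()}


if __name__ == "__main__":
    # Hypothetical completion times (seconds) measured at f_max = 2.0 GHz.
    measured = {0: 10.0, 1: 8.0, 2: 5.0}
    print("critical thread:", critical_thread(measured))
    print("frequencies (GHz):", delayed_frequencies(measured, f_max=2.0))
```

With these example numbers, thread 0 is critical and keeps the full 2.0 GHz, while threads 1 and 2 could in principle run at 1.6 GHz and 1.0 GHz. Real hardware offers only discrete voltage/frequency steps, and memory-bound phases do not scale linearly with frequency, so a practical policy would likely be more conservative.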

Index Terms

Meeting points: using thread criticality to adapt multicore hardware to parallel regions
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Information & Contributors

Information

Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

October 2008

328 pages

ISBN:9781605582825

DOI:10.1145/1454115

General Chair: Andreas Moshovos (University of Toronto, Canada)

Program Chairs: David Tarditi (Microsoft, USA), Kunle Olukotun (Stanford University, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '08

PACT '08: International Conference on Parallel Architectures and Compilation Techniques

October 25 - 29, 2008

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 68
Total Downloads: 469
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 0

Reflects downloads up to 02 Mar 2025

Citations

Cited By

Bitalebi H, Safaei F (2022). Criticality-aware priority to accelerate GPU memory access. The Journal of Supercomputing, 79:1 (188-213). Online publication date: 6-Jul-2022.
https://doi.org/10.1007/s11227-022-04657-3
Ortega C, Alvarez L, Casas M, Bertran R, Buyuktosunoglu A, Eichenberger A, Bose P, Moreto M (2021). Intelligent Adaptation of Hardware Knobs for Improving Performance and Power Consumption. IEEE Transactions on Computers, 70:1 (1-16). Online publication date: 1-Jan-2021.
https://doi.org/10.1109/TC.2020.2980230
Tian Z, Chen L, Li X, Feng J, Xu J (2021). Multi-Core Power Management through Deep Reinforcement Learning. 2021 IEEE International Symposium on Circuits and Systems (ISCAS) (1-5). Online publication date: May-2021.
https://doi.org/10.1109/ISCAS51556.2021.9401447
Dimić V, Moretó M, Casas M, Valero M (2021). PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy. Euro-Par 2021: Parallel Processing (599-615). Online publication date: 25-Aug-2021.
https://doi.org/10.1007/978-3-030-85665-6_37
Wang B, Lu Z (2020). Advance Virtual Channel Reservation. IEEE Transactions on Computers, 69:9 (1320-1334). Online publication date: 1-Sep-2020.
https://doi.org/10.1109/TC.2020.2971982
Yao Y, Lu Z (2020). Pursuing Extreme Power Efficiency With PPCC Guided NoC DVFS. IEEE Transactions on Computers, 69:3 (410-426). Online publication date: 1-Mar-2020.
https://doi.org/10.1109/TC.2019.2949807
Wang B, Lu Z (2019). Advance Virtual Channel Reservation. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1178-1183). Online publication date: Mar-2019.
https://doi.org/10.23919/DATE.2019.8715104
Shrivastava R, Nandivada V (2017). Energy-Efficient Compilation of Irregular Task-Parallel Loops. ACM Transactions on Architecture and Code Optimization, 14:4 (1-29). Online publication date: 14-Nov-2017.
https://dl.acm.org/doi/10.1145/3136063
Padmanabha S, Lukefahr A, Das R, Mahlke S, Hunter H, Moreno J, Emer J, Sanchez D (2017). Mirage cores. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (745-758). Online publication date: 14-Oct-2017.
https://dl.acm.org/doi/10.1145/3123939.3123969
Li J, Li M, Xue C, Ouyang Y, Shen F (2017). Thread Criticality Assisted Replication and Migration for Chip Multiprocessor Caches. IEEE Transactions on Computers, 66:10 (1747-1762). Online publication date: 1-Oct-2017.
https://doi.org/10.1109/TC.2017.2705678
