Compiler-Guided Parallelism Adaption Based on Application Partition for Power-Gated ILP Processor
Pages 1329 - 1341
Abstract
Instruction-level parallelism (ILP) processors have been widely used to improve speed for several decades. However, the parallelism an application requires varies between applications and even within a single application. A fixed, high degree of parallelism can therefore result in poor utilization and extra leakage energy. Designing energy-efficient ILP processors that trade off power against speed has become a critical research issue. In this paper, a compiler-guided parallelism adaption based on an application partition algorithm is proposed to adapt parallelism while applications run on ILP processors. The aim is to minimize energy consumption without degrading the execution time. The main idea is as follows: 1) partition the application into several power gating regions (PGRs); 2) assign an adapted parallelism to each region by analyzing its resource requirements and energy efficiency; and 3) reschedule each region with its own parallelism and insert power-gating instructions into the application to switch hardware ON/OFF. Experimental results on the CoreMarkPro benchmark suite show the expected savings in leakage energy. Our algorithm reduces the leakage energy in register files by 30.46% and 64.06% for applications with high variance in software-inherent parallelism. Furthermore, the energy overhead caused by state transitions is much lower than that of Tabkhi’s algorithm.
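The three steps listed in the abstract can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not the algorithm from the paper: the `Block` fields, the break-even threshold, the greedy merge rule, and the `PG_SET` pseudo-instruction are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    cycles: int   # hypothetical execution length of the block
    demand: int   # hypothetical number of issue slots the block can use


@dataclass
class Region:
    blocks: list
    width: int    # parallelism assigned to the whole region

    @property
    def cycles(self):
        return sum(b.cycles for b in self.blocks)


def partition(blocks, break_even_cycles):
    """Steps 1 and 2 (toy version): group blocks into regions and pick a width."""
    # Start a new region whenever the demanded width changes.
    regions = []
    for b in blocks:
        if regions and regions[-1].width == b.demand:
            regions[-1].blocks.append(b)
        else:
            regions.append(Region([b], b.demand))
    # Fold regions too short to amortize a state transition into their
    # predecessor; the merged region takes the larger width so that no
    # block runs with fewer slots than it demands.
    merged = []
    for r in regions:
        if merged and r.cycles < break_even_cycles:
            prev = merged[-1]
            prev.blocks.extend(r.blocks)
            prev.width = max(prev.width, r.width)
        else:
            merged.append(r)
    return merged


def emit(regions, max_width):
    """Step 3 (toy version): insert a gating pseudo-instruction on width changes."""
    out = []
    current = max_width  # assumption: all units are powered on at entry
    for r in regions:
        if r.width != current:
            out.append(f"PG_SET {r.width}")
            current = r.width
        out.extend(b.name for b in r.blocks)
    return out


blocks = [
    Block("A", 100, 4),
    Block("B", 5, 1),
    Block("C", 80, 4),
    Block("D", 150, 1),
    Block("E", 60, 2),
]
regions = partition(blocks, break_even_cycles=20)
program = emit(regions, max_width=4)
print(program)
```

In this example the 5-cycle block B is absorbed into the preceding wide region instead of triggering a power-down, so gating instructions appear only before D and E, where the narrower width lasts long enough to plausibly pay for the transition.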
References
[1]
H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, 2012.
[2]
Y. Shin, J. Seomun, K.-M. Choi, and T. Sakurai, “Power gating: Circuits, design methodologies, and best practice for standard-cell VLSI designs,” ACM Trans. Design Autom. Elect. Syst., vol. 15, no. 4, 2010, Art. no. .
[3]
A. Lambrechts et al., “Power breakdown analysis for a heterogeneous NoC platform running a video application,” in Proc. IEEE ASAP, Jul. 2005, pp. 179–184.
[4]
V. Zyuban and P. Kogge, “The energy complexity of register files,” in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 1998, pp. 305–310.
[5]
S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, “Register organization for media processing,” in Proc. 6th Int. Symp. High-Perform. Comput. Archit. (HPCA), 2000, pp. 375–386.
[6]
Z. Liang, W. Zhang, and Y.-C. Ma, “Deadline-constrained clustered scheduling for VLIW architectures using power-gated register files,” ACM Trans. Archit. Code Optim., vol. 11, no. 2, 2014, Art. no. .
[7]
EEMBC Industry-Standard Benchmarks for Embedded Systems, accessed on 2016. [Online]. Available: http://www.eembc.org/coremark/index.php?b=pro.htm
[8]
(2011). 40nm Technology. [Online]. Available: http://www.tsmc.com/english/dedicatedfoundry/technology/40nm.htm
[9]
A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, “Enhancing the efficiency of energy-constrained DVFS designs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 10, pp. 1769–1782, 2013.
[10]
K. K. Rangan, G.-Y. Wei, and D. Brooks, “Thread motion: Fine-grained power management for multi-core systems,” ACM SIGARCH Comput. Archit. News, vol. 37, no. 3, pp. 302–313, 2009.
[11]
Q. Cai, J. González, G. Magklis, P. Chaparro, and A. González, “Thread shuffling: Combining DVFS and thread migration to reduce energy consumptions for multi-core systems,” in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2011, pp. 379–384.
[12]
Y.-F. Tsai, A. H. Ankadi, N. Vijaykrishnan, M. J. Irwin, and T. Theocharides, “ChipPower: An architecture-level leakage simulator,” in Proc. IEEE Int. SoC Conf., Sep. 2004, pp. 395–398.
[13]
S. Roy, N. Ranganathan, and S. Katkoori, “A framework for power-gating functional units in embedded microprocessors,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 11, pp. 1640–1649, 2009.
[14]
M. Wang, Y. Wang, D. Liu, Z. Qin, and Z. Shao, “Compiler-assisted leakage-aware loop scheduling for embedded VLIW DSP processors,” J. Syst. Softw., vol. 83, no. 5, pp. 772–785, 2010.
[15]
R. Nagpal and Y. N. Srikant, “Compiler-assisted power optimization for clustered VLIW architectures,” Parallel Comput., vol. 37, no. 1, pp. 42–59, 2011.
[16]
H. S. Kim, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Adapting instruction level parallelism for optimizing leakage in VLIW architectures,” in Proc. LCTES, 2013, pp. 275–283.
[17]
M. B. Henry and L. Nazhandali, “NEMS-based functional unit powergating: Design, analysis, and optimization,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 2, pp. 290–302, 2013.
[18]
Y.-P. You, C.-W. Huang, and J. K. Lee, “Compilation for compact power-gating controls,” ACM Trans. Design Autom. Elect. Syst., vol. 12, no. 4, 2007, Art. no. .
[19]
M. Kondo et al., “Design and evaluation of fine-grained power-gating for embedded microprocessors,” in Proc. Conf. Design, Autom. Test Eur. (DATE), 2014, Art. no. .
[20]
D. Patti, M. Palesi, and V. Catania, Merging Compilation and Microarchitectural Configuration Spaces for Performance/Power Optimization in VLIW-Based Systems, vol. 285. New York, NY, USA: Springer, 2014, pp. 203–212.
[21]
S. Cao, Z. Li, F. Wang, and S. Wei, “Compiler-assisted leakage- and temperature-aware instruction-level VLIW scheduling,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 6, pp. 1416–1428, 2014.
[22]
N. Goel, A. Kumar, and P. R. Panda, “Shared-port register file architecture for low-energy VLIW processors,” ACM Trans. Archit. Code Optim., vol. 11, no. 1, 2014, Art. no. .
[23]
V. Porpodas and M. Cintra, “CAeSaR: Unified cluster-assignment scheduling and communication reuse for clustered VLIW processors,” in Proc. Int. Conf. Compilers, Archit. Synthesis Embedded Syst. (CASES), Sep. 2013, pp. 1–10.
[24]
Q. Cai, J. M. Codina, J. Gonzalez, and A. Gonzalez, “A software-hardware hybrid steering mechanism for clustered microarchitectures,” in Proc. IEEE Int. Symp. Parallel Distrib. Process. (IPDPS), Apr. 2008, pp. 1–12.
[25]
J. M. Codina, J. Sanchez, and A. Gonzalez, “Virtual cluster scheduling through the scheduling graph,” in Proc. CGO, 2007, pp. 89–101.
[26]
S. Roy, N. Ranganathan, and S. Katkoori, “State-retentive power gating of register files in multicore processors featuring multithreaded in-order cores,” IEEE Trans. Comput., vol. 60, no. 11, pp. 1547–1560, 2011.
[27]
H. Tabkhi and G. Schirner, “Application-guided power gating reducing register file static power,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 12, pp. 2513–2526, 2014.
[28]
C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in Proc. CGO, Mar. 2004, pp. 75–86.
Information
Published In
Copyright © 2017.
Publisher
IEEE Educational Activities Department
United States
Publication History
Published: 01 April 2017
Qualifiers
- Research-article