Abstract
As demand for computing resources grows and supplying the necessary energy becomes harder, power-aware resource management is becoming a major issue for the high-performance computing (HPC) community. Integrating reliable energy management into a supercomputer’s resource and job management system (RJMS) is not an easy task: the energy consumption of jobs is rarely known in advance, and the workload of every machine is unique.
We argue that the first step toward properly managing energy is to deeply understand the energy consumption of the workload, which involves predicting the workload’s power consumption and exploiting the predictions with smart power-aware scheduling algorithms. Crucial questions are (i) how sophisticated a prediction method needs to be to provide accurate workload power predictions, and (ii) to what extent an accurate workload power prediction translates into efficient energy management.
In this work, we propose a method to predict and exploit HPC workloads’ power consumption, with the objective of reducing the supercomputer’s power consumption while maintaining the management (scheduling) performance of the RJMS. Our method exploits workload submission logs with power monitoring data, and relies on a mix of light-weight power prediction methods and a heuristic inspired by classical EASY Backfilling.
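To illustrate how predicted power could feed a backfilling heuristic, the following Python sketch adds a power-budget check to a simplified, single-step EASY backfilling pass. It is not the heuristic evaluated in this paper: the `Job` fields, the `power_budget` parameter, and the choice to compute the head job's reservation from node counts only are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int        # number of requested nodes
    walltime: float   # user-requested walltime, in seconds
    power: float      # predicted power draw in watts (hypothetical field)


def schedule_step(now, free_nodes, power_budget, running, queue):
    """One scheduling pass. `running` is a list of (end_time, Job) pairs and
    `queue` is the FCFS-ordered list of waiting jobs. Returns the jobs to start now."""
    started = []
    used_power = sum(job.power for _, job in running)
    queue = list(queue)

    # Phase 1: start jobs from the head of the queue while they fit
    # both the free nodes and the remaining power budget.
    while (queue and queue[0].nodes <= free_nodes
           and used_power + queue[0].power <= power_budget):
        job = queue.pop(0)
        started.append(job)
        running = running + [(now + job.walltime, job)]
        free_nodes -= job.nodes
        used_power += job.power
    if not queue:
        return started

    # Phase 2: reserve nodes for the blocked head job (node counts only,
    # an assumption of this sketch): find the earliest release time at which
    # enough nodes are free, and how many nodes are left over at that time.
    head = queue[0]
    avail = free_nodes
    shadow_time, extra_nodes = float("inf"), 0
    for end, job in sorted(running, key=lambda pair: pair[0]):
        avail += job.nodes
        if avail >= head.nodes:
            shadow_time, extra_nodes = end, avail - head.nodes
            break

    # Phase 3: backfill later jobs that fit now (nodes and power) and
    # do not delay the head job's reservation.
    for job in queue[1:]:
        fits_now = (job.nodes <= free_nodes
                    and used_power + job.power <= power_budget)
        ends_before = now + job.walltime <= shadow_time
        if fits_now and (ends_before or job.nodes <= extra_nodes):
            started.append(job)
            free_nodes -= job.nodes
            used_power += job.power
            if not ends_before:
                extra_nodes -= job.nodes
    return started
```

In this sketch, a job is rejected either because it would delay the head job (the usual EASY rule) or because its predicted power would exceed the budget, so the quality of the power prediction directly affects which jobs are backfilled.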
We base this study on logs of Marconi 100, a 980-server supercomputer. We show through simulation that a light-weight history-based prediction method can provide power predictions accurate enough to improve the energy management of a large-scale supercomputer compared to energy-unaware scheduling algorithms. These improvements have no significant negative impact on performance.
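As an illustration of what a light-weight history-based predictor can look like, the Python sketch below predicts a job's power as the mean per-node power of the same user's previously completed jobs, and falls back to a default value for users with no history. The per-user grouping, the fallback rule, and all names (`HistoryPowerPredictor`, `default_power_per_node`) are assumptions chosen for illustration, not necessarily the prediction rule used in this study.

```python
from collections import defaultdict


class HistoryPowerPredictor:
    """Predicts job power from the running mean of a user's past jobs."""

    def __init__(self, default_power_per_node):
        # Fallback per-node power (W) for users without any completed job.
        self.default_power_per_node = default_power_per_node
        self._sum = defaultdict(float)   # sum of per-node power, per user
        self._count = defaultdict(int)   # number of completed jobs, per user

    def observe(self, user, nodes, mean_power):
        """Record the measured mean power (W) of a completed job."""
        self._sum[user] += mean_power / nodes
        self._count[user] += 1

    def predict(self, user, nodes):
        """Return the predicted power (W) of a new job on `nodes` nodes."""
        count = self._count.get(user, 0)
        if count == 0:
            per_node = self.default_power_per_node
        else:
            per_node = self._sum[user] / count
        return per_node * nodes


if __name__ == "__main__":
    predictor = HistoryPowerPredictor(default_power_per_node=500.0)
    print(predictor.predict("alice", nodes=2))   # no history: uses the default
    predictor.observe("alice", nodes=4, mean_power=1200.0)
    predictor.observe("alice", nodes=1, mean_power=500.0)
    print(predictor.predict("alice", nodes=2))   # uses alice's history
```

Such a predictor needs only constant time and memory per user, which is what makes it cheap enough to run inside the scheduler at every job submission.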
INP—Institute of engineering Univ. Grenoble Alpes.
Notes
1. The term homogeneous in this paper means that all computing nodes have the same CPU/accelerators configuration.
References
Antici, F., Yamamoto, K., Domke, J., Kiziltan, Z.: Augmenting ml-based predictive modelling with NLP to forecast a job’s power consumption. In: Proceedings of the SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1820–1830 (2023)
Bates, N., et al.: Electrical grid and supercomputing centers: an investigative analysis of emerging opportunities and challenges. Informatik-Spektrum 38(2), 111–127 (2015)
Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Predictive modeling for job power consumption in HPC systems. In: Kunkel, J.M., Balaji, P., Dongarra, J. (eds.) ISC High Performance 2016. LNCS, vol. 9697, pp. 181–199. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41321-1_10
Borghesi, A., et al.: M100 ExaData: a data collection campaign on the CINECA’s marconi100 tier-0 supercomputer. Sci. Data 10(1), 288 (2023)
Bugbee, B., Phillips, C., Egan, H., Elmore, R., Gruchalla, K., Purkayastha, A.: Prediction and characterization of application power use in a high-performance computing environment. Stat. Anal. Data Mining ASA Data Sci. J. 10(3), 155–165 (2017)
Casanova, H., Giersch, A., Legrand, A., Quinson, M., Suter, F.: Versatile, scalable, and accurate simulation of distributed applications and platforms. J. Parallel Distrib. Comput. 74(10), 2899–2917 (2014)
Chasapis, D., Moretó, M., Schulz, M., Rountree, B., Valero, M., Casas, M.: Power efficient job scheduling by predicting the impact of processor manufacturing variability. In: Proceedings of the ACM International Conference on Supercomputing, pp. 296–307 (2019)
Da Costa, G., Pierson, J.M., Fontoura-Cupertino, L.: Mastering system and power measures for servers in datacenter. Sustain. Comput. Inform. Syst. 15, 28–38 (2017). https://doi.org/10.1016/j.suscom.2017.05.003
Dutot, P.F., Mercier, M., Poquet, M., Richard, O.: Batsim: a realistic language-independent resources and jobs management systems simulator. In: 20th Workshop on Job Scheduling Strategies for Parallel Processing, Chicago, United States (2016). https://hal.science/hal-01333471
Emeras, J.: Workload Traces Analysis and Replay in Large Scale Distributed Systems. Theses, Université de Grenoble (2013)
Feitelson, D.G., Rudolph, L., Schwiegelshohn, U., Sevcik, K.C., Wong, P.: Theory and practice in parallel job scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 1–34. Springer, Heidelberg (1997). https://doi.org/10.1007/3-540-63574-2_14
Feitelson, D.G., Weil, A.M.: Utilization and predictability in scheduling the IBM SP2 with backfilling. In: Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pp. 542–546. IEEE (1998)
Gaussier, E., Glesser, D., Reis, V., Trystram, D.: Improving backfilling by using machine learning to predict running times. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC 2015. Association for Computing Machinery, New York (2015)
Khan, K.N., Hirki, M., Niemi, T., Nurminen, J.K., Ou, Z.: RAPL in action: experiences in using RAPL for power measurements. ACM Trans. Model. Perform. Eval. Comput. Syst. 3(2) (2018). https://doi.org/10.1145/3177754
Kocot, B., Czarnul, P., Proficz, J.: Energy-aware scheduling for high-performance computing systems: a survey. Energies 16(2), 890 (2023)
Oak Ridge National Laboratory: Frontier’s architecture (2023). https://olcf.ornl.gov/wp-content/uploads/Frontiers-Architecture-Frontier-Training-Series-final.pdf. Accessed 29 Nov 2023
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Poquet, M., Carastan-Santos, D., Da Costa, G., Stolf, P., Trystram, D.: Artifact data of article “light-weight prediction for improving energy consumption in HPC platforms”. Euro-Par 2024 (2024). https://doi.org/10.5281/zenodo.11173631
Saillant, T., Weill, J.-C., Mougeot, M.: Predicting job power consumption based on RJMS submission data in HPC systems. In: Sadayappan, P., Chamberlain, B.L., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12151, pp. 63–82. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50743-5_4
Shoukourian, H., Wilde, T., Auweter, A., Bode, A.: Predicting the energy and power consumption of strong and weak scaling HPC applications. Supercomput. Front. Innovations 1(2), 20–41 (2014)
Storlie, C., Sexton, J., Pakin, S., Lang, M., Reich, B., Rust, W.: Modeling and predicting power consumption of high performance computing jobs (2015)
Wikipedia: 2021 Texas power crisis (2023). https://en.wikipedia.org/wiki/2021_Texas_power_crisis. Accessed 29 Nov 2023
Zrigui, S., de Camargo, R.Y., Legrand, A., Trystram, D.: Improving the performance of batch schedulers using online job runtime classification. J. Parallel Distrib. Comput. 164, 83–95 (2022)
Acknowledgements and Artifact Availability
Experiments presented in this paper were carried out using the Grid’5000 testbed. This work was supported by the research program on Edge Intelligence of the Multi-disciplinary Institute on Artificial Intelligence MIAI at Grenoble Alpes (ANR-19-P3IA-0003), ENERGUMEN (ANR-18-CE25-0008), the France 2030 NumPEx Exa-SofT (ANR-22-EXNU-0003) and Cloud CareCloud (ANR-23-PECL-0003) projects managed by the French National Research Agency (ANR), REGALE (H2020-JTI-EuroHPC-2019-1 agreement n. 956560), and LIGHTAIDGE (HORIZON-MSCA-2022-PF-01 agreement n. 101107953). A CC-BY public copyright licence has been applied by the authors to the present document and will be applied to all subsequent versions up to the Author Accepted Manuscript arising from this submission, in accordance with the grants’ open access conditions. We thank Salah Zrigui for starting the study on the job energy profiles. We also thank Francesco Antici for curating and sharing the Marconi100 dataset. The experiments described in this article were designed with open science and reproducibility in mind. Code, data and documentation to reproduce our work are available on Zenodo [18].
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Carastan-Santos, D., Da Costa, G., Poquet, M., Stolf, P., Trystram, D. (2024). Light-Weight Prediction for Improving Energy Consumption in HPC Platforms. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14801. Springer, Cham. https://doi.org/10.1007/978-3-031-69577-3_11
DOI: https://doi.org/10.1007/978-3-031-69577-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69576-6
Online ISBN: 978-3-031-69577-3
eBook Packages: Computer Science (R0)