DOI: 10.1109/SC41406.2024.00062

MCBound: An Online Framework to Characterize and Classify Memory/Compute-bound HPC Jobs

Published: 17 November 2024

Abstract

Modern High-Performance Computing (HPC) systems play a fundamental role in driving scientific research, as they execute computationally intensive jobs originating from diverse domains. However, HPC jobs have conflicting computational requirements, which may cause inefficiencies in resource usage, system throughput, and energy consumption. One approach to tackling this problem is to distinguish between memory-bound and compute-bound jobs at submission time, so that informed decisions can be made about their execution. In this paper, we present MCBound, the first online data-driven framework to classify HPC jobs as memory-bound or compute-bound before job execution, without user intervention. We propose a systematic characterization technique to generate a reference dataset from historical data for initial classification model training. Using the proposed characterization technique, we analyze the data of 2.2 million job runs on the supercomputer Fugaku, a production HPC system installed at the RIKEN Center for Computational Science in Japan. We implement MCBound for Fugaku and classify the jobs executed during February 2024. Our approach proves effective: it obtains an F1-macro average score of at least 0.89 as prediction quality, while incurring negligible overhead on the system's operations. Our Python-based implementation of MCBound can be configured and deployed on other HPC systems.
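
To make the reported metric concrete, the following is a minimal sketch of how an F1-macro score can be computed for a two-class memory/compute-bound labeling. It is not taken from the MCBound implementation: the function name `f1_macro`, the label strings, and the six example jobs are hypothetical and serve only to illustrate the metric.

```python
def f1_macro(y_true, y_pred, labels=("memory-bound", "compute-bound")):
    """Return the unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        # Count true positives, false positives and false negatives for this class.
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six jobs (illustrative only, not data from the paper).
y_true = ["memory-bound"] * 4 + ["compute-bound"] * 2
y_pred = ["memory-bound"] * 3 + ["compute-bound"] * 3

score = f1_macro(y_true, y_pred)
print(f"F1-macro: {score:.3f}")
```

Because the macro average weights both classes equally, a classifier cannot reach a high score by favoring the more frequent class, which matters when the two job types are unevenly represented.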



Published In

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2024
1758 pages
ISBN: 9798350352917

Publisher

IEEE Press

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '24
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
