Abstract
Intel’s second-generation Xeon Phi processor, code-named Knights Landing (KNL), is optimized for parallel operations and is widely used in high performance computing systems because of its many integrated cores and its high-bandwidth on-package memory, Multi-Channel DRAM (MCDRAM). The recent emergence of data-intensive applications, together with the adoption of many-core processors, has further raised the I/O performance requirements of high performance computing systems. It is therefore necessary to understand and analyze the I/O characteristics of many-integrated-core systems. In this paper, we experimentally analyze the I/O characteristics of KNL, focusing on single-thread buffered write operations. We find that KNL has a bottleneck in buffered writes, which go through the page cache. To locate this bottleneck and identify its cause, we conduct two kinds of experiments. First, we measure the execution time of kernel functions along the kernel I/O path. Second, we count the occurrences of system events such as cache misses and branch misses. Based on the results of these experiments, we discuss the characteristics of KNL’s I/O performance and its bottlenecks.
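As a rough illustration of the workload studied here, a single-thread buffered write can be timed with a few lines of Python. This is a minimal sketch and not the authors' benchmark: the function name `timed_buffered_write`, the 1 MiB block size, and the 64 MiB file size are assumptions chosen for the example. The writes use plain `write()` calls with no `fsync` and no `O_DIRECT`, so on Linux the data would normally be absorbed by the page cache.

```python
import os
import tempfile
import time


def timed_buffered_write(path, total_bytes, block_size=1 << 20):
    """Write total_bytes to path from one thread and return (bytes_written, seconds).

    Hypothetical helper for illustration only. No fsync or O_DIRECT is used,
    so the writes are buffered by the operating system.
    """
    block = b"\0" * block_size
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        while written < total_bytes:
            # Trim the last block so the file ends at exactly total_bytes.
            chunk = block[: min(block_size, total_bytes - written)]
            # os.write may write fewer bytes than requested; count what it reports.
            written += os.write(fd, chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return written, elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "buffered_write.bin")
        size = 64 * (1 << 20)  # 64 MiB, an arbitrary size for the sketch
        written, elapsed = timed_buffered_write(target, size)
        print(f"{written / (1 << 20):.0f} MiB written in {elapsed:.3f} s")
```

One way to collect event counts of the kind mentioned above would be to run such a script under the Linux `perf` tool, for example `perf stat -e cache-misses,branch-misses python3 script.py`, where `script.py` is a placeholder file name. The abstract does not specify the tooling the authors used, so this is only one possible setup.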
Data availability
Available upon request
Funding
This work was supported by the Korea Institute of Science and Technology Information (K-21-L02-C08-S01), the National Supercomputing Center with supercomputing resources including technical support (KSC-2020-INO-0044), the PF Class Heterogeneous High Performance Computer Development Program (NRF-2016M3C4A7952587), the Next-Generation Information Computing Development Program (NRF-2015M3C4A7065646), the Basic Science Research Program (NRF-2020R1F1A1072696), and BK21 FOUR Intelligence Computing (4199990214639, Dept. of Computer Science and Engineering, Seoul National University) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT, the Seoul R&D Program (CY20038) “Commercializing of technology for CAT Pro Web service based on AI”, and the Technology Development Program (S2878336) funded by the Ministry of SMEs and Startups (MSS, Korea).
Author information
Contributions
CL: Software, Writing—original draft, Review, Validation. JL: Conceptualization, Supervision, Writing—original draft, Funding acquisition, Project administration. DK: Software, Validation, Data curation. CK: Writing—original, Validation, Methodology. JB: Methodology, Data curation. EB: Supervision, Resources, Funding acquisition. HE: Project administration, Funding acquisition, Conceptualization.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Lee, C., Lee, J., Koo, D. et al. Towards enhanced I/O performance of a highly integrated many-core processor by empirical analysis. Cluster Comput 26, 2643–2655 (2023). https://doi.org/10.1007/s10586-021-03288-2
DOI: https://doi.org/10.1007/s10586-021-03288-2