Minimization of Xeon Phi Core Use with Negligible Execution Time Impact

Authors:

Roberto Camacho Barranco,

Patricia J. Teller,

Michael Gerndt

XSEDE16: Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale

Article No.: 16, Pages 1 - 8

https://doi.org/10.1145/2949550.2949581

Published: 17 July 2016

Abstract

For many years, GPUs have been components of HPC clusters (e.g., Titan and Piz Daint), while the Intel® Xeon Phi™ has been included only in recent years (e.g., Tianhe-2 and Stampede). GPUs are in 14% of the systems in the November 2015 Top500 list, while the Xeon Phi™ is in 6%. Intel® introduced the Xeon Phi™ to compete with NVIDIA GPUs by offering a unified environment that supports OpenMP and MPI, and by providing competitive and easier-to-utilize processing power with less energy consumption. Maximum Xeon Phi™ execution-time performance requires that programs have high data parallelism and good scalability, and that they use parallel algorithms. In addition, improved Xeon Phi™ power performance and throughput can be achieved by reducing the number of cores employed for application execution. Accordingly, the objectives of this paper are to: (1) Demonstrate that some applications can be executed with fewer cores than are available to users with a negligible impact on execution time: for 59.3% of the 27 application instances studied, doing this results in better performance, and for 37%, using less than half of the available cores results in a performance degradation of no more than 10% in the worst case. (2) Develop a tool that provides the user with the optimal number of cores to employ: we designed an algorithm and developed a plugin for the Periscope Tuning Framework, an automatic performance tuner, that provides the user with an estimate of this number for a given application. (3) Understand whether performance metrics can be used to identify applications that can be executed with fewer cores with a negligible impact on execution time: via statistical analysis, we identified the following three metrics that are indicative of this, at least for the application instances studied: a low L1 Compute to Data Access ratio (i.e., the average number of computations performed per byte of data loaded from or stored to the L1 cache), high use of data bandwidth, and, to a lesser extent, low vectorization intensity.
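The selection criterion behind objectives (1) and (2) can be illustrated with a minimal sketch. This is not the algorithm from the paper or the Periscope Tuning Framework plugin, neither of which is described in the abstract. It assumes that execution times have already been measured for a few core counts, that the run on the largest core count is the baseline, and that "negligible impact" means a slowdown within a chosen tolerance (10% here, echoing the worst-case figure in the abstract). The function name `smallest_core_count` and the timings are hypothetical.

```python
from typing import Mapping


def smallest_core_count(times: Mapping[int, float], tolerance: float = 0.10) -> int:
    """Return the smallest core count whose execution time stays within
    `tolerance` of the time measured with the largest core count.

    Assumption: `times` maps a core count to a measured execution time in
    seconds, and the run with the most cores serves as the baseline.
    """
    if not times:
        raise ValueError("times must not be empty")
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")

    baseline = times[max(times)]           # time when using all measured cores
    limit = baseline * (1.0 + tolerance)   # slowest time still considered acceptable

    # The baseline run always satisfies the limit, so this set is never empty.
    return min(cores for cores, t in times.items() if t <= limit)


# Hypothetical timings for illustration only; these are not data from the paper.
timings = {15: 21.0, 30: 12.4, 45: 12.1, 60: 11.8}
chosen = smallest_core_count(timings)
print(f"Smallest core count within 10% of the 60-core run: {chosen}")
```

An exhaustive sweep like this is only practical when a handful of core counts are measured; a tuner would presumably search the space more selectively, but the acceptance test would have the same shape.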
