DOI: 10.1145/2949550.2949581
research-article
Public Access

Minimization of Xeon Phi Core Use with Negligible Execution Time Impact

Published: 17 July 2016

Abstract

GPUs have been components of HPC clusters for many years (e.g., Titan and Piz Daint), whereas the Intel® Xeon Phi™ has been included only in recent years (e.g., Tianhe-2 and Stampede). In the November 2015 Top500 list, GPUs are in 14% of systems, while the Xeon Phi™ is in 6%. Intel® introduced the Xeon Phi™ to compete with NVIDIA GPUs by offering a unified environment that supports OpenMP and MPI, and by providing competitive, easier-to-utilize processing power with lower energy consumption. Achieving the best Xeon Phi™ execution times requires programs that have high data parallelism, scale well, and use parallel algorithms. In addition, Phi™ power performance and throughput can be improved by reducing the number of cores employed for application execution. Accordingly, the objectives of this paper are to:

(1) Demonstrate that some applications can be executed with fewer cores than are available to users, with negligible impact on execution time. For 59.3% of the 27 application instances studied, using fewer cores results in better performance, and for 37%, using less than half of the available cores degrades performance by no more than 10% in the worst case.

(2) Develop a tool that provides the user with the optimal number of cores to employ. We designed an algorithm and developed a plugin for the Periscope Tuning Framework, an automatic performance tuner, that provides the user with an estimate of this number for a given application.

(3) Understand whether performance metrics can be used to identify applications that can be executed with fewer cores with negligible impact on execution time. Via statistical analysis, we identified three metrics that are indicative of this, at least for the application instances studied: a low L1 Compute to Data Access ratio (the average number of computations performed per byte of data loaded/stored in the L1 cache), high use of data bandwidth, and, to a lesser extent, low vectorization intensity.
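
The idea of running on fewer cores with negligible execution-time impact can be illustrated with a minimal sketch. This is not the algorithm or the Periscope plugin described in the abstract, whose details are not given here. It assumes that execution times have already been measured for several core counts, takes the run on the largest core count as the baseline, and returns the smallest core count whose time stays within a chosen tolerance of that baseline. The function name, the 10% default tolerance, and the sample timings are all hypothetical.

```python
from typing import Mapping


def smallest_core_count(times: Mapping[int, float], tolerance: float = 0.10) -> int:
    """Return the fewest cores whose execution time is within `tolerance`
    (a fraction, e.g. 0.10 for 10%) of the time measured on the largest
    core count.

    Assumption: `times` maps a core count to a measured execution time in
    seconds, and the largest key stands for "all available cores".
    """
    if not times:
        raise ValueError("at least one measurement is required")
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")

    baseline = times[max(times)]            # time when using all measured cores
    limit = baseline * (1.0 + tolerance)    # slowest time still accepted
    # The baseline itself always satisfies the limit, so the set is never empty.
    return min(cores for cores, t in times.items() if t <= limit)


# Hypothetical measurements (cores -> seconds); not data from the paper.
measured = {15: 20.0, 30: 12.4, 45: 11.0, 60: 11.5}
print(smallest_core_count(measured))
```

Setting the tolerance to 0 restricts the choice to core counts that are at least as fast as the baseline, which corresponds to the case where using fewer cores improves performance.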

        Published In

        ACM Other conferences
        XSEDE16: Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale
        July 2016
        405 pages
        ISBN: 9781450347556
        DOI: 10.1145/2949550
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Intel® Xeon Phi™
        2. Performance
        3. Periscope
        4. autotuning
        5. parallel programming

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        XSEDE16

        Acceptance Rates

        Overall Acceptance Rate 129 of 190 submissions, 68%
