
Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

Published: 01 December 2019

Abstract

Load imbalance in parallel systems can be caused by factors external to the running applications, such as operating system noise, or by the underlying hardware, such as a heterogeneous cluster. HPC applications that work on irregular data structures can also have difficulty balancing their computations across the parallel tasks. In this article we extend, improve, and evaluate more deeply the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are loaded with only the useful work of the parallel application tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, placing tasks on a minimum number of CPUs and using oversubscription. We design and implement a more efficient version of this mechanism that performs the Task Packing “in place”, taking advantage of the idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using the FT and miniFE benchmarks. Results show that our proposal generates low overhead. In addition, the number of freed CPUs is related to a load imbalance metric, which can therefore be used as a predictor.
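The packing step can be illustrated with a minimal sketch (not the authors' implementation): suppose each task is characterized by the fraction of CPU time it spends on useful work, and a 0/1 Knapsack is solved repeatedly to fill one CPU at a time with as much useful work as fits, so that the idle cycles are concentrated and some CPUs end up empty. The scaling factor and the greedy one-CPU-at-a-time loop are assumptions for illustration.

```python
def pack_tasks(loads, capacity=1.0, scale=100):
    """Greedy Task Packing sketch: repeatedly solve a 0/1 Knapsack to
    fill one CPU with as much useful work as fits, until every task is
    placed.  `loads` are useful-work fractions of one CPU (0..1).
    Returns a list of CPUs, each a list of (task_id, load) pairs."""
    cap = int(capacity * scale)            # scale fractional loads to integers for the DP
    items = list(enumerate(loads))
    cpus = []
    while items:
        # dp[c] = best total scaled load achievable within capacity c;
        # choice[c] remembers which tasks realize it.
        dp = [0] * (cap + 1)
        choice = [[] for _ in range(cap + 1)]
        for tid, load in items:
            w = int(round(load * scale))
            for c in range(cap, w - 1, -1):
                if dp[c - w] + w > dp[c]:
                    dp[c] = dp[c - w] + w
                    choice[c] = choice[c - w] + [(tid, load)]
        packed = choice[cap] or [items[0]]  # fallback: an oversized task gets its own CPU
        picked = {tid for tid, _ in packed}
        cpus.append(packed)
        items = [t for t in items if t[0] not in picked]
    return cpus

# Four tasks whose useful-work fractions sum to 2.0 pack onto two CPUs,
# freeing the other two CPUs they would otherwise occupy.
print(pack_tasks([0.6, 0.4, 0.7, 0.3]))
```

Oversubscription enters when several packed tasks time-share one CPU: because their combined useful work still fits within the CPU's capacity, performance is not degraded at synchronization points.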

Highlights

A scalable and transparent load-balancing mechanism is proposed.
The proposed mechanism aims to maximize throughput and potential energy savings.
The mechanism is based on a particular case of the Knapsack problem.
All experiments were carried out on a heterogeneous platform.
Well-known HPC benchmarks, FT and miniFE, were used for the evaluations.
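The load imbalance metric that predicts the number of freeable CPUs can be any standard percent-imbalance measure; one common definition (assumed here for illustration, the paper's exact formula may differ) compares the most loaded task with the average:

```python
def imbalance(useful_times):
    """Percent imbalance: max/mean - 1.  Zero when perfectly balanced;
    grows as the slowest task pulls ahead of the average, i.e. as more
    idle cycles accumulate at each synchronization point."""
    mean = sum(useful_times) / len(useful_times)
    return max(useful_times) / mean - 1.0

# Two busy ranks and two light ones: mean 7.5 s, max 10 s.
print(imbalance([10.0, 10.0, 5.0, 5.0]))
```

The larger this value, the more idle cycles the packing can concentrate, and hence the more CPUs can be freed.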


Cited By

  • (2023) AutoRS: Environment-Dependent Real-Time Scheduling for End-to-End Autonomous Driving. IEEE Transactions on Parallel and Distributed Systems 34(12), 3238–3252. https://doi.org/10.1109/TPDS.2023.3323975. Online publication date: 1-Dec-2023.


Information

Published In

Journal of Parallel and Distributed Computing, Volume 134, Issue C, Dec 2019, 237 pages

Publisher

Academic Press, Inc., United States

Author Tags

1. MPI
2. HPC
3. Oversubscription
4. Load balancing
5. Knapsack algorithm

Qualifiers

• Research-article