
Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

Published: 01 December 2019

Abstract

Load imbalance in parallel systems can be caused by factors external to the running applications, such as operating system noise, or by the underlying hardware, such as a heterogeneous cluster. HPC applications that work on irregular data structures can also have difficulty balancing their computations across the parallel tasks. In this article we extend, improve, and evaluate more deeply the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are loaded with only the useful work of the parallel application tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, placing tasks on a minimum number of CPUs and using oversubscription. We design and implement a more efficient version of this mechanism that performs the Task Packing “in place”, taking advantage of the idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using the FT and miniFE benchmarks. Results show that our proposal generates low overhead. In addition, the number of freed CPUs is related to a load imbalance metric, which can therefore be used as a predictor.
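The packing step can be illustrated with a minimal sketch (not the authors' implementation): suppose each task is characterized by the fraction of CPU time it spends on useful work, and a 0/1 Knapsack is solved repeatedly to fill one CPU at a time with as much useful work as fits, so that the idle cycles are concentrated and some CPUs end up empty. The scaling factor and the greedy one-CPU-at-a-time loop are assumptions for illustration.

```python
def pack_tasks(loads, capacity=1.0, scale=100):
    """Greedy Task Packing sketch: repeatedly solve a 0/1 Knapsack to
    fill one CPU with as much useful work as fits, until every task is
    placed.  `loads` are useful-work fractions of one CPU (0..1).
    Returns a list of CPUs, each a list of (task_id, load) pairs."""
    cap = int(capacity * scale)            # scale fractional loads to integers for the DP
    items = list(enumerate(loads))
    cpus = []
    while items:
        # dp[c] = best total scaled load achievable within capacity c;
        # choice[c] remembers which tasks realize it.
        dp = [0] * (cap + 1)
        choice = [[] for _ in range(cap + 1)]
        for tid, load in items:
            w = int(round(load * scale))
            for c in range(cap, w - 1, -1):
                if dp[c - w] + w > dp[c]:
                    dp[c] = dp[c - w] + w
                    choice[c] = choice[c - w] + [(tid, load)]
        packed = choice[cap] or [items[0]]  # fallback: an oversized task gets its own CPU
        picked = {tid for tid, _ in packed}
        cpus.append(packed)
        items = [t for t in items if t[0] not in picked]
    return cpus

# Four tasks whose useful-work fractions sum to 2.0 pack onto two CPUs,
# freeing the other two CPUs they would otherwise occupy.
print(pack_tasks([0.6, 0.4, 0.7, 0.3]))
```

Oversubscription enters when several packed tasks time-share one CPU: because their combined useful work still fits within the CPU's capacity, performance is not degraded at synchronization points.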

Highlights

A scalable and transparent load-balancing mechanism is proposed.
The proposed mechanism aims to maximize throughput and potential energy savings.
The mechanism is based on a particular case of the Knapsack problem.
All experiments were carried out on a heterogeneous platform.
Well-known HPC benchmarks, FT and miniFE, were used for the evaluations.
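The load imbalance metric that predicts the number of freeable CPUs can be any standard percent-imbalance measure; one common definition (assumed here for illustration, the paper's exact formula may differ) compares the most loaded task with the average:

```python
def imbalance(useful_times):
    """Percent imbalance: max/mean - 1.  Zero when perfectly balanced;
    grows as the slowest task pulls ahead of the average, i.e. as more
    idle cycles accumulate at each synchronization point."""
    mean = sum(useful_times) / len(useful_times)
    return max(useful_times) / mean - 1.0

# Two busy ranks and two light ones: mean 7.5 s, max 10 s.
print(imbalance([10.0, 10.0, 5.0, 5.0]))
```

The larger this value, the more idle cycles the packing can concentrate, and hence the more CPUs can be freed.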


Cited By

  • (2023) AutoRS: Environment-Dependent Real-Time Scheduling for End-to-End Autonomous Driving. IEEE Transactions on Parallel and Distributed Systems 34(12), 3238–3252. https://doi.org/10.1109/TPDS.2023.3323975. Online publication date: 1-Dec-2023.


Information

Published In

Journal of Parallel and Distributed Computing, Volume 134, Issue C, Dec 2019, 237 pages

Publisher

Academic Press, Inc., United States

Author Tags

1. MPI
2. HPC
3. Oversubscription
4. Load balancing
5. Knapsack algorithm

Qualifiers

• Research-article