Abstract
A job management system is a critical component of a production supercomputing environment, permitting oversubscribed resources to be shared fairly and efficiently. Job management systems that were originally designed for traditional vector supercomputers are not appropriate for the distributed-memory parallel supercomputers that are becoming increasingly important in the high performance computing industry. Newer job management systems offer new functionality but do not solve fundamental problems. We address some of the main issues in resource allocation and job scheduling that we have encountered on two parallel computers: a 160-node IBM SP2 and a cluster of 20 high performance workstations located at the Numerical Aerodynamic Simulation facility. We describe the requirements for resource allocation and job management that are necessary to provide a production supercomputing environment on these machines, prioritizing according to difficulty and importance, and advocating a return to fundamental issues.
Copyright information
© 1995 Springer-Verlag Berlin Heidelberg
Cite this paper
Saphir, W., Tanner, L.A., Traversat, B. (1995). Job management requirements for NAS parallel systems and clusters. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1995. Lecture Notes in Computer Science, vol 949. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60153-8_37
DOI: https://doi.org/10.1007/3-540-60153-8_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60153-1
Online ISBN: 978-3-540-49459-1
eBook Packages: Springer Book Archive