Disclosure of Invention
The object of the present invention is to provide a management and scheduling method for the Ascend processor based on the SLURM job scheduling system. The invention enables SLURM to schedule the Ascend processor so that it can be widely applied in the high-performance computing field, promoting the progress of the industry and economic development.
The technical scheme of the invention is as follows: the Ascend processor management and scheduling method based on the SLURM job scheduling system treats the Ascend processor as an NPU, analogous to a GPU, and manages and schedules the NPU through the GRES plug-in of SLURM, thereby managing and scheduling the Ascend processor; the method comprises the following specific steps:
A. Add an NPU plug-in module: acquire the hardware information of the NPU through an interface;
B. Add a job function for applying for NPU resources: apply for NPU resources through the salloc, srun or sbatch commands;
C. Add an NPU module to the GRES plug-in: add an NPU module in the GRES plug-in to allocate and manage NPU resources;
D. Recompile the SLURM source code: add a compile option for the NPU module and recompile SLURM;
E. Modify the SLURM configuration files: modify the SLURM configuration files to support the NPU;
F. Start the SLURM services to manage and schedule the Ascend processor.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the details of the NPU plug-in module added in step A are as follows:
The hardware information of the NPU is set or acquired through DSMI interface functions, and includes at least the number of NPU chips and the chip model. The interfaces provided by the Ascend processor for acquiring hardware information include ADMI, DCMI and DSMI, but the interface functions provided by ADMI and DCMI cannot supply the information required for SLURM scheduling, so the DSMI interface is used. As the Ascend processor ecosystem evolves, the SLURM scheduling requirements may also be satisfied by other interfaces, not only the DSMI interface.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the NPU module is added to the GRES plug-in in step C with the following specific contents:
A folder named npu is added under the src/plugins/gres folder; the gres_npu.c file in this folder implements the initialization, environment variable setting, job information acquisition, NPU resource list acquisition, job parameter setting and other functions of the NPU module in the GRES plug-in, thereby completing the addition of the NPU module to the GRES plug-in.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the recompilation of the SLURM source code in step D includes the following specific contents:
d1. Add the --with-dsmi compile option: when the --with-dsmi parameter is specified at compile time, the library files on which the DSMI interface depends are located and the added NPU-related code files are compiled;
d2. Add an x_ac_dsmi.m4 file in the auxdir folder of the SLURM root directory, which specifies the library files on which the DSMI interface depends;
d3. Add support for the NPU module in the src/plugins/gres/Makefile.am file of the GRES plug-in;
d4. Add support for the NPU module in the Makefile.am file in the src/plugins folder under the root directory;
d5. Add the Makefiles introduced for the NPU module to the configuration file under the root directory;
d6. Recompile the modified SLURM code.
In the aforementioned method for managing and scheduling the Ascend processor based on the SLURM job scheduling system, the specific contents of modifying the SLURM configuration files in step E are as follows:
e1, set "GresTypes npu" in slurm. conf;
e2, setting the number of NPU resources of the NPU node in slarm.conf;
e3, in GRES configuration file GRES. conf, specifying the node with NPU resources, and the device file of the node NPU device;
e4, adding relationship devices to cgroup. conf file to make SLURM able to schedule resources in GRES units instead of in nodes;
In the aforementioned method for managing and scheduling the Ascend processor based on the SLURM job scheduling system, if the cluster also contains GPU resources, item e1 may be set to "GresTypes=npu,gpu", indicating that the NPU and the GPU are supported simultaneously.
Compared with the prior art, the invention schedules the Ascend processor through SLURM as a GRES generic resource, combining the Ascend processor with the job scheduling system of a high-performance cluster for the first time, so that the Ascend processor can be quickly applied to cross-node, very-large-scale computing scenarios. This widens the application scenarios of the Ascend processor, enriches the resource categories of the high-performance cluster, increases the computing power of the high-performance cluster, and saves job computation time.
Detailed Description
The invention is further illustrated by the following figures and examples, which are not to be construed as limiting the invention.
Example. An Ascend processor management and scheduling method based on the SLURM job scheduling system, as shown in FIG. 1, treats the Ascend processor as an NPU, analogous to a GPU, and manages and schedules the NPU through the GRES plug-in of SLURM to realize the management and scheduling of the Ascend processor; the method comprises the following specific steps:
A. Add an NPU plug-in module: acquire the hardware information of the NPU through an interface;
B. Add a job function for applying for NPU resources: apply for NPU resources through the salloc, srun or sbatch commands;
C. Add an NPU module to the GRES plug-in: add an NPU module in the GRES plug-in to allocate and manage NPU resources;
D. Recompile the SLURM source code: add a compile option for the NPU module and recompile SLURM;
E. Modify the SLURM configuration files: modify the SLURM configuration files to support the NPU;
F. Start the SLURM services to manage and schedule the Ascend processor.
The specific contents of the NPU plug-in module added in step A are as follows:
The hardware information of the NPU is set or acquired through DSMI interface functions, and includes at least the number of NPU chips and the chip model.
The DSMI interface functions mainly used when adding the NPU plug-in module are shown in the following table:
TABLE 1 DSMI interfaces

| DSMI interface | Interface description |
| dsmi_get_version | Obtain the interface version |
| dsmi_get_chip_info | Obtain chip information |
| dsmi_get_device_count | Obtain the number of chips |
| dsmi_get_memory_info | Obtain memory information |
| dsmi_get_device_frequency | Obtain the chip frequency |
| dsmi_get_phyid_from_logicid | Convert a logical ID to a physical ID |
| dsmi_get_logicid_from_phyid | Convert a physical ID to a logical ID |
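A minimal sketch of how the NPU plug-in module might call these DSMI interfaces to enumerate the chips on a node is shown below; the header name, the exact function signatures and the structure fields are assumptions based on the interface names above, not a verbatim copy of the driver headers.

#include <stdio.h>
#include "dsmi_common_interface.h"   /* assumed name of the DSMI header */

/* Illustrative only: count the NPU chips and print basic information. */
static int list_npu_chips(void)
{
    int count = 0;

    if (dsmi_get_device_count(&count) != 0)      /* obtain the number of chips */
        return -1;
    for (int i = 0; i < count; i++) {
        struct dsmi_chip_info_stru info;         /* structure/field names assumed */

        if (dsmi_get_chip_info(i, &info) == 0)   /* obtain chip information */
            printf("NPU %d: %s %s\n", i,
                   (const char *) info.chip_type,
                   (const char *) info.chip_name);
    }
    return count;
}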
The specific implementation is as follows: a new folder named npu is added under the src/plugins directory of SLURM, and the internal directory file structure of the npu folder is as follows:
npu_generic.c is the file that specifically adds the NPU module; Makefile.am and Makefile.in are build auxiliary files, mainly used to tell the SLURM build system how to compile npu_generic.c. The main role of npu_generic.c is to acquire the hardware information of the NPU and pass it to the SLURM scheduling system. For example, the function _get_system_npu_list_dsmi in this file is defined as:
static List _get_system_npu_list_dsmi(node_config_load_t *node_config) {}
The role of this function is as follows: if NPU resources are detected on the node, it returns a list describing the NPU resources, acquiring NPU resource information including the NPU driver version, NPU chip information, the number of NPU chips, and so on.
To do so, the function calls the DSMI interface functions dsmi_get_version, dsmi_get_chip_info and dsmi_get_device_count from Table 1.
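As an illustration, a simplified version of this detection logic could look like the sketch below. It lives in npu_generic.c and reuses SLURM's list and GRES helpers; the argument order of add_gres_to_list and the fields of node_config_load_t and dsmi_chip_info_stru are assumptions patterned on SLURM's existing GRES plug-ins and the DSMI names in Table 1, not the exact implementation.

static List _get_system_npu_list_dsmi(node_config_load_t *node_config)
{
    int dev_count = 0;
    List npu_list;

    /* Ask the Ascend driver, through DSMI, how many NPU chips this node has. */
    if ((dsmi_get_device_count(&dev_count) != 0) || (dev_count == 0))
        return NULL;    /* no NPU resources detected on this node */

    npu_list = list_create(destroy_gres_slurmd_conf);
    for (int i = 0; i < dev_count; i++) {
        char device_file[64];
        struct dsmi_chip_info_stru info;    /* structure name assumed */

        dsmi_get_chip_info(i, &info);       /* chip model, version, etc. */
        snprintf(device_file, sizeof(device_file), "/dev/davinci%d", i);

        /* Record one NPU card as a GRES device entry for this node
         * (argument order assumed; see SLURM's gres/gpu plug-in). */
        add_gres_to_list(npu_list, "npu", 1, node_config->cpu_cnt,
                         NULL, NULL, device_file, NULL, NULL);
    }
    return npu_list;
}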
In addition, the npu_generic.c file contains the following functions:
TABLE 2 Main functions in the npu_generic.c file

| npu_generic.c function | Description |
| init | NPU plug-in initialization |
| fini | NPU plug-in termination |
| _dsmi_get_mem_freqs | Obtain NPU chip memory frequency information |
| _dsmi_get_gfx_freqs | Obtain NPU chip frequency information |
| dsmiDeviceGetName | Obtain NPU chip information |
| dsmiDeviceGetMinorNumber | Obtain the minor number of the device |
| dsmiSystemGetDSMIVersion | Obtain the DSMI interface version |
| npu_p_get_system_npu_list | Detect the NPU resource information of a node |
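For completeness, an NPU plug-in of this kind also carries the standard SLURM plug-in declarations alongside the init and fini hooks listed in Table 2; the sketch below shows their general shape (the string values and plug-in type are assumptions chosen for illustration).

/* Standard SLURM plug-in identification (values assumed for illustration). */
const char plugin_name[] = "NPU DSMI plugin";
const char plugin_type[] = "npu/generic";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;

extern int init(void)
{
    debug("%s: %s loaded", plugin_type, __func__);    /* NPU plug-in initialization */
    return SLURM_SUCCESS;
}

extern int fini(void)
{
    debug("%s: %s unloaded", plugin_type, __func__);  /* NPU plug-in termination */
    return SLURM_SUCCESS;
}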
Step B adds the job function for applying for NPU resources. After the NPU is scheduled by salloc as a GRES generic resource, a user applying for NPU resources adds the parameter --gres=npu:4 to the command to apply for 4 NPU cards; the scheduling system then selects an eligible server node with 4 idle Ascend processors, allocates them to the job, opens the usage rights of the 4 Ascend processors to the user, grants the user the right to log in to the server node and use the resources, and then runs the job.
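For example, under this scheme NPU resources can be requested with commands of the following form (the batch script name job.sh is a placeholder):

salloc -N 1 --gres=npu:4      # interactively allocate one node with 4 NPU cards
srun --gres=npu:1 hostname    # run a command with 1 NPU card
sbatch --gres=npu:2 job.sh    # submit a batch script requesting 2 NPU cards per node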
the NPU related parameters in the salloc instruction are shown in the following table:
TABLE 3 NPU-related parameters of salloc instruction
Taking the salloc instruction as an example, the specific implementation of adding the application for NPU resources to the salloc instruction is as follows:
b1. Add instruction parameters for NPU support in the definition of the slurm_opt_t structure in src/common/slurm_opt.h;
b2. Add the corresponding code in the _fill_job_desc_from_opts function in the src/salloc/salloc.c file; a simplified sketch of both fragments follows.
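In the sketch, the member name npus_per_node is a hypothetical placeholder chosen for illustration, not the actual patch.

/* b1 (sketch): hypothetical NPU-related member added inside the existing
 * slurm_opt_t definition in src/common/slurm_opt.h: */
char *npus_per_node;    /* value of the user's NPU request, e.g. "4" */

/* b2 (sketch): hypothetical fragment added to _fill_job_desc_from_opts()
 * in src/salloc/salloc.c, forwarding the request to the job descriptor: */
if (opt.npus_per_node)
    xstrfmtcat(desc->tres_per_node, "gres:npu:%s", opt.npus_per_node);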
Step C adds the NPU module to the GRES plug-in, with the following specific contents:
A folder named npu is added under the src/plugins/gres folder, and the internal directory file structure of the npu folder is as follows:
gres_npu.c is the file that adds the NPU module to the GRES plug-in; Makefile.am and Makefile.in are both build auxiliary files, mainly used to tell the SLURM build system how to compile the gres_npu.c file.
The gres_npu.c file also contains the following functions:
TABLE 4 Main functions in the gres_npu.c file
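As one representative example of the functions in gres_npu.c, the sketch below shows how an environment hook might expose the NPU cards allocated to a job; the helper name, its signature and the environment variable ASCEND_VISIBLE_DEVICES are assumptions for illustration.

/* Sketch: build a device index list such as "0,1" from the allocated GRES
 * bitmap and export it to the job's environment. */
static void _set_npu_env(char ***job_env_ptr, bitstr_t *npu_bit_alloc)
{
    char *dev_list = NULL;

    for (int i = 0; i < (int) bit_size(npu_bit_alloc); i++) {
        if (bit_test(npu_bit_alloc, i))
            xstrfmtcat(dev_list, "%s%d", dev_list ? "," : "", i);
    }
    if (dev_list) {
        env_array_overwrite(job_env_ptr, "ASCEND_VISIBLE_DEVICES", dev_list);
        xfree(dev_list);
    }
}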
The recompilation of the SLURM source code described in step D has the following specific contents:
d1. Add the --with-dsmi compile option: when the --with-dsmi parameter is specified at compile time, the library files on which the DSMI interface depends are located and the added NPU-related code files are compiled;
d2. Add an x_ac_dsmi.m4 file in the auxdir folder of the SLURM root directory, which specifies the library files on which the DSMI interface depends;
d3. Add support for the NPU module in the src/plugins/gres/Makefile.am file of the GRES plug-in;
d4. Add support for the NPU module in the Makefile.am file in the src/plugins folder under the root directory;
d5. Add the Makefiles introduced for the NPU module to the configuration file under the root directory;
d6. Recompile the modified SLURM code.
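After these modifications, a rebuild might look like the following; the Ascend driver install path and the exact form of the --with-dsmi option are illustrative assumptions.

autoreconf -i                                      # regenerate configure/Makefiles after the edits above
./configure --with-dsmi=/usr/local/Ascend/driver   # locate the DSMI library (path assumed)
make && make install                               # rebuild and install SLURM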
The specific contents of modifying the SLURM configuration files described in step E are as follows:
e1. Set "GresTypes=npu" in slurm.conf; if the cluster also contains GPU resources, set "GresTypes=npu,gpu" to indicate simultaneous support for the NPU and the GPU;
e2. Set the number of NPU resources of the NPU nodes in slurm.conf;
e3. In the GRES configuration file gres.conf, specify the nodes with NPU resources and the device files of the NPU devices on those nodes;
e4. Add the following to the cgroup.conf file:
ConstrainCores=yes;
ConstrainRAMSpace=yes;
ConstrainDevices=yes;
The ConstrainDevices setting enables jobs to be scheduled by GRES, that is, resources can be allocated in units of NPU cards; for example, a node with 8 NPU cards can simultaneously run 8 tasks that each apply for one NPU card.
For an Atlas 800 server with 8 NPU cards, the node is configured in slurm.conf as follows: NodeName=huawei CPUs=192 Gres=npu:8 ThreadsPerCore=1 RealMemory=785000;
The gres.conf file records the following: NodeName=huawei Name=npu File=/dev/davinci[0-7].
Step F starts the SLURM services, taking CentOS 7 as an example:
Start the slurmctld service on the management node: systemctl start slurmctld;
Start the slurmd service on all SLURM management and compute nodes: systemctl start slurmd.
After SLURM and the Ascend processor (NPU) are integrated, the Ascend server can schedule Ascend processor resources through SLURM. When a user submits a job applying for NPU resources, the job queues until NPU resources are allocated and then runs. In addition, the user can query the cluster state, query node resources, query partition resources, submit jobs, and query job status.
The following are examples of functions implemented by the present invention:
1) Querying cluster status
The sinfo instruction can check the states of all nodes in the whole cluster, including the states of the CPU, GPU and NPU nodes. The following figure shows that there is 1 node in the huawei partition, the node is named huawei, its state is idle, and a job submitted to this partition by a user can run immediately.
2) Querying node resources
The scontrol show node instruction can check the resource situation of a node and its current operating state. The following figure shows that the huawei node has 192 CPU cores, 8 NPU cards and 785000 MB of memory; the current state of the node is idle and no job is running on it.
3) Querying partition resources
The scontrol show partition instruction can check the state of a partition. The following figure shows that the huawei partition has only one node, all accounts are allowed to submit jobs to the partition, the partition provides 192 CPU cores, and so on.
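With the node and partition names used in this example, the queries in 1) to 3) take the following form:

sinfo                            # cluster and partition states
scontrol show node huawei        # resources and state of the huawei node
scontrol show partition huawei   # configuration of the huawei partition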
4) Submitting a job
The following figure demonstrates an srun command that submits a job: the command applies for one node and 1 NPU card and runs a simple hostname command; the job runs successfully and outputs the hostname of the node on which it ran.
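Such a command might take, for example, the following form (the partition name is the one used in this example):

srun -p huawei -N 1 --gres=npu:1 hostname    # one node, 1 NPU card, run hostname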
5) Querying job status
The squeue command can check the job status. The following figure shows a job with job number 158, submitted by the user huawei; its current status is running and its job name is bash.
The scontrol show job command can view the detailed information of a job. The command in the figure shows that the job with job number 158 runs on the huawei node, the job was submitted at 19:39:33 on January 6, 2021, the QOS used by the job is normal, and so on.
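For job 158 in this example, these status queries take the form:

squeue -j 158            # current state of job 158
scontrol show job 158    # detailed information of job 158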
6) Viewing job records
The sacct instruction can obtain accounting data to view the resource usage of running or completed jobs. The instruction in the figure shows that job 158 ran on the node huawei, used 1 CPU core, and used 1 NPU card.
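The corresponding accounting query for this example could be issued as follows (the format fields shown are standard sacct fields chosen for illustration):

sacct -j 158 --format=JobID,NodeList,AllocCPUS,AllocTRES    # resource usage record of job 158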
Analysis of job 158 shows that it ran on the huawei node, which has 8 NPU cards, but the job applied for only 1 NPU card. This indicates that the invention enables SLURM to schedule a single NPU node by NPU card rather than by whole node, so a node with 8 NPU cards can run at most 8 NPU job tasks simultaneously, which greatly improves the utilization of NPU resources and meets user requirements for jobs needing 1, 2, 4 or more NPU cards. This fine-grained scheduling mode improves resource utilization, supports the running of small tasks that require NPU resources, and improves the job throughput of the whole cluster.
Tests in a number of high-performance computing clusters have verified that the method can normally perform the series of operations of querying the cluster state, querying node resources, querying partition resources, submitting jobs, querying job status and so on, and all of these operations return results within 3 seconds.
Checking the cluster state, querying node resources, querying partition resources, submitting jobs and similar instructions all require the SLURM scheduler to obtain the resource state of the nodes (including the NPU nodes). SLURM checks the node state periodically; if a node fails, it is removed from the available (idle) queue and its state is set to unavailable (down). These operations embody the management of NPU node resources by the SLURM scheduler.
When a job applying for NPU resources is submitted, SLURM needs to select the most suitable node for the job from among the NPU nodes and set up the environment of the job and the node so that the job can run successfully on the corresponding NPU node, which embodies the scheduling of NPU resources by the SLURM scheduler.
Therefore, all the commands required for the series of high-performance cluster management and scheduling operations, such as querying the cluster state, querying node resources, querying partition resources, submitting jobs and querying job status on a high-performance cluster containing NPU resources, are realized, and the management and scheduling of the Ascend processor (NPU) by SLURM is thereby also realized.