Disclosure of Invention
The object of the present invention is to provide a management and scheduling method for the Ascend processor based on the SLURM job scheduling system. The invention enables SLURM to schedule the Ascend processor so that it can be widely applied in the high-performance computing field, promoting the progress of the industry and economic development.
The technical scheme of the invention is as follows: the Ascend processor management and scheduling method based on the SLURM job scheduling system treats the Ascend processor as an NPU, analogous to a GPU, and manages and schedules the NPU through the GRES plug-in of SLURM, thereby managing and scheduling the Ascend processor; the method comprises the following specific steps:
A. Add an NPU plug-in module: acquire the hardware information of the NPU through an interface;
B. Add a job function for applying for NPU resources: apply for NPU resources through the salloc, srun or sbatch commands;
C. Add an NPU module to the GRES plug-in: add an NPU module in the GRES plug-in to allocate and manage NPU resources;
D. Recompile the SLURM source code: add a compile option for the NPU module and recompile SLURM;
E. Modify the SLURM configuration files: modify the SLURM configuration files to support the NPU;
F. Start the SLURM services to manage and schedule the Ascend processor.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the details of the NPU plug-in module added in step A are as follows:
The hardware information of the NPU is set or acquired through DSMI interface functions, and includes at least the number of NPU chips and the chip model. The interfaces provided by the Ascend processor for acquiring hardware information include ADMI, DCMI and DSMI, but the interface functions provided by ADMI and DCMI cannot supply the information required for SLURM scheduling, so the DSMI interface is used. As the Ascend processor ecosystem evolves, the SLURM scheduling requirements may also be satisfied by other interfaces, not only the DSMI interface.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the NPU module is added to the GRES plug-in in step C with the following specific contents:
A folder named npu is added under the src/plugins/gres folder; the gres_npu.c file in this folder implements the initialization, environment variable setting, job information acquisition, NPU resource list acquisition, job parameter setting and other functions of the NPU module in the GRES plug-in, thereby completing the addition of the NPU module to the GRES plug-in.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the recompilation of the SLURM source code in step D includes the following specific contents:
d1. Add the --with-dsmi compile option: when the --with-dsmi parameter is specified at compile time, the library files on which the DSMI interface depends are located and the added NPU-related code files are compiled;
d2. Add an x_ac_dsmi.m4 file in the auxdir folder of the SLURM root directory, which specifies the library files on which the DSMI interface depends;
d3. Add support for the NPU module in the src/plugins/gres/Makefile.am file of the GRES plug-in;
d4. Add support for the NPU module in the Makefile.am file in the src/plugins folder under the root directory;
d5. Add the Makefiles introduced for the NPU module to the configuration file under the root directory;
d6. Recompile the modified SLURM code.
In the aforementioned method for managing and scheduling the Ascend processor based on the SLURM job scheduling system, the specific contents of modifying the SLURM configuration files in step E are as follows:
e1, set "GresTypes npu" in slurm. conf;
e2, setting the number of NPU resources of the NPU node in slarm.conf;
e3, in GRES configuration file GRES. conf, specifying the node with NPU resources, and the device file of the node NPU device;
e4, adding relationship devices to cgroup. conf file to make SLURM able to schedule resources in GRES units instead of in nodes;
In the aforementioned method for managing and scheduling the Ascend processor based on the SLURM job scheduling system, if the cluster also contains GPU resources, item e1 may be set to "GresTypes=npu,gpu", indicating that the NPU and the GPU are supported simultaneously.
Compared with the prior art, the invention schedules the Ascend processor through SLURM as a GRES generic resource, combining the Ascend processor with the job scheduling system of a high-performance cluster for the first time, so that the Ascend processor can be quickly applied to cross-node, very-large-scale computing scenarios. This widens the application scenarios of the Ascend processor, enriches the resource categories of the high-performance cluster, increases the computing power of the high-performance cluster, and saves job computation time.
Detailed Description
The invention is further illustrated by the following figures and examples, which are not to be construed as limiting the invention.
Example. An Ascend processor management and scheduling method based on the SLURM job scheduling system, as shown in FIG. 1, treats the Ascend processor as an NPU, analogous to a GPU, and manages and schedules the NPU through the GRES plug-in of SLURM to realize the management and scheduling of the Ascend processor; the method comprises the following specific steps:
A. Add an NPU plug-in module: acquire the hardware information of the NPU through an interface;
B. Add a job function for applying for NPU resources: apply for NPU resources through the salloc, srun or sbatch commands;
C. Add an NPU module to the GRES plug-in: add an NPU module in the GRES plug-in to allocate and manage NPU resources;
D. Recompile the SLURM source code: add a compile option for the NPU module and recompile SLURM;
E. Modify the SLURM configuration files: modify the SLURM configuration files to support the NPU;
F. Start the SLURM services to manage and schedule the Ascend processor.
The specific contents of the NPU plug-in module added in step A are as follows:
The hardware information of the NPU is set or acquired through DSMI interface functions, and includes at least the number of NPU chips and the chip model.
The DSMI interface functions mainly used when adding the NPU plug-in module are shown in the following table:
TABLE 1 DSMI interfaces

| DSMI interface | Interface description |
| dsmi_get_version | Obtain the interface version |
| dsmi_get_chip_info | Obtain chip information |
| dsmi_get_device_count | Obtain the number of chips |
| dsmi_get_memory_info | Obtain memory information |
| dsmi_get_device_frequency | Obtain the chip frequency |
| dsmi_get_phyid_from_logicid | Convert a logical ID to a physical ID |
| dsmi_get_logicid_from_phyid | Convert a physical ID to a logical ID |
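A minimal sketch of how the NPU plug-in module might call these DSMI interfaces to enumerate the chips on a node is shown below; the header name, the exact function signatures and the structure fields are assumptions based on the interface names above, not a verbatim copy of the driver headers.

#include <stdio.h>
#include "dsmi_common_interface.h"   /* assumed name of the DSMI header */

/* Illustrative only: count the NPU chips and print basic information. */
static int list_npu_chips(void)
{
    int count = 0;

    if (dsmi_get_device_count(&count) != 0)      /* obtain the number of chips */
        return -1;
    for (int i = 0; i < count; i++) {
        struct dsmi_chip_info_stru info;         /* structure/field names assumed */

        if (dsmi_get_chip_info(i, &info) == 0)   /* obtain chip information */
            printf("NPU %d: %s %s\n", i,
                   (const char *) info.chip_type,
                   (const char *) info.chip_name);
    }
    return count;
}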
The specific implementation is as follows: a new folder named npu is added under the src/plugins directory of SLURM, and the internal directory file structure of the npu folder is as follows:
npu_generic.c is the file that specifically adds the NPU module; Makefile.am and Makefile.in are build auxiliary files, mainly used to tell the SLURM build system how to compile npu_generic.c. The main role of npu_generic.c is to acquire the hardware information of the NPU and pass it to the SLURM scheduling system. For example, the function _get_system_npu_list_dsmi in this file is defined as:
static List _get_system_npu_list_dsmi(node_config_load_t *node_config) {}
The role of this function is as follows: if NPU resources are detected on the node, it returns a list describing the NPU resources, acquiring NPU resource information including the NPU driver version, NPU chip information, the number of NPU chips, and so on.
To do so, the function calls the DSMI interface functions dsmi_get_version, dsmi_get_chip_info and dsmi_get_device_count from Table 1.
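As an illustration, a simplified version of this detection logic could look like the sketch below. It lives in npu_generic.c and reuses SLURM's list and GRES helpers; the argument order of add_gres_to_list and the fields of node_config_load_t and dsmi_chip_info_stru are assumptions patterned on SLURM's existing GRES plug-ins and the DSMI names in Table 1, not the exact implementation.

static List _get_system_npu_list_dsmi(node_config_load_t *node_config)
{
    int dev_count = 0;
    List npu_list;

    /* Ask the Ascend driver, through DSMI, how many NPU chips this node has. */
    if ((dsmi_get_device_count(&dev_count) != 0) || (dev_count == 0))
        return NULL;    /* no NPU resources detected on this node */

    npu_list = list_create(destroy_gres_slurmd_conf);
    for (int i = 0; i < dev_count; i++) {
        char device_file[64];
        struct dsmi_chip_info_stru info;    /* structure name assumed */

        dsmi_get_chip_info(i, &info);       /* chip model, version, etc. */
        snprintf(device_file, sizeof(device_file), "/dev/davinci%d", i);

        /* Record one NPU card as a GRES device entry for this node
         * (argument order assumed; see SLURM's gres/gpu plug-in). */
        add_gres_to_list(npu_list, "npu", 1, node_config->cpu_cnt,
                         NULL, NULL, device_file, NULL, NULL);
    }
    return npu_list;
}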
In addition, the npu_generic.c file contains the following functions:
TABLE 2 Main functions in the npu_generic.c file

| npu_generic.c function | Description |
| init | NPU plug-in initialization |
| fini | NPU plug-in termination |
| _dsmi_get_mem_freqs | Obtain NPU chip memory frequency information |
| _dsmi_get_gfx_freqs | Obtain NPU chip frequency information |
| dsmiDeviceGetName | Obtain NPU chip information |
| dsmiDeviceGetMinorNumber | Obtain the minor number of the device |
| dsmiSystemGetDSMIVersion | Obtain the DSMI interface version |
| npu_p_get_system_npu_list | Detect the NPU resource information of a node |
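For completeness, an NPU plug-in of this kind also carries the standard SLURM plug-in declarations alongside the init and fini hooks listed in Table 2; the sketch below shows their general shape (the string values and plug-in type are assumptions chosen for illustration).

/* Standard SLURM plug-in identification (values assumed for illustration). */
const char plugin_name[] = "NPU DSMI plugin";
const char plugin_type[] = "npu/generic";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;

extern int init(void)
{
    debug("%s: %s loaded", plugin_type, __func__);    /* NPU plug-in initialization */
    return SLURM_SUCCESS;
}

extern int fini(void)
{
    debug("%s: %s unloaded", plugin_type, __func__);  /* NPU plug-in termination */
    return SLURM_SUCCESS;
}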
Step B adds the job function for applying for NPU resources. After the NPU is scheduled by salloc as a GRES generic resource, a user applying for NPU resources adds the parameter --gres=npu:4 to the command to apply for 4 NPU cards; the scheduling system then selects an eligible server node with 4 idle Ascend processors, allocates them to the job, opens the usage rights of the 4 Ascend processors to the user, grants the user the right to log in to the server node and use the resources, and then runs the job.
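For example, under this scheme NPU resources can be requested with commands of the following form (the batch script name job.sh is a placeholder):

salloc -N 1 --gres=npu:4      # interactively allocate one node with 4 NPU cards
srun --gres=npu:1 hostname    # run a command with 1 NPU card
sbatch --gres=npu:2 job.sh    # submit a batch script requesting 2 NPU cards per node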
the NPU related parameters in the salloc instruction are shown in the following table:
TABLE 3 NPU-related parameters of salloc instruction
Taking the salloc instruction as an example, the specific implementation of adding the application for NPU resources to the salloc instruction is as follows:
b1. Add instruction parameters for NPU support in the definition of the slurm_opt_t structure in src/common/slurm_opt.h;
b2. Add the corresponding code in the _fill_job_desc_from_opts function in the src/salloc/salloc.c file; a simplified sketch of both fragments follows.
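In the sketch, the member name npus_per_node is a hypothetical placeholder chosen for illustration, not the actual patch.

/* b1 (sketch): hypothetical NPU-related member added inside the existing
 * slurm_opt_t definition in src/common/slurm_opt.h: */
char *npus_per_node;    /* value of the user's NPU request, e.g. "4" */

/* b2 (sketch): hypothetical fragment added to _fill_job_desc_from_opts()
 * in src/salloc/salloc.c, forwarding the request to the job descriptor: */
if (opt.npus_per_node)
    xstrfmtcat(desc->tres_per_node, "gres:npu:%s", opt.npus_per_node);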
Step C adds the NPU module to the GRES plug-in, with the following specific contents:
A folder named npu is added under the src/plugins/gres folder, and the internal directory file structure of the npu folder is as follows:
gres_npu.c is the file that adds the NPU module to the GRES plug-in; Makefile.am and Makefile.in are both build auxiliary files, mainly used to tell the SLURM build system how to compile the gres_npu.c file.
The gres_npu.c file also contains the following functions:
TABLE 4 Main functions in the gres_npu.c file
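As one representative example of the functions in gres_npu.c, the sketch below shows how an environment hook might expose the NPU cards allocated to a job; the helper name, its signature and the environment variable ASCEND_VISIBLE_DEVICES are assumptions for illustration.

/* Sketch: build a device index list such as "0,1" from the allocated GRES
 * bitmap and export it to the job's environment. */
static void _set_npu_env(char ***job_env_ptr, bitstr_t *npu_bit_alloc)
{
    char *dev_list = NULL;

    for (int i = 0; i < (int) bit_size(npu_bit_alloc); i++) {
        if (bit_test(npu_bit_alloc, i))
            xstrfmtcat(dev_list, "%s%d", dev_list ? "," : "", i);
    }
    if (dev_list) {
        env_array_overwrite(job_env_ptr, "ASCEND_VISIBLE_DEVICES", dev_list);
        xfree(dev_list);
    }
}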
The recompilation of the SLURM source code described in step D has the following specific contents:
d1. Add the --with-dsmi compile option: when the --with-dsmi parameter is specified at compile time, the library files on which the DSMI interface depends are located and the added NPU-related code files are compiled;
d2. Add an x_ac_dsmi.m4 file in the auxdir folder of the SLURM root directory, which specifies the library files on which the DSMI interface depends;
d3. Add support for the NPU module in the src/plugins/gres/Makefile.am file of the GRES plug-in;
d4. Add support for the NPU module in the Makefile.am file in the src/plugins folder under the root directory;
d5. Add the Makefiles introduced for the NPU module to the configuration file under the root directory;
d6. Recompile the modified SLURM code.
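After these modifications, a rebuild might look like the following; the Ascend driver install path and the exact form of the --with-dsmi option are illustrative assumptions.

autoreconf -i                                      # regenerate configure/Makefiles after the edits above
./configure --with-dsmi=/usr/local/Ascend/driver   # locate the DSMI library (path assumed)
make && make install                               # rebuild and install SLURM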
The specific contents of modifying the SLURM configuration files described in step E are as follows:
e1. Set "GresTypes=npu" in slurm.conf; if the cluster also contains GPU resources, set "GresTypes=npu,gpu" to indicate simultaneous support for the NPU and the GPU;
e2. Set the number of NPU resources of the NPU nodes in slurm.conf;
e3. In the GRES configuration file gres.conf, specify the nodes with NPU resources and the device files of the NPU devices on those nodes;
e4. Add the following to the cgroup.conf file:
ConstrainCores=yes;
ConstrainRAMSpace=yes;
ConstrainDevices=yes;
The ConstrainDevices setting enables jobs to be scheduled by GRES, that is, resources can be allocated in units of NPU cards; for example, a node with 8 NPU cards can simultaneously run 8 tasks that each apply for one NPU card.
For an Atlas 800 server with 8 NPU cards, the node is configured in slurm.conf as follows: NodeName=huawei CPUs=192 Gres=npu:8 ThreadsPerCore=1 RealMemory=785000;
The gres.conf file records the following: NodeName=huawei Name=npu File=/dev/davinci[0-7].
Step F starts the SLURM services, taking CentOS 7 as an example:
Start the slurmctld service on the management node: systemctl start slurmctld;
Start the slurmd service on all SLURM management and compute nodes: systemctl start slurmd.
After SLURM and the Ascend processor (NPU) are integrated, the Ascend server can schedule Ascend processor resources through SLURM. When a user submits a job applying for NPU resources, the job queues until NPU resources are allocated and then runs. In addition, the user can query the cluster state, query node resources, query partition resources, submit jobs, and query job status.
The following are examples of functions implemented by the present invention:
1) Querying cluster status
The sinfo instruction can check the states of all nodes in the whole cluster, including the states of the CPU, GPU and NPU nodes. The following figure shows that there is 1 node in the huawei partition, the node is named huawei, its state is idle, and a job submitted to this partition by a user can run immediately.
2) Querying node resources
The scontrol show node instruction can check the resource situation of a node and its current operating state. The following figure shows that the huawei node has 192 CPU cores, 8 NPU cards and 785000 MB of memory; the current state of the node is idle and no job is running on it.
3) Querying partition resources
The scontrol show partition instruction can check the state of a partition. The following figure shows that the huawei partition has only one node, all accounts are allowed to submit jobs to the partition, the partition provides 192 CPU cores, and so on.
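With the node and partition names used in this example, the queries in 1) to 3) take the following form:

sinfo                            # cluster and partition states
scontrol show node huawei        # resources and state of the huawei node
scontrol show partition huawei   # configuration of the huawei partition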
4) Submitting a job
The following figure demonstrates an srun command that submits a job: the command applies for one node and 1 NPU card and runs a simple hostname command; the job runs successfully and outputs the hostname of the node on which it ran.
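Such a command might take, for example, the following form (the partition name is the one used in this example):

srun -p huawei -N 1 --gres=npu:1 hostname    # one node, 1 NPU card, run hostname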
5) Querying job status
The squeue command can check the job status. The following figure shows a job with job number 158, submitted by the user huawei; its current status is running and its job name is bash.
The scontrol show job command can view the detailed information of a job. The command in the figure shows that the job with job number 158 runs on the huawei node, the job was submitted at 19:39:33 on January 6, 2021, the QOS used by the job is normal, and so on.
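For job 158 in this example, these status queries take the form:

squeue -j 158            # current state of job 158
scontrol show job 158    # detailed information of job 158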
6) Viewing job records
The sacct instruction can obtain accounting data to view the resource usage of running or completed jobs. The instruction in the figure shows that job 158 ran on the node huawei, used 1 CPU core, and used 1 NPU card.
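The corresponding accounting query for this example could be issued as follows (the format fields shown are standard sacct fields chosen for illustration):

sacct -j 158 --format=JobID,NodeList,AllocCPUS,AllocTRES    # resource usage record of job 158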
Analysis of job 158 shows that it ran on the huawei node, which has 8 NPU cards, but the job applied for only 1 NPU card. This indicates that the invention enables SLURM to schedule a single NPU node by NPU card rather than by whole node, so a node with 8 NPU cards can run at most 8 NPU job tasks simultaneously, which greatly improves the utilization of NPU resources and meets user requirements for jobs needing 1, 2, 4 or more NPU cards. This fine-grained scheduling mode improves resource utilization, supports the running of small tasks that require NPU resources, and improves the job throughput of the whole cluster.
Tests in a number of high-performance computing clusters have verified that the method can normally perform the series of operations of querying the cluster state, querying node resources, querying partition resources, submitting jobs, querying job status and so on, and all of these operations return results within 3 seconds.
Checking the cluster state, querying node resources, querying partition resources, submitting jobs and similar instructions all require the SLURM scheduler to obtain the resource state of the nodes (including the NPU nodes). SLURM checks the node state periodically; if a node fails, it is removed from the available (idle) queue and its state is set to unavailable (down). These operations embody the management of NPU node resources by the SLURM scheduler.
When a job applying for NPU resources is submitted, SLURM needs to select the most suitable node for the job from among the NPU nodes and set up the environment of the job and the node so that the job can run successfully on the corresponding NPU node, which embodies the scheduling of NPU resources by the SLURM scheduler.
Therefore, all the commands required for the series of high-performance cluster management and scheduling operations, such as querying the cluster state, querying node resources, querying partition resources, submitting jobs and querying job status on a high-performance cluster containing NPU resources, are realized, and the management and scheduling of the Ascend processor (NPU) by SLURM is thereby also realized.