
CN112882828A - Ascend processor management and scheduling method based on SLURM job scheduling system - Google Patents

Ascend processor management and scheduling method based on SLURM job scheduling system Download PDF

Info

Publication number
CN112882828A
CN112882828A (application CN202110096508.6A)
Authority
CN
China
Prior art keywords
npu
slurm
gres
plug
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110096508.6A
Other languages
Chinese (zh)
Other versions
CN112882828B (en)
Inventor
马银萍
樊春
杨宏辉
李若淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202110096508.6A priority Critical patent/CN112882828B/en
Publication of CN112882828A publication Critical patent/CN112882828A/en
Application granted granted Critical
Publication of CN112882828B publication Critical patent/CN112882828B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an Ascend processor management and scheduling method based on the SLURM job scheduling system. The Ascend processor is treated as an NPU, analogous to a GPU, and the GRES plug-in of SLURM is used to manage and schedule the NPU, thereby realizing the management and scheduling of the Ascend processor. The specific steps include: adding an NPU plug-in module, adding a function for jobs to request NPU resources, adding an NPU module to the GRES plug-in, recompiling the SLURM source code, modifying the SLURM configuration files, and starting the SLURM services. The invention treats the Ascend processor as a GRES generic resource scheduled by SLURM, combining the Ascend processor with the job scheduling system of a high-performance cluster for the first time, so that the Ascend processor can be quickly applied to cross-node, very large-scale computing scenarios; this broadens the application scenarios of the Ascend processor, enriches the resource categories of high-performance clusters, increases the computing power of high-performance clusters, and saves job computation time.

Description

Ascend processor management and scheduling method based on SLURM job scheduling system
Technical Field
The present invention relates to the field of processor applications, and more particularly to an Ascend processor management and scheduling method based on the SLURM job scheduling system.
Background
The Ascend processor is a new AI processor developed in China. It aims to provide a chip with higher computing power and lower energy consumption for deep learning research, development and deployment, and is the domestic AI processor with leading computing power in China. However, the Ascend processor is not yet widely used in the high-performance computing field.
At present, there is no high-performance computing job scheduling software, at home or abroad, that can schedule the Ascend processor. A high-performance computing job scheduling system therefore needs to be integrated with the Ascend processor, so as to extend the application scenarios of the Ascend processor into the high-performance computing field and further improve the ecosystem of domestic processors. High-performance computing usually relies on parallelization to run applications efficiently and quickly, typically with multiple processors or multiple servers executing the same operation in parallel. If the Ascend processor is applied to the high-performance computing field, the computing power of high-performance computing can be greatly increased, job efficiency can be improved, and the application scenarios of the Ascend processor can be expanded.
SLURM (Simple Linux Utility for Resource Management) is a highly scalable and fault-tolerant cluster manager and job scheduling system, and the most widely used one in the high-performance computing field. Its resource management module is mainly responsible for managing, allocating and collecting system resources: a master control process (slurmctld) resides on the master control node, i.e. the management node, and a monitoring process (slurmd) resides on each computing node. slurmd invokes the corresponding resource collection functions to gather local resource information. Initially, the node daemon of a computing node registers its information with the central daemon, after which the master control process (slurmctld) periodically queries the node to learn the state of the whole system. SLURM also maintains a queue of pending jobs and manages the overall resource utilization of jobs. In addition, SLURM manages available compute nodes in an exclusive manner, distributes jobs to a set of allocated nodes for execution, and monitors jobs until completion.
The GRES plug-in of SLURM can manage and schedule resources such as GPUs, Intel MIC (Many Integrated Core) devices, CUDA Multi-Process Service (MPS), and NICs.
Therefore, deeply integrating SLURM, the most widely used cluster manager and job scheduler in the high-performance field, with the Ascend processor enables SLURM to monitor and schedule the Ascend processor and improves the efficiency of its management and scheduling. Computing power is productivity, and the popularization and application of the domestic AI processor further promote industrial progress and economic development.
However, since the Ascend processor is an artificial intelligence processor released in 2018 that adopts the self-developed Da Vinci architecture, current mainstream high-performance computing scheduling systems (including SLURM) mainly support processors such as CPUs and GPUs and do not support the Ascend processor chip.
Therefore, there is currently no high-performance computing job scheduling software, at home or abroad, capable of scheduling the Ascend processor, and the Ascend processor is not widely applied in the high-performance computing field, which limits industrial progress and economic development.
Disclosure of Invention
The object of the present invention is to provide an Ascend processor management and scheduling method based on the SLURM job scheduling system. The invention can schedule the Ascend processor so that it can be widely applied in the high-performance computing field, promoting industrial progress and economic development.
The technical scheme of the invention is as follows: the Ascend processor management and scheduling method based on the SLURM job scheduling system treats the Ascend processor as an NPU similar to a GPU, and manages and schedules the NPU through the GRES plug-in of SLURM, thereby managing and scheduling the Ascend processor; the method comprises the following specific steps:
A. adding an NPU plug-in module: acquiring hardware information of the NPU through an interface;
B. adding a function for jobs to request NPU resources: requesting NPU resources through the salloc, srun or sbatch commands;
C. adding an NPU module to the GRES plug-in: adding an NPU module in the GRES plug-in to allocate and manage NPU resources;
D. recompiling the SLURM source code: adding a compile option for the NPU module and recompiling SLURM;
E. modifying the SLURM configuration files: modifying the SLURM configuration files to support the NPU;
F. starting the SLURM services to manage and schedule the Ascend processor.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the details of the NPU plug-in module added in step A are as follows:
the hardware information of the NPU is set or acquired through DSMI interface functions, the hardware information at least comprising the number of NPU chips and the chip model of the NPU; the interfaces provided by the Ascend processor for acquiring hardware information include ADMI, DCMI and DSMI, but the interface functions provided by ADMI and DCMI cannot supply the information required for SLURM scheduling, so the DSMI interface is used; as the Ascend processor ecosystem evolves, the SLURM scheduling requirements may also be satisfied by interfaces other than DSMI.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the NPU module is added to the GRES plug-in in step C; the specific contents are as follows:
a folder named npu is added under the src/plugins/gres folder, and the gres_npu.c file in that folder implements the initialization of the NPU module in the GRES plug-in, the setting of environment variables, the acquisition of job information, the acquisition of the NPU resource list, the setting of job running parameters, and so on, thereby completing the addition of the NPU module to the GRES plug-in.
In the aforementioned Ascend processor management and scheduling method based on the SLURM job scheduling system, the recompilation of the SLURM source code in step D includes the following specific contents:
d1, add a --with-dsmi option to the SLURM build configuration; when the --with-dsmi parameter is used during compilation, the library files on which the DSMI interface depends are located and the newly added NPU-related code files are compiled;
d2, adding an x_ac_dsmi.m4 file in the auxdir folder of the SLURM root directory; it is used to specify the library files on which the DSMI interface depends;
d3, adding support for the NPU module in the src/plugins/gres/Makefile.am file of the GRES plug-in;
d4, adding support for the NPU module in the Makefile.am file in the src/plugins folder under the root directory;
d5, adding support for the Makefiles added for the NPU in the configuration file under the root directory;
d6, recompiling the modified SLURM code.
In the aforementioned method for managing and scheduling the Ascend processor based on the SLURM job scheduling system, the specific content of modifying the SLURM configuration files in step E is as follows:
e1, set "GresTypes=npu" in slurm.conf;
e2, setting the number of NPU resources of the NPU nodes in slurm.conf;
e3, in the GRES configuration file gres.conf, specifying the nodes with NPU resources and the device files of each node's NPU devices;
e4, adding ConstrainDevices to the cgroup.conf file so that SLURM can schedule resources in GRES units instead of whole nodes;
In the aforementioned method for managing and scheduling the Ascend processor based on the SLURM job scheduling system, if the cluster also contains GPU resources, step E1 may be set to "GresTypes=npu,gpu", indicating that the NPU and the GPU are supported simultaneously;
Compared with the prior art, the invention treats the Ascend processor as a GRES generic resource scheduled by SLURM, combining the Ascend processor with the job scheduling system of a high-performance cluster for the first time, so that the Ascend processor can be quickly applied to cross-node, very large-scale computing scenarios; this broadens the application scenarios of the Ascend processor, enriches the resource categories of high-performance clusters, increases the computing power of high-performance clusters, and reduces job computation time.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is further illustrated by the following figures and examples, which are not to be construed as limiting the invention.
Example. An Ascend processor management and scheduling method based on the SLURM job scheduling system, as shown in FIG. 1, treats the Ascend processor as an NPU similar to a GPU, and manages and schedules the NPU through the GRES plug-in of SLURM, thereby realizing the management and scheduling of the Ascend processor; the method comprises the following specific steps:
A. adding an NPU plug-in module: acquiring hardware information of the NPU through an interface;
B. adding a function for jobs to request NPU resources: requesting NPU resources through the salloc, srun or sbatch commands;
C. adding an NPU module to the GRES plug-in: adding an NPU module in the GRES plug-in to allocate and manage NPU resources;
D. recompiling the SLURM source code: adding a compile option for the NPU module and recompiling SLURM;
E. modifying the SLURM configuration files: modifying the SLURM configuration files to support the NPU;
F. starting the SLURM services to manage and schedule the Ascend processor.
The specific contents of the NPU plug-in module added in the step A are as follows:
setting or acquiring the hardware information of the NPU through the DSMI interface functions, wherein the hardware information at least comprises the number of NPU chips and the chip model of the NPU;
the DSMI interface functions mainly used for adding the NPU plug-in module are shown in the following table:
TABLE 1 DSMI interface
DSMI interface Description of the interface
dsmi_get_version Obtaining interface versions
dsmi_get_chip_info Obtaining chip information
dsmi_get_device_count Obtain the number of chips
dsmi_get_memory_info Obtaining memory information
dsmi_get_device_frequency Obtaining chip frequency
dsmi_get_phyid_from_logicid Logical ID to physical ID translation
dsmi_get_logicid_from_phyid Physical ID to logical ID translation
The specific implementation scheme is as follows: a new folder named npu is added under the src/plugins directory of SLURM, and the directory structure inside the npu folder is as follows:
(Figure: directory listing of the npu plug-in folder, not reproduced here.)
npu_generic.c is the file that actually implements the NPU module; Makefile.am and Makefile.in are build auxiliary files that mainly tell the SLURM build system how to compile npu_generic.c. The main role of npu_generic.c is to acquire the hardware information of the NPU and pass it to the SLURM scheduling system. For example, the function _get_system_npu_list_dsmi in this file is declared as:
static List _get_system_npu_list_dsmi(node_config_load_t *node_config)
The function does the following: if NPU resources are detected on the node, it returns a list describing those NPU resources, acquiring NPU resource information that includes the NPU driver version, NPU chip information, the number of NPU chips, and so on.
The function calls the DSMI interface functions dsmi_get_version, dsmi_get_chip_info and dsmi_get_device_count from Table 1 to implement this behaviour.
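For illustration, a minimal sketch of such a detection routine is given below. It is a simplification: the DSMI prototypes declared here (dsmi_get_device_count, dsmi_get_version) are assumed signatures rather than the authoritative DSMI API, and the real plug-in returns a SLURM List built with list_create()/list_append() instead of filling a plain array.

    /* Simplified sketch of NPU detection via DSMI (not the actual plug-in code). */
    #include <stdio.h>
    #include <string.h>

    /* Assumed DSMI prototypes; in the real build they come from the DSMI header
     * and the library located through the --with-dsmi configure option. */
    extern int dsmi_get_device_count(int *device_count);
    extern int dsmi_get_version(int device_id, char *version, int buf_len, int *out_len);

    typedef struct {
        int  device_id;            /* logical NPU index on this node */
        char driver_version[64];   /* DSMI/driver version string     */
    } npu_record_t;

    /* Enumerate the NPUs visible on this node; returns the number detected. */
    static int get_system_npu_list_dsmi(npu_record_t *records, int max_records)
    {
        int count = 0;
        if (dsmi_get_device_count(&count) != 0 || count <= 0)
            return 0;                       /* no Ascend NPU on this node */
        if (count > max_records)
            count = max_records;
        for (int i = 0; i < count; i++) {
            int len = 0;
            records[i].device_id = i;
            if (dsmi_get_version(i, records[i].driver_version,
                                 sizeof(records[i].driver_version), &len) != 0)
                strcpy(records[i].driver_version, "unknown");
        }
        return count;
    }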
In addition, the npu_generic.c file contains the following functions:
Table 2: main functions in the npu_generic.c file
Primary function Description of the function
init NPU plug-in initialization
fini NPU plug-in call termination
_dsmi_get_mem_freqs Obtaining NPU chip memory frequency information
_dsmi_get_gfx_freqs Obtaining NPU chip frequency information
dsmiDeviceGetName Obtaining NPU chip information
dsmiDeviceGetMinorNumber Acquire minor number of equipment
dsmiSystemGetDSMIVersion Obtaining DSMI interface version name
npu_p_get_system_npu_list Detecting NPU resource information of a node
Step B adds the function for jobs to request NPU resources. Once the NPU is scheduled as a GRES generic resource, a user requests NPU resources by adding a parameter such as --gres=npu:4 to the salloc command to apply for 4 NPU cards. The scheduling system then selects an eligible server node with 4 idle Ascend processors, allocates it to the job, opens the usage rights of those 4 Ascend processors to the user, grants the user the right to log in to that server node and use the resources, and then runs the job; a usage example is given after Table 3 below.
the NPU related parameters in the salloc instruction are shown in the following table:
TABLE 3 NPU-related parameters of salloc instruction
(Table 3 is shown as an image in the original and is not reproduced here.)
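As a concrete illustration, requests of this kind might look as follows (a usage sketch: the GRES name npu assumes the configuration described in step E, and my_npu_app and job.sh are placeholder names):

    salloc -N 1 --gres=npu:4          # allocate one node with 4 idle NPU cards
    srun --gres=npu:1 ./my_npu_app    # run a job step that uses 1 NPU card
    sbatch --gres=npu:2 job.sh        # submit a batch script requesting 2 NPU cards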
Taking the salloc command as an example, the specific implementation of adding the request for NPU resources to the salloc command is as follows:
b1, add command parameters for NPU support in the definition of the slurm_opt_t structure in src/common/slurm_opt.h:
(Code listing for b1 is shown as an image in the original and is not reproduced here.)
b2, in the _fill_job_desc_from_opts function in the src/salloc/salloc.c file, the following code is added:
(Code listing for b2 is shown as an image in the original and is not reproduced here.)
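The patch itself appears only as figures in the original. A simplified sketch of the idea behind b1 and b2 follows; the member name npus and the helper function below are illustrative assumptions, not the actual code, and the xstrfmtcat helper and tres_per_node job-descriptor field mirror SLURM conventions but should be verified against the SLURM version in use:

    /* b1 (sketch): a member added to the option structure in src/common/slurm_opt.h
     * to hold the value of an NPU request such as --gres=npu:4, e.g.
     *     char *npus;    requested NPU count per node (assumed name)         */

    /* b2 (sketch): in _fill_job_desc_from_opts() in src/salloc/salloc.c, copy
     * the request into the job descriptor so the controller can schedule GRES. */
    #include "slurm/slurm.h"            /* job_desc_msg_t                      */
    #include "src/common/xstring.h"     /* xstrfmtcat()                        */

    static void _set_npu_request(const char *npus, job_desc_msg_t *desc)
    {
        if (!npus)
            return;
        /* Append "gres:npu:<count>" to the per-node TRES request string. */
        xstrfmtcat(desc->tres_per_node, "%sgres:npu:%s",
                   desc->tres_per_node ? "," : "", npus);
    }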
Step C adds the NPU module to the GRES plug-in; the specific content is as follows:
a folder named npu is added under the src/plugins/gres folder, and the directory structure inside the npu folder is as follows:
(Figure: directory listing of the gres/npu folder, not reproduced here.)
gres_npu.c is the file that adds the NPU module to the GRES plug-in; Makefile.am and Makefile.in are build auxiliary files mainly used to tell the SLURM build system how to compile the gres_npu.c file.
The gres_npu.c file contains the following main functions:
Table 4: main functions in the gres_npu.c file
(Table 4 is shown as an image in the original and is not reproduced here.)
The recompilation of the SLURM source code described in step D has the following specific contents:
d1, add a --with-dsmi option to the SLURM build configuration; when the --with-dsmi parameter is used during compilation, the library files on which the DSMI interface depends are located and the newly added NPU-related code files are compiled;
(Code listing for d1 is shown as an image in the original and is not reproduced here.)
d2, adding an x_ac_dsmi.m4 file in the auxdir folder of the SLURM root directory; it is used to specify the library files on which the DSMI interface depends;
(Code listing for d2, the x_ac_dsmi.m4 macro, is shown as an image in the original and is not reproduced here.)
d3, adding support for the NPU module in the src/plugins/gres/Makefile.am file of the GRES plug-in;
(Code listing for d3 is shown as an image in the original and is not reproduced here.)
d4, adding support for the NPU module in the Makefile.am file in the src/plugins folder under the root directory;
(Code listing for d4 is shown as an image in the original and is not reproduced here.)
d5, adding support for the Makefiles added for the NPU in the configuration file under the root directory;
(Code listing for d5 is shown as an image in the original and is not reproduced here.)
d6, recompiling the modified SLURM code.
(Code listing for d6 is shown as an image in the original and is not reproduced here.)
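The recompilation follows SLURM's standard autotools flow; a sketch is given below (the --with-dsmi flag is the option added in d1, and the DSMI installation path shown is only an assumed example):

    autoreconf -i                                     # regenerate configure from configure.ac and auxdir/*.m4
    ./configure --with-dsmi=/usr/local/Ascend/driver  # locate the DSMI headers and libraries
    make -j && make install                           # build the new npu plug-ins and install
    systemctl restart slurmctld                       # on the management node
    systemctl restart slurmd                          # on the computing nodes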
The specific content of the modified SLURM configuration file described in step E is as follows:
e1, set "GresTypes=npu" in slurm.conf; if the cluster also contains GPU resources, set "GresTypes=npu,gpu" to indicate simultaneous support for the NPU and the GPU;
e2, setting the number of NPU resources of each NPU node in slurm.conf;
e3, in the GRES configuration file gres.conf, specifying the nodes that have NPU resources and the device files of each node's NPU devices;
e4, add the following to the cgroup.conf file:
ConstrainCores=yes;
ConstrainRAMSpace=yes;
ConstrainDevices=yes;
The ConstrainDevices setting enables jobs to be scheduled per GRES, that is, resources can be allocated in units of a single NPU card; for example, a node with 8 NPU cards can simultaneously run 8 tasks that each request one NPU card.
For an Atlas 800 server with 8 NPU cards, the node is configured in slurm.conf as follows: NodeName=huawei CPUs=192 Gres=npu:8 ThreadsPerCore=1 RealMemory=785000;
The gres.conf file is configured as follows: NodeName=huawei Name=npu File=/dev/davinci[0-7].
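Putting the configuration together, a minimal excerpt for such a node reads as follows (values follow the example above; node names and device paths are site-specific):

    # slurm.conf
    GresTypes=npu
    NodeName=huawei CPUs=192 Gres=npu:8 ThreadsPerCore=1 RealMemory=785000

    # gres.conf (on node huawei)
    NodeName=huawei Name=npu File=/dev/davinci[0-7]

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes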
Step F, starting the SLURM services, taking CentOS 7 as an example:
start the slurmctld service on the management node: systemctl start slurmctld;
start the slurmd service on all SLURM management and computing nodes: systemctl start slurmd.
After SLURM and the Ascend processor (NPU) are integrated, the Ascend server can schedule Ascend processor resources through SLURM; when a user submits a job requesting NPU resources, the job queues and waits for NPU resources to be allocated and then runs. In addition, the user can query the cluster state, query node resources, query partition resources, submit jobs, and query job status.
The following are examples of functions implemented by the present invention:
1) querying cluster status
The sinfo instruction can check the states of all nodes in the whole cluster, including the CPU, GPU and NPU nodes. The figure below shows that there is 1 node in the huawei partition, that the node is named huawei, and that its state is idle, so a job submitted to this partition can run immediately.
(Figure: sinfo output, not reproduced here.)
2) Querying node resources
The scontrol show node instruction can check the resource status of a node and its current operating state. The figure below shows that the huawei node has 192 CPU cores, 8 NPU cards and 785000 MB of memory, and that the node is currently idle with no job running on it.
(Figure: scontrol show node output, not reproduced here.)
3) Querying partition resources
The scontrol show partition instruction can check the state of a partition. The figure below shows that the huawei partition has only one node, that all accounts are allowed to submit jobs to it, that it provides 192 CPU cores, and so on.
(Figure: scontrol show partition output, not reproduced here.)
4) Submitting a job
The following figure demonstrates an srun command submitting a job: the command requests one node and 1 NPU card and runs a simple hostname command; the job runs successfully and outputs the hostname of the node on which it ran.
(Figure: srun output, not reproduced here.)
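The demonstrated command corresponds to an invocation of roughly the following form (a sketch; the partition name huawei matches the example cluster):

    srun -p huawei -N 1 --gres=npu:1 hostname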
5) Querying job status
The squeue command can check job status. The figure below shows a job with job number 158, submitted by the user huawei, currently in the running state, with the job name bash.
(Figure: squeue output, not reproduced here.)
The scontrol show job command can view the detailed information of a job. The command in the figure shows that job 158 runs on the huawei node, that the job was submitted at 19:39:33 on January 6, 2021, that the QOS used by the job is normal, and so on.
(Figure: scontrol show job output, not reproduced here.)
6) Viewing job records
The sacct instruction can obtain data from the job accounting records to view the resource usage of running or completed jobs. The command in the figure shows that job 158 ran on the node huawei, used 1 CPU core, and used 1 NPU card.
(Figure: sacct output, not reproduced here.)
Analysis of job 158 shows that it ran on the huawei node, which has 8 NPU cards, while the job requested only 1 NPU card. This indicates that the invention enables SLURM to schedule a single NPU node per NPU card rather than as a whole node, so a node with 8 NPU cards can run at most 8 NPU job tasks simultaneously. This greatly improves the utilization of NPU resources and satisfies users' jobs with varied requirements such as 1, 2 or 4 NPU cards; such fine-grained scheduling improves resource utilization, supports small tasks that need NPU resources, and improves the job throughput of the whole cluster.
Tests on several high-performance computing clusters have shown that the method supports the normal use of the whole series of operations, including querying the cluster state, querying node resources, querying partition resources, submitting jobs and querying job status, and that all of these operations return results within 3 seconds.
Checking the cluster state, querying node resources, querying partition resources, submitting jobs and similar instructions all require the SLURM scheduler to obtain the resource state of the nodes (including the NPU nodes). SLURM checks node states periodically; if a node fails, it is removed from the available (idle) queue and its state is set to unavailable (down). These operations embody the SLURM scheduler's management of NPU node resources.
When a job requesting NPU resources is submitted, SLURM needs to select the most suitable node among the NPU nodes and set up the environment of the job and the node so that the job can run successfully on the corresponding NPU node; this embodies the SLURM scheduler's scheduling of NPU resources.
Therefore, all the commands required for high-performance cluster management and scheduling, including querying the cluster state, querying node resources, querying partition resources, submitting jobs and querying job status, are realized for a high-performance cluster containing NPU resources, thereby realizing the management and scheduling of the Ascend processor (NPU) by SLURM.

Claims (5)

1. An Ascend processor management and scheduling method based on the SLURM job scheduling system, characterized in that: the Ascend processor is used as an NPU similar to a GPU, and the NPU is managed and scheduled through the GRES plug-in of SLURM, thereby managing and scheduling the Ascend processor; the method comprises the following specific steps:
A. adding an NPU plug-in module: acquiring hardware information of the NPU through an interface;
B. adding a function for jobs to request NPU resources: requesting NPU resources through the salloc, srun or sbatch commands;
C. adding an NPU module to the GRES plug-in: adding an NPU module in the GRES plug-in to allocate and manage NPU resources;
D. recompiling the SLURM source code: adding a compile option for the NPU module and recompiling SLURM;
E. modifying the SLURM configuration files: modifying the SLURM configuration files to support the NPU;
F. starting the SLURM services to manage and schedule the Ascend processor.
2. The method as claimed in claim 1, wherein the NPU plug-in module of step A is as follows:
the hardware information of the NPU is set or acquired through the DSMI interface functions, the hardware information at least comprising the number of NPU chips and the chip model of the NPU.
3. The method as claimed in claim 1, wherein the step C of adding an NPU module to the GRES plug-in module includes the following steps:
adding a folder named npu under the src/plugins/gres folder, wherein the gres_npu.c file in the folder realizes the functions of initializing the NPU module in the GRES plug-in, setting environment variables, acquiring job information, acquiring the NPU resource list and setting the running parameters of jobs, thereby completing the addition of the NPU module to the GRES plug-in.
4. The method as claimed in claim 1, wherein the recompiling SLURM source code of step D comprises:
d1, add a --with-dsmi option to the SLURM build configuration;
d2, adding an x_ac_dsmi.m4 file in the auxdir folder of the SLURM root directory;
d3, adding support for the NPU module in the src/plugins/gres/Makefile.am file of the GRES plug-in;
d4, adding support for the NPU module in the Makefile.am file in the src/plugins folder under the root directory;
d5, adding support for the Makefile added by the NPU in the configuration file under the root directory;
d6, recompiling the modified SLURM code.
5. The method as claimed in claim 1, wherein the SLURM configuration files are modified in step E as follows:
e1, set "GresTypes=npu" in slurm.conf;
e2, setting the number of NPU resources of the NPU nodes in slurm.conf;
e3, in the GRES configuration file gres.conf, specifying the nodes with NPU resources and the device files of each node's NPU devices;
e4, adding ConstrainDevices to the cgroup.conf file, so that SLURM can schedule resources in GRES units.
CN202110096508.6A 2021-01-25 2021-01-25 Ascend processor management and scheduling method based on SLURM job scheduling system Active CN112882828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110096508.6A CN112882828B (en) 2021-01-25 2021-01-25 Method for managing and scheduling a processor in a processor-based SLURM operation scheduling system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110096508.6A CN112882828B (en) 2021-01-25 2021-01-25 Method for managing and scheduling a processor in a processor-based SLURM operation scheduling system

Publications (2)

Publication Number Publication Date
CN112882828A true CN112882828A (en) 2021-06-01
CN112882828B CN112882828B (en) 2023-09-05

Family

ID=76050985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110096508.6A Active CN112882828B (en) 2021-01-25 2021-01-25 Method for managing and scheduling a processor in a processor-based SLURM operation scheduling system

Country Status (1)

Country Link
CN (1) CN112882828B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
CN113722269A (en) * 2021-08-26 2021-11-30 北京大学 Stride slice operator processing method and device based on soaring AI processor
CN114461186A (en) * 2021-12-15 2022-05-10 中山大学 A method for automatically compiling and running C/C++ code for Huawei Ascend accelerator card
CN114745385A (en) * 2022-04-12 2022-07-12 吉林大学 Method for constructing slurm scheduling parallel computing cluster
CN117632428A (en) * 2023-12-01 2024-03-01 世芯电子科技(无锡)有限公司 Resource scheduling management method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593192A (en) * 2013-11-19 2014-02-19 湖南大学 Algorithm integration and evaluation platform and method based on SLURM scheduling
CN108334409A (en) * 2018-01-15 2018-07-27 北京大学 A kind of fine-grained high-performance cloud resource management dispatching method
US20180336723A1 (en) * 2017-05-17 2018-11-22 Lawrence Livermore National Security, Llc Tool for shared engineering mesh-run integration with version evolution tracking
CN110795241A (en) * 2019-10-18 2020-02-14 北京并行科技股份有限公司 Job scheduling management method, scheduling center and system
CN111198755A (en) * 2019-12-23 2020-05-26 曙光信息产业(北京)有限公司 SLURM job scheduling system-based pre-charging device and method
WO2020172692A2 (en) * 2020-04-27 2020-08-27 Futurewei Technologies, Inc. Dynamic resource tuning cloud service

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593192A (en) * 2013-11-19 2014-02-19 湖南大学 Algorithm integration and evaluation platform and method based on SLURM scheduling
US20180336723A1 (en) * 2017-05-17 2018-11-22 Lawrence Livermore National Security, Llc Tool for shared engineering mesh-run integration with version evolution tracking
CN108334409A (en) * 2018-01-15 2018-07-27 北京大学 A kind of fine-grained high-performance cloud resource management dispatching method
CN110795241A (en) * 2019-10-18 2020-02-14 北京并行科技股份有限公司 Job scheduling management method, scheduling center and system
CN111198755A (en) * 2019-12-23 2020-05-26 曙光信息产业(北京)有限公司 SLURM job scheduling system-based pre-charging device and method
WO2020172692A2 (en) * 2020-04-27 2020-08-27 Futurewei Technologies, Inc. Dynamic resource tuning cloud service

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
CN113722269A (en) * 2021-08-26 2021-11-30 北京大学 Stride slice operator processing method and device based on soaring AI processor
CN114461186A (en) * 2021-12-15 2022-05-10 中山大学 A method for automatically compiling and running C/C++ code for Huawei Ascend accelerator card
CN114745385A (en) * 2022-04-12 2022-07-12 吉林大学 Method for constructing slurm scheduling parallel computing cluster
CN117632428A (en) * 2023-12-01 2024-03-01 世芯电子科技(无锡)有限公司 Resource scheduling management method, device, equipment and storage medium
CN117632428B (en) * 2023-12-01 2024-05-28 世芯电子科技(无锡)有限公司 Resource scheduling management method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112882828B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN112882828A (en) Upgrade processor management and scheduling method based on SLURM job scheduling system
Razaque et al. Task scheduling in cloud computing
US7689996B2 (en) Method to distribute programs using remote Java objects
US8739171B2 (en) High-throughput-computing in a hybrid computing environment
CN104243617B (en) Towards the method for scheduling task and system of mixed load in a kind of isomeric group
US8171481B2 (en) Method and system for scheduling jobs based on resource relationships
CN107291550B (en) A Spark platform resource dynamic allocation method and system for iterative applications
CN107316124B (en) Extensive affairs type job scheduling and processing general-purpose system under big data environment
CN102521024A (en) Job scheduling method based on bioinformation cloud platform
CN106027617A (en) Method for implementing dynamic scheduling of tasks and resources in private cloud environment
CN101976204B (en) Service-oriented heterogeneous multi-core computing platform and task scheduling method used by same
CN105094984A (en) Resource scheduling method and system
CN105912383A (en) High-reliability dependent task scheduling and resource configuration method
Wang et al. Dependency-aware network adaptive scheduling of data-intensive parallel jobs
Stafford et al. Improving utilization of heterogeneous clusters
CN114356714B (en) Resource integrated monitoring and scheduling device based on Kubernetes intelligent board cluster
CN110084507B (en) A hierarchical-aware scientific workflow scheduling optimization method in cloud computing environment
CN114816694A (en) A multi-process collaborative RPA task scheduling method and device
CN115794355B (en) Task processing method, device, terminal equipment and storage medium
Santcroos et al. Executing dynamic heterogeneous workloads on blue waters with radical-pilot
Pandey et al. Constraint programming versus heuristic approach to MapReduce scheduling problem in Hadoop YARN for energy minimization
CN114896054A (en) Cross-heterogeneous computing engine big data task scheduling method, device and medium
Zhang et al. COBRA: Toward provably efficient semi-clairvoyant scheduling in data analytics systems
US8402465B2 (en) System tool placement in a multiprocessor computer
CN112711448A (en) Agent technology-based parallel component assembling and performance optimizing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant