
CN115904675A - Virtual GPU scheduling method, device and medium for container system


Info

Publication number
CN115904675A
Authority
CN
China
Prior art keywords
gpu
vgpu
type
filtering
pod
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110907216.6A
Other languages
Chinese (zh)
Inventor
邱红飞
李先绪
郑文武
黄植勤
王海霞
黄春光
陈辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202110907216.6A priority Critical patent/CN115904675A/en
Publication of CN115904675A publication Critical patent/CN115904675A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present disclosure relates to a vGPU scheduling method for a container system, comprising the following steps: in response to a resource allocation request initiated by the Kubelet to the GPU Device Plugin, performing, by the Kubernetes default scheduler, primary filtering of all host nodes by vGPU number and type, to screen out the host nodes that satisfy the conditions; performing, by a GPU Scheduler Extender, secondary filtering on the host nodes screened out by the primary filtering, to screen out the GPU card whose free resources satisfy the conditions with the least free resources; if a GPU card satisfying the conditions is found through the primary and secondary filtering, creating a vGPU on that card and binding the vGPU to the corresponding POD; and if no GPU card satisfying the conditions is found through the primary and secondary filtering, performing scheduling again.

Description

Virtual GPU scheduling method, device and medium for container system
Technical Field
The present disclosure relates to virtualized GPU scheduling for container systems.
Background
At present, enterprise data volumes keep growing, and enterprises purchase large numbers of GPU servers to satisfy the huge demand for artificial-intelligence computing power. This creates a need for the ability to efficiently manage and schedule large-scale GPU resources and to flexibly schedule cluster GPU computing power. Just as with cluster CPU computing power, enterprises also need to improve GPU resource utilization, for example through distributed training, in order to control cost and reduce wasted computing resources.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
The disclosure provides a virtualized GPU scheduling method for a container system, the method comprising:
in response to a resource allocation request initiated by the Kubelet to the GPU Device Plugin, performing, by the Kubernetes default scheduler, primary filtering of all host nodes by vGPU number and type, to screen out the host nodes that satisfy the conditions;
performing, by a GPU Scheduler Extender, secondary filtering on the host nodes screened out by the primary filtering, to screen out the GPU card whose free resources satisfy the conditions with the least free resources;
if a GPU card satisfying the conditions is found through the primary and secondary filtering, creating a vGPU on that card and binding the vGPU to the corresponding POD; and
if no GPU card satisfying the conditions is found through the primary and secondary filtering, performing scheduling again.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of the preferred embodiments of the present disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
fig. 1 shows a schematic diagram of a scheduling system of virtualized GPUs of a container system according to the present disclosure.
Fig. 2 shows an exemplary overall process of Device Plugin resource reporting.
Fig. 3 shows an exemplary procedure of scheduling and execution of POD.
Fig. 4 illustrates an exemplary scheduling architecture diagram of the present disclosure.
Fig. 5 illustrates an exemplary primary filtering of the present disclosure.
Fig. 6 illustrates an exemplary secondary filtering of the present disclosure.
Fig. 7 illustrates an exemplary operation of the monitoring unit of the present disclosure.
Fig. 8 illustrates an exemplary process of vGPU creation, scheduling, mounting to a POD, and running according to the present disclosure.
FIG. 9 illustrates an exemplary configuration of a computing device capable of implementing embodiments in accordance with the present disclosure.
Detailed Description
The following detailed description is made with reference to the accompanying drawings and is provided to assist in a comprehensive understanding of various exemplary embodiments of the disclosure. The following description includes various details to aid understanding, but these details are to be regarded as examples only and are not intended to limit the disclosure, which is defined by the appended claims and their equivalents. The words and phrases used in the following description are used only to provide a clear and consistent understanding of the disclosure. In addition, descriptions of well-known structures, functions, and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the spirit and scope of the disclosure.
Currently, Kubernetes has become the de facto standard for container orchestration; it provides deployment and scheduling capabilities for GPU application containers, accelerating the expansion of computing power for scenarios such as artificial-intelligence model development, training, and inference onto GPU clusters. Here, a POD is a scheduling unit of Kubernetes, corresponding to one or a group of application containers of the same type. vGPUs are GPU slices of different types virtualized by GRID through the VFIO technology, and Kubernetes can allocate vGPUs to the virtual machines of nested-container runtimes (such as Kata).
The GPU container scheduling capability currently provided by the Kubernetes service has the following problems. The first is coarse-grained cluster scheduling: Kubernetes' definition of extended resources such as GPUs supports only integer-granularity addition and subtraction, and cannot support the allocation of complex resources, for example scheduling and allocation by vGPU type. The second is GPU resource allocation and reclamation: the current Kubernetes extension plug-ins cannot obtain the POD corresponding to a vGPU device, so when a POD is destroyed its vGPU cannot be destroyed in time and cannot be reused by Kubernetes, meaning that GPU resources are not allocated and reclaimed promptly.
The present disclosure relates to a scheduling system for a virtualized GPU of a container system.
Fig. 1 shows a schematic diagram of a scheduling system of virtualized GPUs of a container system according to the present disclosure.
The system may comprise a vGPU information registration and resource reporting module, an extended scheduling module, a vGPU creation module, a relationship monitoring module, and a vGPU destruction module.
In some embodiments, the vGPU information registration and resource reporting module functions as follows: it registers a resource for each supported vGPU type and reports it to the Kubelet, the resource comprising (vGPU type, total number of vGPUs, number of free vGPUs); the Kubelet further reports the resource to the Kubernetes API Server.
In some embodiments, the extended scheduling module functions as follows: it screens the working nodes (host nodes) according to the received existing resources and the requested resources, binds nodes to PODs, screens out suitable physical GPUs (GPU cards), and finally returns the scheduling result (the ID of the vGPU) to the Kubelet.
In some embodiments, the vGPU creation module functions as follows: it creates a vGPU on the screened-out GPU card according to the requested resources, and binds the vGPU to the virtual machine of the corresponding POD.
In some embodiments, the relationship monitoring module functions as follows: it obtains the correspondence between vGPUs and PODs, and monitors and obtains POD destruction events through the Kubernetes API Server.
In some embodiments, the vGPU destruction module functions as follows: it determines the destruction time of a POD, and destroys in time any vGPU whose corresponding POD is out of its life cycle or unhealthy.
Compared with the prior art, on the one hand, fine-grained scheduling is achieved by the extended scheduling function, which, based on the received existing resources and the requested resources, screens out a suitable physical GPU card on which to create the vGPU; on the other hand, the correspondence between PODs and vGPUs is established by monitoring the Kubernetes device management module, and, using the life-cycle and health information of PODs, the vGPU corresponding to a POD is destroyed in time and GPU resources are reclaimed. The present disclosure can therefore accomplish flexible scheduling of GPU cluster computing power under Kubernetes, improve GPU cluster utilization, and ensure an efficient and energy-saving data center.
In the prior art, for example Chinese patent application CN202010263955.1 (publication No. CN111506404A), isolation is implemented based on NVIDIA's MPS, but MPS cannot isolate resource usage. Since MPS shares the GPU context, there is a bottleneck problem: once an error occurs, all instances are affected.
Compared with the prior art, the container-oriented virtualized GPU scheduling method of the present disclosure is implemented based on NVIDIA GRID. GRID supports GPU isolation by resources and is mature and stable; solving the problem of multiple PODs sharing a GPU on top of GRID allows the resources of one GPU to be fully utilized by PODs, improving resource utilization. In this scheme, an extended scheduling function is established and the vGPU type is used as the scheduling rule: first the working NODEs that meet the requirements are screened out, and then the physical GPU cards that meet the requirements are screened again, so as to create a vGPU that meets the user's requirements. Meanwhile, the correspondence between PODs and vGPUs is discovered through monitoring, the vGPU is destroyed in time after its POD is destroyed, and GPU resources are reclaimed. Flexible scheduling of GPU cluster computing power can thus be accomplished, GPU resource utilization improved, and an efficient and energy-saving data center ensured.
GRID is described below. vGPUs are GPU slices of different types virtualized by GRID through the VFIO technology, and Kubernetes can allocate vGPUs to the virtual machines of nested-container runtimes (such as Kata).
In some embodiments, vGPU types include the following: A (application virtualization), B (office graphics), and Q (professional graphics and compute). For example, P100-2Q means that a Tesla P100 card is divided into vGPUs of 2 GB video memory each, with type Q indicating that the vGPU supports professional graphics and compute.
The working mechanism of the Device Plugin is introduced below.
The GPU Scheduler Extender functions as follows: using the Kubernetes scheduler extension mechanism, at the global scheduler's Filter and Bind points it determines whether a single GPU card on a node can provide the required vGPU type, and at Bind time it records the GPU allocation result into the POD spec via an annotation, for subsequent Filters to check.
The GPU Device Plugin functions as follows: through the Device Plugin mechanism, it is called by the Kubelet on each node and is responsible for the allocation of vGPUs; the allocation it performs depends on the Scheduler Extender's allocation result.
The whole Device Plugin workflow can be divided into two parts: resource reporting at start-up time, and scheduling and running at use time.
The two core RPC methods of the Device Plugin are as follows. ListAndWatch reports the corresponding resources and provides a health-check mechanism: when a device is unhealthy, it reports the ID of the unhealthy device to Kubernetes, letting the Device Plugin framework remove that device from the schedulable devices. Allocate is called through the Device Plugin when deploying a container; the core of its input parameters is the device IDs the container will use, and its return values are the devices, data volumes, and environment variables needed when the container starts. A minimal sketch of these two methods follows.
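The sketch below uses the real k8s.io/kubelet v1beta1 device plugin API; only these two methods are shown (the full DevicePluginServer interface has a few more), and the VGPU_DEVICE_* environment variable name is an illustrative assumption, not part of the disclosure.

```go
// A minimal sketch of the two core Device Plugin RPCs.
package main

import (
	"context"
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type vgpuPlugin struct {
	devices []*pluginapi.Device // one entry per pre-generated vGPU device ID
}

// ListAndWatch streams the device list once, then again on every health
// change, so the kubelet always sees the current set of schedulable devices.
func (p *vgpuPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devices}); err != nil {
		return err
	}
	select {} // a real plugin re-Sends whenever a device turns (un)healthy
}

// Allocate receives the device IDs chosen for each container and returns the
// devices, mounts and environment variables the container needs at start-up.
func (p *vgpuPlugin) Allocate(_ context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, cr := range req.ContainerRequests {
		c := &pluginapi.ContainerAllocateResponse{Envs: map[string]string{}}
		for i, id := range cr.DevicesIDs {
			c.Envs[fmt.Sprintf("VGPU_DEVICE_%d", i)] = id // hypothetical variable
		}
		resp.ContainerResponses = append(resp.ContainerResponses, c)
	}
	return resp, nil
}
```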
Resource reporting and monitoring is further described below.
For each hardware device, a corresponding Device Plugin manages it. These Device Plugins connect, as gRPC clients, to the Device Plugin Manager in the Kubelet, and report to the Kubelet the Unix socket they listen on, their API version number, and the device name they manage (such as GPU).
The following describes the whole process of Device Plugin resource reporting with reference to fig. 2.
The first step is Device Plugin registration; Kubernetes needs to know which Device Plugin to interact with, because there may be multiple devices on a node. The Device Plugin, in the role of a client, reports three things to the Kubelet: the name of the device it manages, whether GPU or RDMA; the file path of the Unix socket the plugin listens on, so that the Kubelet can call it; and the interaction protocol, i.e. the API version number.
The second step is service start-up: the Device Plugin starts a gRPC server. From then on the Device Plugin serves the Kubelet in the capacity of that server, while the listening address and the API version were already provided in the first step. A sketch of these two steps is shown below.
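A minimal sketch of the two steps, assuming the vgpuPlugin from the earlier sketch has been completed to the full DevicePluginServer interface; the socket name "vgpu-2q.sock" and resource name "example.com/vgpu-2q" are illustrative assumptions.

```go
// A minimal sketch of serving the plugin API and registering with the kubelet.
package main

import (
	"context"
	"net"
	"path/filepath"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// plugin stands for the vgpuPlugin sketched earlier, assumed completed to the
// full DevicePluginServer interface.
var plugin pluginapi.DevicePluginServer

func run() error {
	sock := filepath.Join(pluginapi.DevicePluginPath, "vgpu-2q.sock")

	// Step two: start our own gRPC server on a Unix socket.
	lis, err := net.Listen("unix", sock)
	if err != nil {
		return err
	}
	srv := grpc.NewServer()
	pluginapi.RegisterDevicePluginServer(srv, plugin)
	go srv.Serve(lis)

	// Step one: register with the kubelet over its well-known socket,
	// reporting our endpoint, API version and resource (device) name.
	conn, err := grpc.Dial("unix://"+pluginapi.KubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = pluginapi.NewRegistrationClient(conn).Register(context.Background(),
		&pluginapi.RegisterRequest{
			Version:      pluginapi.Version,     // interaction protocol (API version)
			Endpoint:     "vgpu-2q.sock",        // socket the kubelet calls back on
			ResourceName: "example.com/vgpu-2q", // one resource per vGPU type
		})
	return err
}
```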
Third, after the gRPC server is started, the Kubelet establishes a long connection to the Device Plugin's ListAndWatch in order to discover device IDs and device health status. When the Device Plugin detects that a device is unhealthy, it actively notifies the Kubelet. If the device is idle at that point, the Kubelet removes it from the allocatable list; but if the device is already in use by a POD, the Kubelet does nothing, since killing the POD would be a dangerous operation.
Fourth, the Kubelet exposes these devices in the Node status and sends the device count to the Kubernetes api-server. The scheduler can subsequently schedule based on this information.
It should be noted that when reporting to the api-server, the Kubelet reports only the count of GPUs. The Kubelet's own Device Plugin Manager keeps the GPU ID list and uses it for the concrete device assignment. The Kubernetes global scheduler, by contrast, has no knowledge of this GPU ID list; it only knows the number of GPUs. This means that under the existing Device Plugin working mechanism, the Kubernetes global scheduler cannot perform more complex scheduling.
The scheduling and operation of POD will be described below with reference to fig. 3.
When a POD wants to use a GPU, it only needs to declare the GPU resource and the corresponding count (e.g. nvidia.com/gpu: 1) in the limits field under the POD's resources, as in the earlier example. Kubernetes then finds a Node satisfying the count condition, decrements that Node's GPU count by 1, and completes the binding of the POD to the Node. A sketch of such a declaration follows.
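For illustration, the same request expressed with the k8s.io/api Go types; the pod, container, and image names are placeholders.

```go
// A minimal sketch of declaring an extended GPU resource in a POD spec.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func gpuPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-demo"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvidia/cuda:11.0-base",
				Resources: corev1.ResourceRequirements{
					// extended resources are counted in integers only
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
}
```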
After binding succeeds, the container is naturally created by the Kubelet of the corresponding node. When the Kubelet finds that the resource requested by the POD's container is a GPU, it entrusts its internal Device Plugin Manager module to select an available GPU from the GPU ID list it owns and allocate it to the container.
At this point, the Kubelet initiates an allocation request to the local Device Plugin, carrying as parameter the list of device IDs to be assigned to the container.
After receiving the AllocateRequest, the Device Plugin finds the device path, driver directory, and environment variables corresponding to the device IDs passed in by the Kubelet, and returns them to the Kubelet in the form of an AllocateResponse.
Once the device path and driver directory carried in the AllocateResponse are returned to the Kubelet, the Kubelet performs the operation of allocating the GPU to the container based on this information, so that Docker creates the container as instructed by the Kubelet and a GPU device appears in the container, with the driver directory it needs mounted in. At this point the process by which Kubernetes allocates a GPU to a POD is complete.
The scheduling architecture diagram of the present disclosure is described below in conjunction with fig. 4.
First, vGPU registration is explained.
The Device Plugin registers a resource for each supported vGPU type and listens on a corresponding socket. Each GPU card registers only one vGPU type, and registers the total vGPU count (for example, a GPU card with 32 GB of video memory registered as type 2Q yields a count of 16). The vGPUs are not actually created at registration time; only a series of device IDs is generated, as sketched below. In some embodiments, only one vGPU type is registered per GPU card, and the total number of registered vGPUs equals the video memory capacity of the GPU card divided by the video memory capacity of each vGPU.
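A minimal sketch of this pre-generation, under the stated assumption of one vGPU type per card; the github.com/google/uuid library is our tooling choice, not the disclosure's.

```go
// A minimal sketch of device-ID pre-generation at registration time: a 32 GB
// card registered as type 2Q yields 32/2 = 16 device IDs; no vGPU exists yet.
package main

import (
	"github.com/google/uuid"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// makeDevices returns one pre-generated device per vGPU the card can hold.
func makeDevices(cardMemGiB, vgpuMemGiB int) []*pluginapi.Device {
	n := cardMemGiB / vgpuMemGiB // total vGPUs = card memory / per-vGPU memory
	devs := make([]*pluginapi.Device, 0, n)
	for i := 0; i < n; i++ {
		devs = append(devs, &pluginapi.Device{
			ID:     uuid.NewString(), // later reused as the mdev UUID at creation
			Health: pluginapi.Healthy,
		})
	}
	return devs
}
```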
Next, resource reporting is explained.
The GPU Device Plugin reports the node's vGPUs (number, type) as Extended Resources to the Kubelet through ListAndWatch(); the Kubelet further reports them to the Kubernetes API Server. For example, a node may contain two GPU cards: one of vGPU type 2Q (the 2 denoting 2 GB of video memory) from which 30 vGPUs can be created, and the other of type A from which 60 vGPUs can be created. Meanwhile, the number of GPU cards on the node (2) is reported as another Extended Resource.
In some embodiments, the GPU Device Plugin registers, for each GPU card, a vGPU type and vGPU count, and generates a series of device IDs; the GPU Device Plugin reports the vGPU number and type of the host node to the Kubelet as Extended Resources through the ListAndWatch method; and the vGPU number and type are further reported by the Kubelet to the Kubernetes API Server.
Then, the extended scheduling is explained.
The GPU Scheduler Extender allocates GPU types to PODs while keeping the allocation information in the POD spec in the form of an annotation; at filtering time it judges, based on this information, whether each card can serve the corresponding GPU type allocation.
The Kubernetes default scheduler calls the GPU Scheduler Extender's filter method over HTTP after performing all of its own filter steps, because when the default scheduler computes Extended Resources it can only judge whether the GPU count satisfies the demand; it cannot specifically judge whether the vGPU type does. The GPU Scheduler Extender therefore needs to check whether a single card contains the required vGPU type resources, as sketched below.
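A minimal sketch of such an extender filter endpoint using the real k8s.io/kube-scheduler extender API types; the per-card vGPU bookkeeping behind nodeHasVGPU is an assumption and is not shown. The default scheduler would be pointed at this endpoint through its standard extender policy configuration.

```go
// A minimal sketch of the scheduler-extender filter endpoint.
package main

import (
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// nodeHasVGPU answers: does any single card on this node still offer a free
// vGPU of the type the pod requests? (assumed lookup, not shown)
func nodeHasVGPU(node corev1.Node, pod *corev1.Pod) bool { return true }

func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := extenderv1.ExtenderFilterResult{
		Nodes:       &corev1.NodeList{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	for _, node := range args.Nodes.Items {
		if nodeHasVGPU(node, args.Pod) {
			result.Nodes.Items = append(result.Nodes.Items, node)
		} else {
			result.FailedNodes[node.Name] = "no single GPU card offers the requested vGPU type"
		}
	}
	json.NewEncoder(w).Encode(&result)
}
```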
Taking fig. 5 as an example: in a Kubernetes cluster composed of 3 NODEs each containing two GPU cards, we design each NODE to use GPU cards of the same vGPU type. When a user applies for a 2Q vGPU resource, the default scheduler scans all nodes and their vGPU (type, total, remaining) state.
In the primary filtering, the default filter can only filter by vGPU number and type; the state of each node is represented in the form vGPU (type, total, remaining). N1's remaining resources (A, 60, 45) are found not to meet the resource requirement, so node N1 is filtered out. Nodes N2 (2Q, 60, 15) and N3 (2Q, 60, 35) both meet the default scheduler's conditions from the perspective of overall scheduling. In some embodiments, in response to a resource allocation request initiated by the Kubelet to the GPU Device Plugin, all host nodes are filtered once by the Kubernetes default scheduler by vGPU number and type, to screen out the host nodes satisfying the condition (matching type and sufficient vGPU count).
The default scheduler then delegates the secondary filtering to the GPU Scheduler Extender. N2 and N3 both meet the requirement, so we select the node with the least remaining resources: N2 has 15 remaining and N3 has 35, so node N2 is selected. In some embodiments, for the host nodes screened out by the primary filtering, secondary filtering is performed by the GPU Scheduler Extender to screen out the GPU card whose remaining free resources satisfy the condition with the least free resources.
If no node resource is found to satisfy the conditions at this point, no binding is performed; the method exits directly without reporting an error, and scheduling is performed again.
After a node satisfying the conditions is found, the scheduler entrusts the GPU Scheduler Extender's bind method to bind the node and the POD, and here the GPU Scheduler Extender needs to do two things. First, find the optimally selected GPU card ID in the node according to the binpack rule, where optimal means that, among the different GPU cards of the same node, with the binpack principle as the criterion, the GPU card whose free resources satisfy the condition with the least remaining resources is preferred; this card ID is stored, together with the vGPU type 2Q, in the POD's annotation. Second, the binding of the POD to the selected node occurs at this point. In some embodiments, if a GPU card satisfying the conditions is found through the primary and secondary filtering, a vGPU is created on that card and bound to the corresponding POD. In some embodiments, the Kubernetes default scheduler can only judge whether the GPU count meets the requirement, not whether the vGPU type does; the GPU Scheduler Extender checks whether a single GPU card contains the required vGPU type resources.
If no GPU resource on the node is found to satisfy the conditions at this point, no binding is performed and scheduling is performed again. Accordingly, in some embodiments, if no GPU card satisfying the conditions is found through the primary and secondary filtering, scheduling is resumed without binding.
Taking fig. 6 as an example: when the GPU Scheduler Extender is to bind a POD applying for one 2Q vGPU to the screened-out node N2, it first compares the available resources of the different GPUs, namely GPU1 (2Q, 15, 6), GPU2 (2Q, 15, 4), and GPU3 (2Q, 15, 5). GPU2 is exactly the GPU card whose free resources satisfy the condition with the least remaining free resources, so GPU2 is selected. A sketch of this binpack selection follows.
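A minimal sketch of the binpack rule; the gpuCard struct is an assumed in-memory representation of the (type, total, remaining) triples above.

```go
// Among cards that offer the requested vGPU type and still have a free vGPU,
// pick the one with the fewest free vGPUs (the binpack principle).
package main

type gpuCard struct {
	ID    string
	Type  string // vGPU type, e.g. "2Q"
	Total int
	Free  int
}

// pickCard returns the eligible card with the least free capacity, or nil.
// For GPU1(2Q,15,6), GPU2(2Q,15,4), GPU3(2Q,15,5) and want="2Q", it picks GPU2.
func pickCard(cards []gpuCard, want string) *gpuCard {
	var best *gpuCard
	for i := range cards {
		c := &cards[i]
		if c.Type != want || c.Free < 1 {
			continue // wrong type or no free vGPU left
		}
		if best == nil || c.Free < best.Free {
			best = c
		}
	}
	return best
}
```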
The operation on the node is explained below.
After the Kubelet receives the event that the POD has been bound to the node, it creates the actual POD entity on the node. In this process, the Kubelet calls the Allocate method of the GPU Share Device Plugin, whose parameter is the vGPU type (2Q) applied for by the POD; the vGPU is actually created at this point. Inside the Allocate method, the corresponding POD is operated on according to the scheduling decision of the GPU Share Scheduler Extender.
The GPU information in the POD annotation is converted into environment variables and returned to the Kubelet, which actually creates the POD.
In some embodiments, the GPU Scheduler Extender keeps the allocation information in the POD spec in the form of an annotation while allocating the GPU type to the POD, and at filtering time judges, based on this information, whether each card corresponds to the GPU type allocation.
The monitoring unit is described below with reference to fig. 7.
In some embodiments, a monitoring unit is established, whose main functions are as follows: the device-plugin obtains the correspondence between vGPUs and PODs by monitoring changes to the checkpoint file generated by the Device Plugin Manager; it then obtains POD destruction events through the Kubernetes API Server, and thereby judges when a vGPU should be destroyed.
In some embodiments, POD destruction events and health status are obtained by monitoring the Kubernetes API Server, and the vGPUs corresponding to PODs that are out of their life cycle or unhealthy are destroyed to reclaim GPU resources. A sketch of this monitoring loop follows.
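A minimal sketch of such a monitoring loop, assuming the standard kubelet checkpoint path; parsing the checkpoint to recover the vGPU-to-POD mapping, and destroyVGPU itself, are left unspecified, and fsnotify plus client-go are our tooling choices rather than the disclosure's.

```go
// Watch the device-plugin checkpoint file for vGPU<->POD bindings, and watch
// POD deletions through the Kubernetes API server.
package main

import (
	"context"
	"log"

	"github.com/fsnotify/fsnotify"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

const checkpoint = "/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint"

func monitor(ctx context.Context, cs *kubernetes.Clientset, destroyVGPU func(podUID string)) error {
	// Watch the checkpoint file the Device Plugin Manager writes, to learn
	// which device IDs (vGPUs) are bound to which POD.
	fw, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer fw.Close()
	if err := fw.Add(checkpoint); err != nil {
		return err
	}

	// Watch POD events through the Kubernetes API Server.
	pw, err := cs.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for {
		select {
		case ev := <-fw.Events:
			log.Printf("checkpoint changed (%v): re-read vGPU<->POD mapping", ev.Op)
		case ev := <-pw.ResultChan():
			if pod, ok := ev.Object.(*corev1.Pod); ok && ev.Type == watch.Deleted {
				destroyVGPU(string(pod.UID)) // reclaim this POD's vGPUs
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```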
The process of vGPU creation, scheduling, mounting to a POD, and running is described below in connection with fig. 8.
In some embodiments, the process comprises the following steps: the Device Plugin registers a resource for each supported vGPU type; the Kubelet requests the ListAndWatch interface, which returns a series of pre-generated universally unique identifiers (UUIDs) as the device IDs; the Kubelet issues an AllocateRequest (assigning a UUID); the default scheduler and the extended scheduler screen out a host node and a physical GPU meeting the requirements, and a particular UUID is allocated; the AllocateResponse returns the device path and driver directory information; the Device Plugin writes the UUID into the creation directory to create a vGPU; the vGPU is hot-plugged into the container virtual machine (such as Kata) environment; and the POD runs.
In some embodiments, when the default scheduler and the extended scheduler screen out the qualifying host node and physical GPU, they may do so, for example, in the manner of the primary and secondary filtering described in detail above. The vGPU creation step is sketched below.
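A minimal sketch of "writing the UUID into the creation directory": with GRID, a vGPU is a mediated device (mdev) instantiated by writing a UUID into the sysfs create file of the chosen vGPU profile. The PCI address and the profile directory name ("nvidia-231") below are illustrative assumptions for one particular host.

```go
// Create an mdev-backed vGPU on a physical GPU via sysfs.
package main

import (
	"fmt"
	"os"
)

// createVGPU instantiates a vGPU by writing its UUID into the mdev "create"
// file of the chosen profile on the chosen physical GPU, e.g.
// createVGPU("0000:3b:00.0", "nvidia-231", uuid).
func createVGPU(pciAddr, mdevType, uuid string) error {
	createPath := fmt.Sprintf(
		"/sys/class/mdev_bus/%s/mdev_supported_types/%s/create",
		pciAddr, mdevType)
	// writing the UUID triggers the kernel to instantiate the vGPU
	return os.WriteFile(createPath, []byte(uuid), 0200)
}
```

Destroying the instance later is the mirror operation: writing 1 to the remove file under the instance's own mdev sysfs directory.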
Compared with the prior art, the virtualized GPU scheduling mechanism designed here can effectively allocate and manage virtualized GPU resources, solving the problems of coarse-grained cluster scheduling and of GPU resource allocation and reclamation. First, the designed extended scheduling function can analyze resource requests and the cluster's existing GPU resources, and allocate vGPU resources to different PODs using the vGPU type as the screening rule. Second, by monitoring the relevant Kubernetes modules, the correspondence between vGPUs and PODs is obtained, POD destruction events are captured, and the vGPU is destroyed in time to reclaim GPU resources. Flexible scheduling of GPU cluster computing power can thus be accomplished, GPU resource utilization improved, and an efficient and energy-saving data center ensured.
Fig. 9 illustrates an exemplary configuration of a computing device 900 capable of implementing embodiments in accordance with the present disclosure.
Computing device 900 is an example of a hardware device to which the above-described aspects of the disclosure can be applied. Computing device 900 may be any machine configured to perform processing and/or computing. Computing device 900 may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a Personal Data Assistant (PDA), a smart phone, an in-vehicle computer, or a combination thereof.
As shown in fig. 9, computing device 900 may include one or more elements that may be connected to or communicate with a bus 902 via one or more interfaces. Bus 902 can include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus, to name a few. Computing device 900 may include, for example, one or more processors 904, one or more input devices 906, and one or more output devices 908. The one or more processors 904 may be any kind of processor and may include, but are not limited to, one or more general purpose processors or special purpose processors (such as special purpose processing chips). The processor 904 may be configured to perform the methods of the present disclosure, for example. Input device 906 may be any type of input device capable of inputting information to a computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote controller. Output device(s) 908 can be any type of device capable of presenting information and can include, but are not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer.
The computing device 900 may also include or be connected to a non-transitory storage device 914, which may be any storage device that is non-transitory and can implement data storage, and may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape or any other magnetic medium, a compact disk or any other optical medium, a cache memory, and/or any other memory chip or module, and/or any other medium from which a computer can read data, instructions, and/or code. Computing device 900 may also include Random Access Memory (RAM) 910 and Read Only Memory (ROM) 912. The ROM 912 can store programs, utilities, or processes to be executed in a nonvolatile manner. The RAM 910 may provide volatile data storage and stores instructions related to the operation of the computing device 900. The computing device 900 may also include a network/bus interface 916 coupled to a data link 918. The network/bus interface 916 can be any kind of device or system capable of enabling communication with external apparatuses and/or networks, and can include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication facilities, etc.).
The present disclosure may be implemented as any combination of apparatus, systems, integrated circuits, and computer programs on non-transitory computer readable media. One or more processors may be implemented as an Integrated Circuit (IC), an Application Specific Integrated Circuit (ASIC), or a large scale integrated circuit (LSI), a system LSI, or a super LSI, or as an ultra LSI package that performs some or all of the functions described in this disclosure.
The present disclosure includes the use of software, applications, computer programs or algorithms. Software, applications, computer programs, or algorithms may be stored on a non-transitory computer readable medium to cause a computer, such as one or more processors, to perform the steps described above and depicted in the figures. For example, one or more memories store software or algorithms in executable instructions and one or more processors can associate a set of instructions to execute the software or algorithms to provide various functionality in accordance with embodiments described in this disclosure.
Software and computer programs (which may also be referred to as programs, software applications, components, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural, object-oriented, functional, or logical programming language, or in assembly or machine language. The term "computer-readable medium" refers to any computer program product, apparatus, or device used to provide machine instructions or data to a programmable data processor, such as magnetic disks, optical disks, solid-state storage devices, memories, and Programmable Logic Devices (PLDs), including a computer-readable medium that receives machine instructions as a computer-readable signal.
By way of example, computer-readable media may comprise Dynamic Random Access Memory (DRAM), Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Compact Disk Read Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store the desired computer-readable program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or a general-purpose or special-purpose processor. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
The subject matter of the present disclosure is provided as examples of apparatus, systems, methods, and programs for performing the features described in the present disclosure. However, other features or variations are contemplated in addition to the features described above. It is contemplated that the implementation of the components and functions of the present disclosure may be accomplished with any emerging technology that may replace the technology of any of the implementations described above.
Additionally, the above description provides examples, and does not limit the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the spirit and scope of the disclosure. Various embodiments may omit, substitute, or add various procedures or components as appropriate. For example, features described with respect to certain embodiments may be combined in other embodiments.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (10)

1. A method for scheduling virtualized GPUs in a container system, the method comprising the steps of:
in response to a resource allocation request initiated by the Kubelet to the GPU Device Plugin, performing, by the Kubernetes default scheduler, primary filtering of all host nodes by vGPU number and type, to screen out the host nodes that satisfy the conditions;
performing, by a GPU Scheduler Extender, secondary filtering on the host nodes screened out by the primary filtering, to screen out the GPU card whose free resources satisfy the conditions with the least free resources;
if a GPU card satisfying the conditions is found through the primary and secondary filtering, creating a vGPU on that card and binding the vGPU to the corresponding POD; and
if no GPU card satisfying the conditions is found through the primary and secondary filtering, performing scheduling again.
2. The method of claim 1, further comprising the step of:
registering, by the GPU Device Plugin, the vGPU type and number for a GPU card, and generating a series of device IDs;
reporting, by the GPU Device Plugin through the ListAndWatch method, the vGPU number and type of the host node to the Kubelet as Extended Resources; and
further reporting, by the Kubelet, the vGPU number and type to the Kubernetes API Server.
3. The method of claim 1, further comprising the step of:
obtaining POD destruction events and health status by monitoring the Kubernetes API Server, and
destroying the vGPUs corresponding to PODs that are out of their life cycle or unhealthy, to reclaim GPU resources.
4. The method of claim 1, further comprising the step of:
and the GPU Scheduler Extender allocates the GPU type to the POD and simultaneously retains the allocation information in the POD spec in an annotation form, and judges whether each card corresponds to the GPU type allocation according to the information at the filtering moment.
5. The method of claim 1, wherein,
the Kubernetes default scheduler can only judge whether the GPU number meets the requirement, and cannot judge whether the vGPU type meets the requirement; and
the GPU Scheduler Extender checks whether a single GPU card contains the required vGPU type resources.
6. The method of claim 2, wherein,
only one vGPU type is registered per GPU card, and the total number of registered vGPUs equals the video memory capacity of the GPU card divided by the video memory capacity of each vGPU.
7. The method of claim 2, wherein,
the vGPU is not actually created at the time of registration; only the device IDs are generated.
8. The method of claim 2, further comprising:
and reporting the number of the GPU cards on the host node as another Extended Resource.
9. A virtualized GPU scheduling apparatus for a container system, comprising:
a memory having instructions stored thereon; and
a processor configured to execute instructions stored on the memory to perform the method of any of claims 1 to 8.
10. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1-8.
CN202110907216.6A 2021-08-09 2021-08-09 Virtual GPU scheduling method, device and medium for container system Pending CN115904675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110907216.6A CN115904675A (en) 2021-08-09 2021-08-09 Virtual GPU scheduling method, device and medium for container system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110907216.6A CN115904675A (en) 2021-08-09 2021-08-09 Virtual GPU scheduling method, device and medium for container system

Publications (1)

Publication Number Publication Date
CN115904675A (en) 2023-04-04

Family

ID=86490132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110907216.6A Pending CN115904675A (en) 2021-08-09 2021-08-09 Virtual GPU scheduling method, device and medium for container system

Country Status (1)

Country Link
CN (1) CN115904675A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination