CN118764491A - A method for managing an instance and an instance management platform - Google Patents

A method for managing an instance and an instance management platform

Info

Publication number
CN118764491A
CN118764491A (application CN202310308241.1A)
Authority
CN
China
Prior art keywords
scaling
historical task
task
model
instances
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310308241.1A
Other languages
Chinese (zh)
Inventor
田靖轩
郭辉
王文辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202310308241.1A priority Critical patent/CN118764491A/en
Priority to PCT/CN2024/070390 priority patent/WO2024198602A1/en
Publication of CN118764491A publication Critical patent/CN118764491A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1031 Controlling of the operation of servers by a load balancer, e.g. adding or removing servers that serve requests
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/101 Server selection for load balancing based on network conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1029 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers using data related to the state of servers by a load balancer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5021 Priority
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/508 Monitor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the present application provides a method for managing instances and an instance management platform. The method includes: the instance management platform monitors the instances running tasks in a scaling group to obtain monitoring metric data such as the resource utilization of the instances, and generates a historical task profile based on the resource utilization; the instance management platform uses a first model to generate one or more scaling policies based on the historical task profile and recommends the scaling policies to the user; after the user makes a selection, the instance management platform adjusts the number of instances or the specifications of the instances in the scaling group according to the scaling policy selected by the user. The method can use a model to automatically generate reasonable scaling policies based on the historical task profile, thereby reducing the cumbersome configuration and large errors caused by manually configuring scaling policies.

Description

A method for managing an instance and an instance management platform

Technical Field

The present application relates to the field of cloud services, and more specifically, to a method for managing instances and an instance management platform.

Background Art

A cloud-service-based distributed task scheduling system mainly provides task partitioning and orchestration, as well as real-time, precise task scheduling, with functions such as scheduled tasks, one-time tasks, task orchestration, and distributed execution of batch tasks. To improve the service capabilities of the cloud service platform, the task scheduling system also needs to control Elastic Compute Service (ECS) resources for elastic scaling, that is, adding resources during busy periods and reclaiming them during idle periods, thereby further reducing resource costs.

When using a distributed task scheduling system, the user needs to formulate, based on historical experience, an elastic scaling policy that matches the task execution profile, that is, a policy that decides when to scale, how to scale, and which metrics to scale on. Configuring such a policy is cumbersome, and the high degree of manual intervention makes it error-prone.

Summary of the Invention

The present application provides a method for managing instances and an instance management platform. The method can automatically generate scaling policies, solving the problems that scaling policy configuration is cumbersome and that a high degree of manual intervention is prone to errors.

In a first aspect, a method for managing instances is provided. The method includes: an instance management platform monitors instances running tasks in a scaling group to obtain resource utilization of the instances, where the resource utilization includes one or more of the following: central processing unit (CPU) usage and memory usage; the instance management platform generates a historical task profile based on the resource utilization, where the historical task profile includes one or more of the following: historical task timing features and historical task resource features, the historical task timing features indicating timing characteristics of historical tasks run in the scaling group, and the historical task resource features indicating a resource type of the scaling group, the resource type including one or more of the following: compute-intensive, memory-intensive, input/output-intensive (IO-intensive, also called read/write-intensive), and graphics-intensive; the instance management platform uses a first model to generate one or more scaling policies based on the historical task profile, where the input of the first model is the historical task profile and the output of the first model is the one or more scaling policies; the instance management platform recommends the one or more scaling policies to a user; the instance management platform determines the scaling policy selected by the user; and the instance management platform adjusts the number of instances or the specifications of the instances in the scaling group according to the scaling policy selected by the user.

Based on the above technical solution, a large amount of high-quality data can be fully utilized, and a model can automatically generate reasonable scaling policies from the historical task profile, which solves the problems of cumbersome scaling policy configuration and large manual-intervention errors and also improves the scaling effect. Moreover, formulating scaling policies from the historical task profile avoids the heavy data computation, cumbersome parameter conversion, and low accuracy that result from formulating scaling policies directly from raw monitoring data, making policy generation more efficient and reasonable.

In combination with the first aspect, in certain implementations of the first aspect, the scaling policy includes one or more of the following: a scheduled scaling policy and an alarm scaling policy. Using the first model to generate one or more scaling policies based on the historical task profile includes: using the first model to determine, based on the historical task profile, one or more of the following parameters of the scheduled scaling policy: scaling time, scheduled scaling semantics, and scheduled scaling scale; and using the first model to determine, based on the historical task profile, one or more of the following parameters of the alarm scaling policy: alarm metric, alarm threshold, alarm scaling semantics, and alarm scaling scale.

Based on the above implementation, the relevant parameters to be determined by the policy generation model are specified in detail, so that the scaling group can subsequently adjust instances more accurately, further improving the scaling effect.

In combination with the first aspect, in certain implementations of the first aspect, the method further includes: the instance management platform determines the remaining resources of the scaling group based on the number of instances and the specifications of the instances; and the instance management platform decides, based on the resource type and the remaining resources of the scaling group, whether to suspend the task and wait or to schedule the task to another scaling group.

Based on the above implementation, the task-execution throttling parameters can be dynamically adjusted according to the resource type and remaining resources of the scaling group, and the optimal scaling group can be selected for task dispatch, which avoids wasting resources and improves the task execution efficiency of the scaling group.
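As an illustration only, the following is a minimal sketch of this kind of dispatch decision, assuming a simple in-memory representation of scaling groups and tasks; the class names, fields, and decision rule are hypothetical and are not part of the claimed method.

```python
from dataclasses import dataclass

@dataclass
class ScalingGroup:
    name: str
    resource_type: str        # e.g. "compute", "memory", "io", "graphics"
    free_cpu_cores: float     # remaining vCPU in the group
    free_memory_gib: float    # remaining memory in the group

@dataclass
class Task:
    resource_type: str
    cpu_cores: float
    memory_gib: float

def dispatch(task: Task, current: ScalingGroup, others: list[ScalingGroup]) -> str:
    """Decide whether to run the task now, move it to another group, or queue it."""
    if (current.free_cpu_cores >= task.cpu_cores
            and current.free_memory_gib >= task.memory_gib):
        return f"run in {current.name}"
    # Prefer another group whose resource type matches the task and that has room.
    for group in others:
        if (group.resource_type == task.resource_type
                and group.free_cpu_cores >= task.cpu_cores
                and group.free_memory_gib >= task.memory_gib):
            return f"reschedule to {group.name}"
    # Otherwise suspend the task and wait for resources to be released.
    return "suspend and wait"
```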

In combination with the first aspect, in some implementations of the first aspect, the method further includes: the instance management platform presents the historical task timing features and the historical task resource features to the user.

Based on this technical solution, when the user formulates a scaling policy on their own, the historical task profile can assist the user in completing the policy, which helps improve the scaling effect of the policy.

In combination with the first aspect, in certain implementations of the first aspect, generating the historical task profile based on the resource utilization includes: preprocessing the resource utilization to obtain processed data, where the preprocessing includes one or more of the following: normalization and vectorization; and using a second model to generate the historical task profile based on the processed data, where the input of the second model is the processed data and the output of the second model is the historical task profile.

Based on this solution, the second model extracts the historical task profile from data such as the resource utilization, which is faster and more accurate than extracting the profile through manual analysis; moreover, the second model can extract the historical task profile from the preprocessed data, which makes profile generation more efficient and accurate.

In combination with the first aspect, in certain implementations of the first aspect, the historical task timing features include one or more of the following: average running time of historical tasks, peak periods of historical task volume, trough periods of historical task volume, and historical task execution cycle.

In a second aspect, an instance management platform is provided, including: a monitoring module configured to monitor instances running tasks in a scaling group to obtain resource utilization of the instances, where the resource utilization includes one or more of the following: central processing unit (CPU) usage and memory usage; a policy generation module configured to generate a historical task profile based on the resource utilization, where the historical task profile includes one or more of the following: historical task timing features and historical task resource features, the historical task timing features indicating timing characteristics of historical tasks run in the scaling group, and the historical task resource features indicating a resource type of the scaling group, the resource type including one or more of the following: compute-intensive, memory-intensive, input/output-intensive, and graphics-intensive; the policy generation module is further configured to use a first model to generate one or more scaling policies based on the historical task profile, where the input of the first model is the historical task profile and the output of the first model is the one or more scaling policies; an execution module configured to recommend the one or more scaling policies to a user; the execution module is further configured to determine the scaling policy selected by the user; and the execution module is further configured to adjust the number of instances or the specifications of the instances in the scaling group according to the scaling policy selected by the user.

Based on the above technical solution, the policy generation module can make full use of a large amount of high-quality data, use a model to automatically generate reasonable scaling policies from the historical task profile, and enable a scaling policy by interacting with the user, which solves the problems of cumbersome scaling policy configuration and large manual-intervention errors and improves the scaling effect. Moreover, formulating scaling policies from the historical task profile avoids the heavy data computation, cumbersome parameter conversion, and low accuracy of formulating scaling policies directly from monitored data, making policy generation more efficient and reasonable.

In combination with the second aspect, in some implementations of the second aspect, the scaling policy includes one or more of the following: a scheduled scaling policy and an alarm scaling policy. The policy generation module is specifically configured to: use the first model to determine, based on the historical task profile, one or more of the following parameters of the scheduled scaling policy: scaling time, scheduled scaling semantics, and scheduled scaling scale; and use the first model to determine, based on the historical task profile, one or more of the following parameters of the alarm scaling policy: alarm metric, alarm threshold, alarm scaling semantics, and alarm scaling scale.

In combination with the second aspect, in certain implementations of the second aspect, the execution module is further configured to: determine the remaining resources of the scaling group based on the number of instances and the specifications of the instances; and decide, based on the resource type and the remaining resources of the scaling group, whether to suspend the task and wait or to schedule the task to another scaling group.

In combination with the second aspect, in some implementations of the second aspect, the execution module is further configured to present the historical task timing features and the historical task resource features to the user.

In combination with the second aspect, in certain implementations of the second aspect, the policy generation module is specifically configured to: preprocess the resource utilization to obtain processed data, where the preprocessing includes one or more of the following: normalization and vectorization; and use a second model to generate the historical task profile based on the processed data, where the input of the second model is the processed data and the output of the second model is the historical task profile.

In combination with the second aspect, in certain implementations of the second aspect, the historical task timing features include one or more of the following: average running time of historical tasks, peak periods of historical task volume, trough periods of historical task volume, and historical task execution cycle.

In a third aspect, a computing device is provided, including a processor and a memory, where the memory is configured to store instructions, and the processor is configured to call and execute the instructions from the memory, so that the computing device performs the method in the first aspect or any possible implementation of the first aspect.

In a fourth aspect, a computing device cluster is provided, including at least one computing device, each computing device including a processor and a memory, where the memory is configured to store instructions, and the processor is configured to call and execute the instructions from the memory, so that the computing device cluster performs the method in the first aspect or any possible implementation of the first aspect.

In combination with the fourth aspect, in certain implementations of the fourth aspect, the processor may be a general-purpose processor, which may be implemented by hardware or software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, and the memory may be integrated in the processor or may exist independently outside the processor.

In a fifth aspect, a computer-readable storage medium is provided, including computer program instructions. When the computer program instructions are run by a computing device cluster, the computing device cluster performs the method in the first aspect or any possible implementation of the first aspect.

In combination with the fifth aspect, in certain implementations of the fifth aspect, the storage medium may specifically be a non-volatile storage medium.

In a sixth aspect, a computer program product including instructions is provided. When the instructions are run by a computing device cluster, the computing device cluster performs the method in the first aspect or any possible implementation of the first aspect.

In a seventh aspect, a chip is provided. The chip obtains instructions and executes the instructions to implement the method in the first aspect or any possible implementation of the first aspect.

In combination with the seventh aspect, in certain implementations of the seventh aspect, the chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method in the first aspect or any possible implementation of the first aspect.

In combination with the seventh aspect, in certain implementations of the seventh aspect, the chip may further include a memory storing instructions, and the processor is configured to execute the instructions stored in the memory. When the instructions are run, the processor performs the method in the first aspect or any possible implementation of the first aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a cloud service system architecture provided in an embodiment of the present application.

FIG. 2 is a schematic flowchart of a method 200 for managing instances provided in an embodiment of the present application.

FIG. 3 is a schematic flowchart of a method 300 for generating a historical task profile provided in an embodiment of the present application.

FIG. 4 is a schematic diagram of a client interface provided in an embodiment of the present application.

FIG. 5 is a schematic structural block diagram of an instance management platform provided in an embodiment of the present application.

FIG. 6 is a schematic structural block diagram of a computing device provided in an embodiment of the present application.

FIG. 7 is a schematic structural block diagram of a computing device cluster provided in an embodiment of the present application.

FIG. 8 is a schematic structural block diagram of another computing device cluster provided in an embodiment of the present application.

DETAILED DESCRIPTION

The technical solutions in the present application will be described below with reference to the accompanying drawings.

The present application presents various aspects, embodiments, or features around a system that includes multiple devices, components, modules, and the like. It should be understood that each system may include additional devices, components, modules, and the like, and/or may not include all of the devices, components, modules, and the like discussed in connection with the figures. In addition, combinations of these solutions may also be used.

In addition, in the embodiments of the present application, words such as "exemplary" and "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described as an "example" in the present application should not be construed as being preferred over or more advantageous than other embodiments or designs. Rather, the word "example" is intended to present a concept in a concrete manner.

The architectures and service scenarios described in the embodiments of the present application are intended to illustrate the technical solutions of the embodiments more clearly and do not constitute a limitation on them. A person of ordinary skill in the art will appreciate that, as architectures evolve and new service scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.

References to "one embodiment" or "some embodiments" and the like in this specification mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", and "in still other embodiments" appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.

In the present application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following cases: A exists alone, both A and B exist, and B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.

The technical solutions of the present application are applicable to a cloud platform system, hereinafter referred to as a cloud platform. A cloud platform is a server platform that provides a running environment and resources, such as instances and memory, for applications, and supports multi-instance deployment of applications to handle highly concurrent access by external users.

FIG. 1 shows a schematic diagram of a cloud service system architecture provided in an embodiment of the present application. As shown in FIG. 1, a client can access the cloud platform via the Internet. Typically, the cloud platform includes multiple servers, such as server 1 to server n, and each server includes cloud service resources; for example, server 1 includes cloud service 1 and cloud service 2, and the cloud service resources provide the corresponding cloud services for tenants. The client is connected to the servers through a management platform 10. The hardware layer of a server may include a processor, a memory, a network interface card, a data bus, and the like.

To facilitate understanding, relevant terms and concepts that may be involved in the embodiments of the present application are first introduced below.

1. Scaling group

A scaling group can be understood as a collection of ECS instances with the same application scenario, or as the smallest unit for managing elastically scalable cloud servers. The number of instances in a scaling group can be scaled elastically. For example, if the load on the instances in a scaling group is too high, the number of instances in the group can be increased to share the load of each instance; if the load on the instances is low, some instances can be removed to save system resources.

2. Scaling policy

A scaling policy specifies how a scaling group performs elastic scaling. Classic elastic scaling policies include scheduled scaling policies and alarm scaling policies.

A scheduled scaling policy indicates that an elastic scaling activity is performed at a set time. It needs to specify: the scaling time, i.e., when to scale; the scaling scale, i.e., the number of instances to add or remove when scaling is triggered; and the scaling semantics, i.e., the action to perform, such as increase, decrease, adjust to, increase by a certain percentage, or decrease by a certain percentage.

An alarm scaling policy indicates that an elastic scaling activity is performed when an observability metric crosses a set threshold. It needs to specify: the alarm metric, such as central processing unit (CPU) usage, memory usage, or a custom service metric; the alarm threshold, i.e., scaling is performed when the monitored metric reaches this threshold; the alarm scaling scale, i.e., the number of instances to add or remove when scaling is triggered; and the alarm scaling semantics, i.e., the action to perform, such as increase, decrease, adjust to, increase by a certain percentage, or decrease by a certain percentage.
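As a concrete illustration of the parameters listed above, the following is a minimal sketch that models the two classic policy types as plain data structures; the class names, field names, and example values are assumptions made for illustration only, not the platform's actual data model.

```python
from dataclasses import dataclass

@dataclass
class ScheduledScalingPolicy:
    scaling_time: str        # when to scale, e.g. "2023/03/01 00:00:00"
    semantics: str           # action: "add", "remove", "set_to", "add_percent", ...
    scale: int               # how many instances (or what percentage) to change

@dataclass
class AlarmScalingPolicy:
    alarm_metric: str        # e.g. "cpu_usage", "memory_usage", or a custom metric
    alarm_threshold: float   # scale when the monitored metric crosses this value
    semantics: str           # action: "add", "remove", "set_to", ...
    scale: int               # how many instances to change when triggered

# Example policies corresponding to the definitions above (illustrative values).
scheduled = ScheduledScalingPolicy("2023/03/01 00:00:00", "add", 3)
alarm = AlarmScalingPolicy("cpu_usage", 70.0, "add", 2)
```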

The terms involved in the present application have been briefly described above and will not be repeated in the following embodiments. In addition, the above description of the terms is provided only for ease of understanding and does not limit the protection scope of the embodiments of the present application.

Distributed task scheduling systems are widely used in continuous delivery scenarios such as DevOps (a collective term for processes, methods, and systems that emphasize communication and collaboration between software developers (Dev) and IT operations staff (Ops)). A distributed task scheduling system can reasonably hide the details of ECS resource scheduling on the cloud platform, reducing the user's mental burden and allowing them to better focus on their business.

A task scheduling system can obtain the task data of a target task and upload the task data to the cloud platform. Current task scheduling systems for cloud platforms, such as Kubernetes, can provide a component such as the Kubernetes Auto Scaler to manage and scale scaling groups, but they often require the user to analyze the workload characteristics and configure the scaling policy themselves, that is, to determine the scaling period, scaling scale, alarm metrics, and so on. The system then collects monitoring metric data, triggers the scaling policy by alarm or by time, and dynamically adjusts ECS instances through the resource adjustment interface provided by the cloud vendor. The cumbersome manual configuration of scaling policies consumes substantial manpower and resources, and configuration based on manual experience may introduce errors.

In view of this, an embodiment of the present application provides a method for managing instances. The method monitors the instances running tasks in a scaling group to obtain resource utilization, generates a historical task profile based on the resource utilization, and then uses a model to automatically generate scaling policies based on the historical task profile, which effectively solves the problems that scaling policy configuration is cumbersome and that the high degree of manual intervention is prone to errors.

FIG. 2 shows a schematic flowchart of a method 200 for managing resources on a cloud platform provided in an embodiment of the present application. The method includes the following steps.

S210: The instance management platform monitors the instances running tasks in the scaling group to obtain monitoring metric data of the instances.

In some embodiments, the instance management platform may be a cloud platform. A cloud can be loosely understood as a group of remote computers that work together to form a platform providing users with a variety of services, for example, the computing power to analyze business data, where computing may refer to the various needs of building business systems. In some embodiments, the cloud platform may also be called a cloud computing platform.

Specifically, an instance may be a server resource. For example, an instance may be a virtual machine, a container, a database, a microservice, or the like. In some embodiments, an instance may also be referred to as an ECS instance.

In some embodiments, the monitoring metric data may include resource utilization, where the resource utilization may include one or more of the following: CPU usage, memory usage, GPU usage, disk utilization, and GPU memory usage.

In some embodiments, the monitoring metric data may also include one or more of the following: load metrics, network metrics, custom metrics, disk read rate, and disk write rate. The network metrics may include public-network bandwidth metrics and private-network bandwidth metrics.
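The sketch below shows one way such monitoring metric data could be collected and represented per instance; it is a minimal illustration that uses the psutil library as a stand-in for the platform's monitoring agent, and the field set and sampling interval are assumptions rather than the platform's actual monitoring interface.

```python
import time
from dataclasses import dataclass

import psutil  # assumed to be available; stands in for the platform's monitoring agent

@dataclass
class MetricSample:
    instance_id: str
    timestamp: float
    cpu_usage: float      # percent
    memory_usage: float   # percent
    disk_usage: float     # percent

def sample_instance(instance_id: str) -> MetricSample:
    """Take one monitoring sample for the (local) instance."""
    return MetricSample(
        instance_id=instance_id,
        timestamp=time.time(),
        cpu_usage=psutil.cpu_percent(interval=1.0),
        memory_usage=psutil.virtual_memory().percent,
        disk_usage=psutil.disk_usage("/").percent,
    )

# A monitoring loop would append such samples to a per-instance time series,
# which later serves as the input for building the historical task profile.
```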

S220: The instance management platform generates a historical task profile based on the monitoring metric data.

The historical task profile reflects the characteristics of the historical tasks run by the instances in the scaling group. It should be understood that the historical tasks may be all or part of the historical tasks run by the instances. In some implementations, the historical task profile may include one or more of the following: historical task timing features and historical task resource features. The historical task timing features indicate timing characteristics of historical tasks run in the scaling group, and the historical task resource features indicate the resource type of the scaling group.

Optionally, the historical task timing features may include one or more of the following: average running time of historical tasks, peak periods of historical task volume, trough periods of historical task volume, and historical task execution cycle. Reflecting the timing features of historical tasks through these parameters helps the subsequent policy generation model generate scaling policies more accurately and effectively.

Optionally, the resource type may include one or more of the following: compute-intensive, memory-intensive, input/output-intensive (IO-intensive, also called read/write-intensive), and graphics-intensive. Compute-intensive tasks usually require a large amount of computation and consume CPU resources; IO-intensive tasks usually consume little CPU, and most of the task time is spent waiting for IO operations to complete; graphics-intensive tasks usually require a large amount of graphics rendering and strong processing and storage capabilities. Based on the historical task resource features, the subsequent policy generation model can more accurately predict the amount of resources required at different moments during task execution, thereby generating a more reasonable scaling policy.
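Putting the two feature groups together, a historical task profile could be represented roughly as follows; this is a minimal sketch, and the structure, field names, and example values are illustrative assumptions rather than the format actually used by the platform.

```python
from dataclasses import dataclass, field

@dataclass
class HistoricalTaskProfile:
    # Timing features of historical tasks in the scaling group.
    avg_runtime_minutes: float                              # average running time of historical tasks
    peak_hours: list[int] = field(default_factory=list)     # hours of day with peak task volume
    trough_hours: list[int] = field(default_factory=list)   # hours of day with low task volume
    execution_period: str = "daily"                         # e.g. "daily", "weekly"
    # Resource feature of the scaling group.
    resource_type: str = "compute"                          # "compute", "memory", "io", or "graphics"

# Example: tasks run ~45 minutes on average, peak at 9:00-11:00, trough at 2:00-4:00, daily.
profile = HistoricalTaskProfile(45.0, [9, 10, 11], [2, 3, 4], "daily", "compute")
```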

S230: The instance management platform uses the first model to generate one or more scaling policies based on the historical task profile.

In the specific examples of the present application, the "first model" may also be described as the "policy generation model", which will not be repeated below. It should be understood that the following are merely examples given for ease of description and understanding and do not constitute any limitation on the technical solution.

In some implementations, the policy generation model may be a machine learning model. It should be understood that the policy generation model may be trained and can output scaling policies based on the input historical task profile. For example, the machine learning algorithm implementing the policy generation model may include one or more of the following: a classification algorithm, such as k-nearest neighbors; a regression algorithm, such as linear regression; or a clustering algorithm, such as k-means.
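As one hedged illustration of such a model, the sketch below clusters historical task profiles with k-means and maps each cluster to a candidate policy template. This is a simplified stand-in for the first model, not the claimed implementation; the feature encoding, cluster count, scikit-learn dependency, and policy templates are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans  # scikit-learn is assumed to be available

# Each row encodes one scaling group's profile (illustrative encoding):
# [avg runtime (min), peak hour, trough hour, cpu-bound fraction]
profiles = np.array([
    [45.0,  9.0, 3.0, 0.9],
    [50.0, 10.0, 2.0, 0.8],
    [12.0, 20.0, 6.0, 0.2],
    [15.0, 21.0, 5.0, 0.3],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

# Hypothetical mapping from cluster id to a recommended policy template.
# (Which template fits which cluster would in practice be learned and validated, not fixed.)
policy_templates = {
    0: "scheduled: scale out before the peak hour, scale in after the trough",
    1: "alarm: add instances when cpu_usage > 70%, remove when cpu_usage < 30%",
}

for group_id, cluster in enumerate(kmeans.labels_):
    print(f"scaling group {group_id}: recommend -> {policy_templates[cluster]}")
```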

Optionally, the scaling policies may include one or more of the following: a scheduled scaling policy, an alarm scaling policy, and a periodic scaling policy. The alarm scaling policy instructs the instance management platform to automatically increase, decrease, or set the number or specifications of instances based on monitoring-system alarm data (for example, CPU usage); the scheduled scaling policy instructs the instance management platform to automatically increase, decrease, or set the number or specifications of instances at a configured point in time; and the periodic scaling policy instructs the instance management platform to periodically increase, decrease, or set the number or specifications of instances according to a configured period (for example, daily, weekly, or monthly).

S240: The instance management platform recommends one or more scaling policies to the user and determines the scaling policy selected by the user.

In some implementations, after generating one or more scaling policies, the instance management platform may interact with the user, for example, by recommending the above scaling policies to the user.

For example, the scaling policies may be presented to the user through a front end. For example, one or more of the following parameters of a scheduled scaling policy may be presented to the user through the front end: scaling time, scheduled scaling semantics, and scheduled scaling scale. Likewise, one or more of the following parameters of an alarm scaling policy may be presented to the user through the front end: alarm metric, alarm threshold, alarm scaling semantics, and alarm scaling scale.

It should be understood that the user may choose whether to enable a recommended scaling policy and which one or more of the scaling policies to enable. For a scaling policy that the user chooses to enable, scaling of the scaling group is triggered when the policy conditions are met; a scaling policy that the user does not enable will not trigger scaling of the scaling group.

S250: The instance management platform adjusts the number of instances or the specifications of the instances in the scaling group according to the scaling policy selected by the user.

In some implementations, a scaling group can be scaled horizontally (also called scaling out or in) or vertically (also called scaling up or down). Horizontal scaling can be understood as adjusting the number of instances in the scaling group, for example, increasing or decreasing the number of instances. Vertical scaling can be understood as adjusting the specifications of the instances in the scaling group, for example, increasing or decreasing the CPU, memory, bandwidth, or other configurations of the instances.
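The following sketch illustrates the difference between the two adjustment types against a hypothetical in-memory scaling-group object; the ScalingGroup class and its methods are assumptions for illustration, not the platform's actual resource adjustment interface.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    cpu_cores: int
    memory_gib: int

@dataclass
class ScalingGroup:
    instances: list[Instance]

    def scale_horizontally(self, delta: int, template: Instance) -> None:
        """Add (delta > 0) or remove (delta < 0) instances."""
        if delta >= 0:
            self.instances.extend(Instance(template.cpu_cores, template.memory_gib)
                                  for _ in range(delta))
        else:
            del self.instances[:min(-delta, len(self.instances))]

    def scale_vertically(self, cpu_cores: int, memory_gib: int) -> None:
        """Change the specification of every instance in the group."""
        for inst in self.instances:
            inst.cpu_cores = cpu_cores
            inst.memory_gib = memory_gib

group = ScalingGroup([Instance(4, 8) for _ in range(3)])
group.scale_horizontally(+2, template=Instance(4, 8))   # horizontal scaling: 3 -> 5 instances
group.scale_vertically(8, 16)                           # vertical scaling: each instance 4C/8G -> 8C/16G
```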

Based on the above technical solution, a large amount of high-quality data can be fully utilized, a model can automatically generate reasonable scaling policies from the historical task profile, and the scaling policies can be enabled through interaction with the user, which solves the problems of cumbersome scaling policy configuration and large manual-intervention errors and improves the scaling effect. Moreover, formulating scaling policies from the historical task profile avoids the heavy data computation, cumbersome parameter conversion, and low accuracy of formulating scaling policies directly from raw monitoring data, making policy generation more efficient and reasonable.

Optionally, the historical task profile may be obtained through manual analysis of historical task data, or it may be generated from the historical task data by a machine learning model. In some implementations, generating the historical task profile based on the monitoring metric data in step S220 may specifically include: using a second model to generate the historical task profile based on the monitoring metric data, where the input of the second model is the monitoring metric data and the output of the second model is the historical task profile.

In the specific examples of the present application, the "second model" may be described as the "profile generation model", which will not be repeated below. It should be understood that the following are merely examples given for ease of description and understanding and do not constitute any limitation on the technical solution.

Optionally, the profile generation model may also be a trained machine learning model, where the input of the profile generation model may be the monitoring metric data and the output is the historical task profile. A wide range of machine learning algorithms can implement the profile generation model: commonly used deep learning techniques, such as gradient descent, backpropagation, and pooling; classification algorithms, such as k-nearest neighbors; or clustering algorithms, such as k-means.

Based on this solution, the second model extracts the historical task profile from data such as the resource utilization in the monitoring metric data, which is faster and more accurate than extracting the task profile through manual analysis.

In some implementations, before the monitoring metric data is input into the profile generation model, the monitoring metric data may also be preprocessed to obtain processed data. FIG. 3 provides a method for generating a historical task profile. As shown in FIG. 3, the above step S220 may specifically include the following steps:

S221: Preprocess the monitoring metric data to obtain processed data.

The preprocessing may include one or more of the following: normalization and vectorization.

S222: Use the second model to generate the historical task profile based on the processed data, where the input of the second model is the processed data and the output of the second model is the historical task profile.

基于本方案,画像生成模型可以基于预处理后的监控指标数据提取历史任务画像,有利于历史任务画像的生成更为高效和准确。Based on this solution, the portrait generation model can extract historical task portraits based on the preprocessed monitoring indicator data, which is conducive to more efficient and accurate generation of historical task portraits.
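For illustration only, the preprocessing of step S221 could look like the following sketch, which min-max normalizes each monitored metric and stacks the results into a feature matrix for the second model; the metric names and data layout are assumptions introduced for this example.

import numpy as np

def preprocess(metrics):
    # metrics: mapping from metric name (e.g. "cpu", "mem") to a list of raw samples
    columns = []
    for name in sorted(metrics):
        series = np.asarray(metrics[name], dtype=float)
        lo, hi = series.min(), series.max()
        # Min-max normalization; a constant series maps to all zeros.
        norm = (series - lo) / (hi - lo) if hi > lo else np.zeros_like(series)
        columns.append(norm)
    # Vectorization: one row per sampling instant, one column per metric.
    return np.stack(columns, axis=1)

processed = preprocess({"cpu": [35.0, 70.0, 90.0, 40.0], "mem": [2048, 3072, 4096, 2560]})
# "processed" can then be fed to the second (portrait generation) model.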

在一些实施方式中，在步骤S220生成历史任务画像后，还可以向用户呈现该历史任务画像，即向用户呈现该历史任务时序特征和/或历史任务资源特征。具体的，可以通过前端的用户界面来呈现该任务画像。基于该技术方案，在用户自行制定伸缩策略时，该历史任务画像能够辅助用户完成伸缩策略制定，从而有利于提升伸缩策略的伸缩效果。In some implementations, after the historical task portrait is generated in step S220, the historical task portrait can also be presented to the user, that is, the historical task timing characteristics and/or historical task resource characteristics can be presented to the user. Specifically, the task portrait can be presented through the front-end user interface. Based on this technical solution, when the user formulates a scaling strategy on his or her own, the historical task portrait can assist the user in completing the formulation of the scaling strategy, thereby helping to improve the scaling effect of the scaling strategy.

在一些实施方式中，上述步骤S230中的使用策略生成模型根据历史任务画像生成伸缩策略，具体可以使用策略生成模型根据历史任务画像确定定时伸缩策略的以下参数的一项或多项：伸缩时间，定时伸缩语义，以及定时伸缩规模。定时伸缩策略的示例："2023/03/01 00:00:00，增加3个实例"，另一个示例："2023/03/01 00:00:00，扩容3"。In some implementations, when the policy generation model is used in step S230 to generate a scaling policy based on the historical task profile, the policy generation model may specifically determine one or more of the following parameters of the timed scaling policy based on the historical task profile: scaling time, timed scaling semantics, and timed scaling scale. An example of a timed scaling policy is "2023/03/01 00:00:00, add 3 instances"; another example is "2023/03/01 00:00:00, expand capacity by 3".

在一些实施方式中，上述步骤S230中具体可以使用策略生成模型确定告警伸缩策略的以下参数的一项或多项：告警指标，告警阈值，告警伸缩语义，以及告警伸缩规模。可选的，告警阈值可以包括上限阈值和下限阈值。告警伸缩策略的示例："CPU使用率>70%，增加2个实例"；"CPU使用率<30%，减少2个实例"。In some implementations, the above step S230 may specifically use the policy generation model to determine one or more of the following parameters of the alarm scaling policy: alarm indicator, alarm threshold, alarm scaling semantics, and alarm scaling scale. Optionally, the alarm threshold may include an upper threshold and a lower threshold. Examples of alarm scaling policies: "CPU usage > 70%, add 2 instances"; "CPU usage < 30%, reduce 2 instances".

可选的,上述步骤S230中具体可以使用策略生成模型确定周期伸缩策略的以下参数的一项或多项:伸缩周期,周期伸缩语义,以及周期伸缩规模。周期伸缩策略的示例:“2023/03/01 00:00:00-2023/03/31 23:59:59,每天增加10%的实例”。Optionally, in the above step S230, the policy generation model may be used to determine one or more of the following parameters of the periodic scaling policy: scaling period, periodic scaling semantics, and periodic scaling scale. An example of a periodic scaling policy: "2023/03/01 00:00:00-2023/03/31 23:59:59, increase 10% of instances every day".
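To make the parameter lists above concrete, the three policy types could be represented as simple records such as the following sketch; the class and field names are assumptions chosen to mirror the parameters named in this description, not an actual platform interface.

from dataclasses import dataclass

@dataclass
class TimedScalingPolicy:
    scaling_time: str   # e.g. "2023/03/01 00:00:00"
    semantics: str      # e.g. "add"
    scale: int          # e.g. 3 instances

@dataclass
class AlarmScalingPolicy:
    metric: str             # alarm indicator, e.g. "cpu_utilization"
    upper_threshold: float  # e.g. 0.70 triggers scale-out
    lower_threshold: float  # e.g. 0.30 triggers scale-in
    semantics: str          # e.g. "add" / "remove"
    scale: int              # e.g. 2 instances

@dataclass
class PeriodicScalingPolicy:
    period_start: str   # e.g. "2023/03/01 00:00:00"
    period_end: str     # e.g. "2023/03/31 23:59:59"
    semantics: str      # e.g. "increase_percent"
    scale: float        # e.g. 10.0 (% of instances per day)

timed = TimedScalingPolicy("2023/03/01 00:00:00", "add", 3)
alarm = AlarmScalingPolicy("cpu_utilization", 0.70, 0.30, "add", 2)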

可选的,当业务负载难以预测时,用户可以选择告警策略,系统会根据实时的监控数据(如CPU使用率)触发伸缩活动,动态调整伸缩组内的实例数量或规格。Optionally, when the business load is difficult to predict, the user can select an alarm policy. The system will trigger scaling activities based on real-time monitoring data (such as CPU usage) and dynamically adjust the number or specifications of instances in the scaling group.

当业务负载的变化有规律时,用户可以选择定时策略或周期策略调整伸缩组内的实例数量或规格。When business load changes regularly, you can select a timing policy or a periodic policy to adjust the number or specifications of instances in the scaling group.

通过上述实施方式，具体细化了所需要策略生成模型确定的相关参数，能够使得后续伸缩组能够更准确的对实例进行调整，从而能够进一步提升伸缩效果。Through the above implementations, the parameters that the policy generation model needs to determine are specified in detail, so that the scaling group can subsequently adjust the instances more accurately, thereby further improving the scaling effect.

在一些实施方式中,步骤S260中,调整伸缩组中实例的数量或者实例的规格,具体可以有以下几种调整模式:直接调整模式、递进式调整模式、跟踪型调整模式。In some implementations, in step S260, the number of instances in the scaling group or the specifications of the instances are adjusted, and specifically there may be the following adjustment modes: direct adjustment mode, progressive adjustment mode, and tracking adjustment mode.

直接调整模式是指直接将伸缩组中实例的数量或者实例的规格调整至设定值。示例性的,伸缩策略包括“CPU使用率>70%,扩容3”,触发告警伸缩策略时,实例管理平台可以直接在伸缩组中增加3个实例。Direct adjustment mode refers to directly adjusting the number of instances in the scaling group or the specifications of the instances to the set value. For example, the scaling policy includes "CPU usage > 70%, expand by 3". When the alarm scaling policy is triggered, the instance management platform can directly add 3 instances to the scaling group.

递进式调整模式是指基于监控报警进行分段扩缩容,在直接调整模式的基础上增加了分步定义,可以精细地控制扩缩容。The progressive adjustment mode refers to segmented expansion and contraction based on monitoring alarms. It adds step-by-step definitions to the direct adjustment mode, allowing for fine-grained control of expansion and contraction.

跟踪型调整模式是指，可以选择一项监控指标数据，并指定目标值。实例管理平台会自动计算所需的实例数量并进行扩缩容，从而将监控指标数据维持在目标值附近。例如，伸缩策略包括"维持CPU使用率70%"，触发告警伸缩策略时，计算所需的实例数量并进行扩缩容，从而将CPU使用率维持在70%附近。Tracking adjustment mode means that a monitoring metric can be selected and a target value specified. The instance management platform automatically calculates the required number of instances and scales out or in accordingly, so as to keep the monitored metric near the target value. For example, if the scaling policy includes "maintain CPU utilization at 70%", then when the alarm scaling policy is triggered, the required number of instances is calculated and the group is scaled accordingly, so as to keep CPU utilization near 70%.
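One common way to realize such a tracking adjustment, shown here only as an illustrative sketch, is proportional target tracking, which assumes the monitored metric scales roughly linearly with the number of instances; the bounds and the function name are assumptions made for this example.

import math

def tracking_desired_instances(current_instances, current_metric, target_metric,
                               min_instances=1, max_instances=100):
    # e.g. 10 instances at 90% CPU with a 70% target -> ceil(10 * 0.90 / 0.70) = 13
    desired = math.ceil(current_instances * current_metric / target_metric)
    return max(min_instances, min(max_instances, desired))

print(tracking_desired_instances(10, 0.90, 0.70))  # 13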

当伸缩组执行伸缩操作后，伸缩组内的资源容量发生改变，可能会引起任务与资源实际负载能力不匹配。例如，实例数量增加后，任务因一些限流原因未能匹配资源实际负载能力，从而造成资源浪费。When a scaling group performs a scaling operation, the resource capacity in the scaling group changes, which may cause a mismatch between the task and the actual resource load capacity. For example, after the number of instances increases, the task may fail to match the actual resource load capacity for reasons such as throttling, resulting in resource waste.

可选的,任务下发时,可以根据任务的业务归属,例如流水线业务、大数据业务、模型训练业务等,来判断任务的类型。任务的类型也可以分为计算密集型、内存型、读/写密集型、图形密集型等。在一些实施方式中,任务可以优先调度给类型匹配的伸缩组。例如,计算密集型的任务可以优先调度给计算密集型的伸缩组,内存密集型的任务可以优先调度给内存密集型的伸缩组。Optionally, when a task is issued, the type of task can be determined based on the business affiliation of the task, such as pipeline business, big data business, model training business, etc. The type of task can also be divided into compute-intensive, memory-intensive, read/write-intensive, graphics-intensive, etc. In some implementations, tasks can be preferentially scheduled to scaling groups of matching types. For example, compute-intensive tasks can be preferentially scheduled to compute-intensive scaling groups, and memory-intensive tasks can be preferentially scheduled to memory-intensive scaling groups.

在一些实施方式中,实例管理平台还可以根据伸缩组中实例数量和规格计算伸缩组的剩余资源量。可选的,任务可以优先调度给剩余资源量较大的伸缩组。In some implementations, the instance management platform may also calculate the remaining resources of the scaling group according to the number and specifications of the instances in the scaling group. Optionally, tasks may be preferentially scheduled to scaling groups with larger remaining resources.

在一些实施方式中，实例管理平台可以根据伸缩组的资源类型和剩余资源量决定将任务挂起等待，或将任务在伸缩组之间进行调度。换句话说，实例管理平台可以根据伸缩组的资源类型和剩余资源量来对任务限流进行动态调整。任务下发时，可以优先将任务调度给与任务类型匹配、剩余资源量大的伸缩组，该伸缩组可以称为最优伸缩组。当与任务类型匹配的伸缩组的剩余资源量都比较小时，可以将任务进行挂起等待。In some implementations, the instance management platform can decide to suspend a task to wait, or to schedule the task between scaling groups, based on the resource type and remaining resources of the scaling groups. In other words, the instance management platform can dynamically adjust task throttling based on the resource type and remaining resources of the scaling groups. When a task is issued, it can be preferentially scheduled to a scaling group that matches the task type and has a large amount of remaining resources; this scaling group can be called the optimal scaling group. When all the scaling groups matching the task type have relatively little remaining resources, the task can be suspended to wait.
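The two rules above (match the task type first, then prefer the largest remaining resources, otherwise suspend) could be combined as in the following sketch; the dictionary layout, the units, and the demand threshold are assumptions introduced only for illustration.

def select_scaling_group(task_type, task_demand, groups):
    # groups: e.g. [{"name": "g1", "resource_type": "compute-intensive", "remaining": 8.0}, ...]
    # "remaining" would be derived from the instance count and instance specifications.
    matching = [g for g in groups if g["resource_type"] == task_type]
    candidates = [g for g in matching if g["remaining"] >= task_demand]
    if not candidates:
        return None  # suspend the task and wait, or retry after the next scaling activity
    return max(candidates, key=lambda g: g["remaining"])  # the "optimal" scaling group

best = select_scaling_group(
    "compute-intensive", 4.0,
    [{"name": "g1", "resource_type": "compute-intensive", "remaining": 8.0},
     {"name": "g2", "resource_type": "memory-intensive", "remaining": 32.0}])
# best -> the entry for "g1"; a None result means the task is suspended for now.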

基于上述实施方式，能够根据伸缩组中的资源类型和剩余资源量，动态调整任务执行限流参数，并选择最优的伸缩组完成任务下发，能够避免资源的浪费，提高伸缩组的任务执行效率。Based on the above implementations, the task execution throttling parameters can be dynamically adjusted according to the resource type and remaining resource amount in the scaling group, and the optimal scaling group can be selected to complete task delivery, which can avoid resource waste and improve the task execution efficiency of the scaling group.

如图4所示，为本申请实施例提供的一种客户端界面示意图。示例性的，用户可以点击"伸缩策略配置推荐"选项来查看当前策略以及推荐策略。其中，推荐策略1、推荐策略2可以为本申请策略推荐模型所生成的策略。可选的，用户可以点击"推荐策略1"选项查看伸缩策略的具体参数，例如点击推荐策略1后，界面上可以显示"CPU使用率>70%，扩容3"。FIG. 4 is a schematic diagram of a client interface provided in an embodiment of the present application. Exemplarily, the user can click the "Scaling Policy Configuration Recommendation" option to view the current policy and the recommended policies. Recommended policy 1 and recommended policy 2 can be policies generated by the policy recommendation model of the present application. Optionally, the user can click the "Recommended Policy 1" option to view the specific parameters of that scaling policy; for example, after recommended policy 1 is clicked, the interface can display "CPU usage > 70%, expand by 3".

可选的,实例管理平台可以对实例的各监控指标进行实时监测。正常作业时,各项性能指标在合理的范围内。Optionally, the instance management platform can monitor the instance's various monitoring indicators in real time. During normal operation, each performance indicator is within a reasonable range.

示例性的,用户可以点击“启用”选项来启用对应的推荐策略,例如用户点击推荐策略1对应的“启用”选项后,则当监测到实例CPU使用率大于70%时,则会触发该推荐策略1的伸缩活动,使得云平台执行该扩容操作,在伸缩组中增加3个实例,以避免任务执行异常。Exemplarily, the user can click the "Enable" option to enable the corresponding recommended strategy. For example, after the user clicks the "Enable" option corresponding to recommended strategy 1, when the instance CPU usage is monitored to be greater than 70%, the scaling activity of recommended strategy 1 will be triggered, causing the cloud platform to execute the expansion operation and add 3 instances to the scaling group to avoid task execution exceptions.

示例性的,推荐策略2可以为“CPU使用率<30%,缩容1”,用户点击推荐策略2对应的“启用”选项后,则当监测到CPU使用率小于30%时,则会触发该推荐策略2的伸缩活动,以减少资源浪费,降低运营成本。Exemplarily, the recommended strategy 2 can be "CPU usage <30%, scale down 1". After the user clicks the "Enable" option corresponding to the recommended strategy 2, when the CPU usage is monitored to be less than 30%, the scaling activity of the recommended strategy 2 will be triggered to reduce resource waste and reduce operating costs.
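Purely as an illustration of how the two enabled recommended policies above could be evaluated against live monitoring data, consider the following sketch; the function and the fixed thresholds simply mirror the examples in this description and are not an actual platform API.

def evaluate_alarm_policies(cpu_utilization, current_instances):
    # Recommended policy 1 (enabled): CPU usage > 70% -> scale out by 3 instances.
    if cpu_utilization > 0.70:
        return current_instances + 3
    # Recommended policy 2 (enabled): CPU usage < 30% -> scale in by 1 instance.
    if cpu_utilization < 0.30:
        return max(1, current_instances - 1)
    return current_instances  # metrics within the normal range: no scaling activity

print(evaluate_alarm_policies(0.85, 5))  # 8
print(evaluate_alarm_policies(0.20, 5))  # 4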

上文详细地描述了本申请实施例的方法实施例,下面描述本申请实施例的装置实施例,装置实施例与方法实施例相互对应,因此装置实施例中未详细描述的部分可参见前面方法实施例。The method embodiment of the embodiment of the present application is described in detail above, and the device embodiment of the embodiment of the present application is described below. The device embodiment and the method embodiment correspond to each other, so the parts not described in detail in the device embodiment can refer to the previous method embodiment.

图5示出了本申请实施例提供的一种实例管理平台500的示意性结构框图。该实例管理平台可以包括:监控模块510,用于监控伸缩组中运行任务的实例,以得到实例的资源利用率,资源利用率包括以下的一种或多种:CPU使用率、内存使用率;策略生成模块520,用于基于资源利用率生成历史任务画像,历史任务画像包括以下的一种或多种:历史任务时序特征、历史任务资源特征,其中,历史任务时序特征用于指示历史任务在伸缩组运行时的时序上的特性,历史任务资源特征用于指示伸缩组的资源类型,资源类型可以包括如下的一种或多种:计算密集型、内存型、输入输出密集型、图形密集型;策略生成模块520还用于,使用第一模型根据历史任务画像生成一个或多个伸缩策略,第一模型的输入为历史任务画像,第一模型的输出为一个或多个伸缩策略;执行模块530,用于向用户推荐一个或多个伸缩策略;执行模块530还用于,确定用户选择的伸缩策略;执行模块530还用于,根据用户选择的伸缩策略,调整伸缩组中实例的数量或者实例的规格。FIG5 shows a schematic structural block diagram of an instance management platform 500 provided in an embodiment of the present application. The instance management platform may include: a monitoring module 510, which is used to monitor the instance of the running task in the scaling group to obtain the resource utilization of the instance, and the resource utilization includes one or more of the following: CPU utilization and memory utilization; a policy generation module 520, which is used to generate a historical task portrait based on the resource utilization, and the historical task portrait includes one or more of the following: historical task timing characteristics and historical task resource characteristics, wherein the historical task timing characteristics are used to indicate the characteristics of the historical task in the timing when the scaling group is running, and the historical task resource characteristics are used to indicate the resource type of the scaling group, and the resource type may include one or more of the following: computing intensive, memory intensive, input and output intensive, and graphics intensive; the policy generation module 520 is also used to generate one or more scaling policies according to the historical task portrait using a first model, the input of the first model is the historical task portrait, and the output of the first model is one or more scaling policies; an execution module 530, which is used to recommend one or more scaling policies to a user; the execution module 530 is also used to determine the scaling policy selected by the user; the execution module 530 is also used to adjust the number of instances or the specifications of the instances in the scaling group according to the scaling policy selected by the user.

基于上述技术方案,策略生成模块520能够充分利用大量优质的数据,使用模型根据历史任务画像自动生成合理的伸缩策略,并且能够通过执行模块530与用户进行交互来启用伸缩策略,从而解决了伸缩策略配置繁琐、人工干预误差大的问题,能够提升伸缩效果;并且,策略生成模块520根据历史任务画像来制定伸缩策略,减少直接根据监控的数据来制定伸缩策略所带来的数据计算量大,参数转换繁琐、准确率低等问题,使得伸缩策略的生成更为高效和合理。Based on the above technical solution, the strategy generation module 520 can make full use of a large amount of high-quality data, use the model to automatically generate a reasonable scaling strategy according to the historical task portrait, and can interact with the user through the execution module 530 to enable the scaling strategy, thereby solving the problems of cumbersome scaling strategy configuration and large errors in manual intervention, and can improve the scaling effect; in addition, the strategy generation module 520 formulates the scaling strategy according to the historical task portrait, reducing the large amount of data calculation, cumbersome parameter conversion, low accuracy and other problems brought about by directly formulating the scaling strategy based on the monitored data, thereby making the generation of the scaling strategy more efficient and reasonable.
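The division of labor among the three modules of instance management platform 500 can be pictured with the following skeleton; the class and method names are illustrative assumptions rather than the claimed structure.

class MonitoringModule:
    def collect(self, scaling_group):
        """Return CPU and memory utilization for the instances running tasks."""
        raise NotImplementedError

class PolicyGenerationModule:
    def build_portrait(self, utilization):
        """Historical task portrait: timing features and resource features."""
        raise NotImplementedError
    def generate_policies(self, portrait):
        """First model: portrait in, one or more scaling policies out."""
        raise NotImplementedError

class ExecutionModule:
    def recommend(self, policies):
        """Present candidate scaling policies to the user."""
        raise NotImplementedError
    def apply(self, scaling_group, selected_policy):
        """Adjust the number or specifications of instances in the group."""
        raise NotImplementedError

class InstanceManagementPlatform:
    def __init__(self):
        self.monitoring = MonitoringModule()
        self.policy_generation = PolicyGenerationModule()
        self.execution = ExecutionModule()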

在一些可能的实施方式中,伸缩策略可以包括以下一种或多种:定时伸缩策略、告警伸缩策略,策略生成模块具体用于,使用第一模型根据历史任务画像确定定时伸缩策略的以下一种或多种参数:伸缩时间,定时伸缩语义,以及定时伸缩规模;使用第一模型根据历史任务画像确定告警伸缩策略的以下一种或多种参数:告警指标,告警阈值,以及告警伸缩语义,以及告警伸缩规模。In some possible implementations, the scaling strategy may include one or more of the following: a scheduled scaling strategy and an alarm scaling strategy. The strategy generation module is specifically used to use the first model to determine one or more of the following parameters of the scheduled scaling strategy based on the historical task portrait: scaling time, scheduled scaling semantics, and scheduled scaling scale; use the first model to determine one or more of the following parameters of the alarm scaling strategy based on the historical task portrait: alarm indicator, alarm threshold, alarm scaling semantics, and alarm scaling scale.

基于上述实施方式，具体细化了所需要策略生成模型确定的相关参数，能够使得后续伸缩组能够更准确的对实例进行调整，从而能够进一步提升伸缩效果。Based on the above implementations, the parameters that the policy generation model needs to determine are specified in detail, so that the scaling group can subsequently adjust the instances more accurately, thereby further improving the scaling effect.

在一些可能的实施方式中,执行模块530还可以用于:根据实例的数量和实例的规格确定剩余资源量;根据伸缩组的资源类型和剩余资源量决定将任务挂起等待,或将任务在伸缩组之间进行调度。In some possible implementations, the execution module 530 may also be used to: determine the remaining resource amount based on the number of instances and the instance specifications; decide to suspend the task or schedule the task between scaling groups based on the resource type and remaining resource amount of the scaling group.

基于上述实施方式，能够根据伸缩组中的资源类型和剩余资源量，动态调整任务执行限流参数，并选择最优的伸缩组完成任务下发，能够减小资源的浪费，提高伸缩组的任务执行效率。Based on the above implementations, the task execution throttling parameters can be dynamically adjusted according to the resource type and remaining resource amount in the scaling group, and the optimal scaling group can be selected to complete task delivery, which can reduce resource waste and improve the task execution efficiency of the scaling group.

在一些实施方式中,执行模块530还可以向用户呈现历史任务时序特征和历史任务资源特征。In some implementations, the execution module 530 may also present historical task timing characteristics and historical task resource characteristics to the user.

基于该技术方案,在用户自行制定伸缩策略时,历史任务画像能够辅助用户完成伸缩策略制定,从而有利于提升伸缩策略的伸缩效果。Based on this technical solution, when users formulate scaling strategies themselves, historical task portraits can assist users in completing scaling strategy formulation, thereby helping to improve the scaling effect of the scaling strategy.

在一些实施方式中,策略生成模块520具体可以用于:对资源利用率进行预处理以得到处理数据,预处理包括以下一种或多种:归一化、向量化;使用第二模型根据处理数据生成历史任务画像,第二模型的输入为处理数据,第二模型的输出为历史任务画像。In some embodiments, the strategy generation module 520 can be specifically used to: preprocess resource utilization to obtain processing data, the preprocessing including one or more of the following: normalization, vectorization; use a second model to generate a historical task portrait based on the processing data, the input of the second model is the processing data, and the output of the second model is the historical task portrait.

基于本方案,策略生成模块520能够使用第二模型根据资源利用率等数据来对历史任务画像进行提取,相比于人工分析提取任务画像,能够更快、更准确地提取出历史任务画像;并且第二模型可以基于预处理后的处理数据提取历史任务画像,有利于历史任务画像的生成更为高效和准确。Based on this solution, the strategy generation module 520 can use the second model to extract historical task portraits based on data such as resource utilization. Compared with manual analysis to extract task portraits, historical task portraits can be extracted faster and more accurately; and the second model can extract historical task portraits based on pre-processed processing data, which is conducive to more efficient and accurate generation of historical task portraits.

本申请的在一些实施方式中,历史任务时序特征可以包括以下的一种或多种:历史任务平均运行时长、历史任务量高峰时段、历史任务量低谷时段、历史任务执行周期。In some implementations of the present application, the historical task timing characteristics may include one or more of the following: average running time of historical tasks, peak time periods of historical task volume, trough time periods of historical task volume, and historical task execution cycles.

其中,监控模块、策略生成模块和执行模块均可以通过软件实现,或者可以通过硬件实现。示例性的,接下来以监控模块为例,介绍监控模块的实现方式。类似的,策略生成模块和执行模块的实现方式可以参考监控模块的实现方式。Among them, the monitoring module, the policy generation module and the execution module can all be implemented by software, or can be implemented by hardware. Exemplarily, the implementation of the monitoring module is introduced below by taking the monitoring module as an example. Similarly, the implementation of the policy generation module and the execution module can refer to the implementation of the monitoring module.

模块作为软件功能单元的一种举例,监控模块可以包括运行在计算实例上的代码。其中,计算实例可以包括物理主机(计算设备)、虚拟机、容器中的至少一种。进一步地,上述计算实例可以是一台或者多台。例如,监控模块可以包括运行在多个主机/虚拟机/容器上的代码。需要说明的是,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的区域(region)中,也可以分布在不同的region中。进一步地,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的可用区(availability zone,AZ)中,也可以分布在不同的AZ中,每个AZ包括一个数据中心或多个地理位置相近的数据中心。其中,通常一个region可以包括多个AZ。As an example of a software functional unit, the monitoring module may include code running on a computing instance. Among them, the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, the above-mentioned computing instance may be one or more. For example, the monitoring module may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Furthermore, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, each AZ including one data center or multiple data centers with similar geographical locations. Among them, usually a region may include multiple AZs.

同样,用于运行该代码的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个VPC设置在一个region内,同一region内两个VPC之间,以及不同region的VPC之间跨区通信需在每个VPC内设置通信网关,经通信网关实现VPC之间的互连。Similarly, multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs. Usually, a VPC is set up in a region. For cross-region communication between two VPCs in the same region and between VPCs in different regions, a communication gateway needs to be set up in each VPC to achieve interconnection between VPCs through the communication gateway.

模块作为硬件功能单元的一种举例，监控模块可以包括至少一个计算设备，如服务器等。或者，监控模块也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中，上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。As an example of a hardware functional unit, the monitoring module may include at least one computing device, such as a server, etc. Alternatively, the monitoring module may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.

监控模块包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。监控模块包括的多个计算设备可以分布在相同的AZ中,也可以分布在不同的AZ中。同样,监控模块包括的多个计算设备可以分布在同一个VPC中,也可以分布在多个VPC中。其中,所述多个计算设备可以是服务器、ASIC、PLD、CPLD、FPGA和GAL等计算设备的任意组合。The multiple computing devices included in the monitoring module can be distributed in the same region or in different regions. The multiple computing devices included in the monitoring module can be distributed in the same AZ or in different AZs. Similarly, the multiple computing devices included in the monitoring module can be distributed in the same VPC or in multiple VPCs. The multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

需要说明的是,在其他实施例中,监控模块可以用于执行管理实例的方法中的任意步骤,策略生成模块可以用于执行管理实例的方法中的任意步骤,执行模块可以用于执行管理实例的方法中的任意步骤,监控模块、策略生成模块、以及执行模块负责实现的步骤可根据需要指定,通过监控模块、策略生成模块、以及执行模块分别实现管理实例的方法中不同的步骤来实现实例管理平台的全部功能。It should be noted that, in other embodiments, the monitoring module can be used to execute any step in the method for managing an instance, the policy generation module can be used to execute any step in the method for managing an instance, and the execution module can be used to execute any step in the method for managing an instance. The steps that the monitoring module, the policy generation module, and the execution module are responsible for implementing can be specified as needed. The full functions of the instance management platform are realized by respectively implementing different steps in the method for managing an instance through the monitoring module, the policy generation module, and the execution module.

本申请还提供一种计算设备600。如图6所示，计算设备600包括：总线602、处理器604、存储器606和通信接口608。处理器604、存储器606和通信接口608之间通过总线602通信。计算设备600可以是服务器或终端设备。应理解，本申请不限定计算设备600中的处理器、存储器的个数。The present application also provides a computing device 600. As shown in FIG. 6, the computing device 600 includes: a bus 602, a processor 604, a memory 606, and a communication interface 608. The processor 604, the memory 606, and the communication interface 608 communicate with each other through the bus 602. The computing device 600 can be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in the computing device 600.

总线602可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示，图6中仅用一条线表示，但并不表示仅有一根总线或一种类型的总线。总线602可包括在计算设备600各个部件（例如，存储器606、处理器604、通信接口608）之间传送信息的通路。The bus 602 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one line is used in FIG. 6, but this does not mean that there is only one bus or only one type of bus. The bus 602 may include a path for transmitting information between various components of the computing device 600 (e.g., the memory 606, the processor 604, and the communication interface 608).

处理器604可以包括中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。The processor 604 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

存储器606可以包括易失性存储器(volatile memory)，例如随机存取存储器(random access memory,RAM)。存储器606还可以包括非易失性存储器(non-volatile memory)，例如只读存储器(read-only memory,ROM)，快闪存储器，机械硬盘(hard disk drive,HDD)或固态硬盘(solid state drive,SSD)。The memory 606 may include a volatile memory, such as a random access memory (RAM). The memory 606 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

存储器606中存储有可执行的程序代码,处理器604执行该可执行的程序代码以分别实现前述监控模块、策略生成模块、执行模块的功能,从而实现上述管理实例的方法。也即,存储器606上存有用于执行上述管理实例的方法的指令。The memory 606 stores executable program codes, and the processor 604 executes the executable program codes to respectively implement the functions of the aforementioned monitoring module, policy generation module, and execution module, thereby implementing the aforementioned method for managing instances. That is, the memory 606 stores instructions for executing the aforementioned method for managing instances.

通信接口608使用例如但不限于网络接口卡、收发器一类的收发模块,来实现计算设备600与其他设备或通信网络之间的通信。The communication interface 608 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 600 and other devices or communication networks.

本申请实施例还提供了一种计算设备集群。该计算设备集群包括至少一台计算设备。该计算设备可以是服务器,例如是中心服务器、边缘服务器,或者是本地数据中心中的本地服务器。在一些实施例中,计算设备也可以是台式机、笔记本电脑或者智能手机等终端设备。The embodiment of the present application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.

如图7所示,计算设备集群包括至少一个计算设备600。计算设备集群中的一个或多个计算设备600中的存储器606中可以存有相同的用于执行上述管理实例的方法的指令。As shown in Fig. 7, the computing device cluster includes at least one computing device 600. The memory 606 in one or more computing devices 600 in the computing device cluster may store the same instructions for executing the above-mentioned method for managing instances.

在一些可能的实现方式中,该计算设备集群中的一个或多个计算设备600的存储器606中也可以分别存有用于执行上述管理实例的方法的部分指令。换言之,一个或多个计算设备600的组合可以共同执行用于执行上述管理实例的方法的指令。In some possible implementations, the memory 606 of one or more computing devices 600 in the computing device cluster may also store partial instructions for executing the above-mentioned method for managing instances. In other words, the combination of one or more computing devices 600 may jointly execute instructions for executing the above-mentioned method for managing instances.

需要说明的是，计算设备集群中的不同的计算设备600中的存储器606可以存储不同的指令，分别用于执行上述实例管理平台的部分功能。也即，不同的计算设备600中的存储器606存储的指令可以实现监控模块、策略生成模块、执行模块中的一个或多个模块的功能。It should be noted that the memory 606 in different computing devices 600 in the computing device cluster can store different instructions, which are respectively used to execute part of the functions of the above-mentioned instance management platform. That is, the instructions stored in the memory 606 in different computing devices 600 can implement the functions of one or more modules among the monitoring module, the policy generation module, and the execution module.

在一些可能的实现方式中,计算设备集群中的一个或多个计算设备可以通过网络连接。其中,所述网络可以是广域网或局域网等等。图8示出了一种可能的实现方式。如图8所示,两个计算设备600A和600B之间通过网络进行连接。具体地,通过各个计算设备中的通信接口与所述网络进行连接。在这一类可能的实现方式中,计算设备600A中的存储器606中存有执行监控模块和执行模块的功能的指令。同时,计算设备600B中的存储器606中存有执行策略生成模块的功能的指令。In some possible implementations, one or more computing devices in the computing device cluster can be connected via a network. Wherein, the network can be a wide area network or a local area network, etc. FIG. 8 shows a possible implementation. As shown in FIG. 8 , two computing devices 600A and 600B are connected via a network. Specifically, the network is connected via a communication interface in each computing device. In this type of possible implementation, the memory 606 in the computing device 600A stores instructions for executing the functions of the monitoring module and the execution module. At the same time, the memory 606 in the computing device 600B stores instructions for executing the functions of the strategy generation module.

图8所示的计算设备集群之间的连接方式可以是考虑到本申请提供的管理实例的方法模型训练需要大量地计算，因此考虑将策略生成模块实现的功能交由计算设备600B执行。The connection between the computing devices of the cluster shown in FIG. 8 may take into account that training the models used in the method for managing instances provided in this application requires a large amount of computation; it is therefore considered to have the functions implemented by the policy generation module executed by the computing device 600B.

应理解,图8中示出的计算设备600A的功能也可以由多个计算设备600完成。同样,计算设备600B的功能也可以由多个计算设备600完成。It should be understood that the functions of the computing device 600A shown in FIG8 may also be completed by multiple computing devices 600. Similarly, the functions of the computing device 600B may also be completed by multiple computing devices 600.

本申请实施例还提供了一种包含指令的计算机程序产品。所述计算机程序产品可以是包含指令的,能够运行在计算设备上或被储存在任何可用介质中的软件或程序产品。当所述计算机程序产品在至少一个计算设备上运行时,使得至少一个计算设备执行上述管理实例的方法。The embodiment of the present application also provides a computer program product including instructions. The computer program product may be software or a program product including instructions that can be run on a computing device or stored in any available medium. When the computer program product is run on at least one computing device, the at least one computing device executes the above-mentioned method for managing instances.

本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令,所述指令指示计算设备执行上述管理实例的方法。The embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be stored by a computing device or a data storage device such as a data center containing one or more available media. The available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state hard disk). The computer-readable storage medium includes instructions that instruct the computing device to execute the above-mentioned method of managing instances.

本申请实施例还提供一种芯片,该芯片包括处理器与数据接口,该处理器通过该数据接口读取存储器上存储的指令,以执行上述管理实例的方法。An embodiment of the present application also provides a chip, which includes a processor and a data interface. The processor reads instructions stored in a memory through the data interface to execute the above-mentioned method for managing instances.

可选的，该芯片包括处理器与数据接口，该处理器通过该数据接口读取存储器上存储的指令，执行上述管理实例的方法。Optionally, the chip includes a processor and a data interface, and the processor reads instructions stored in the memory through the data interface to execute the above-mentioned method for managing instances.

可选的,该芯片还可以包括存储器,该存储器中存储有指令,该处理器用于执行该存储器上存储的指令,当该指令被运行时,该处理器用于执行上述管理实例的方法。Optionally, the chip may further include a memory storing instructions, and the processor is used to execute the instructions stored in the memory. When the instructions are executed, the processor is used to execute the above-mentioned method for managing instances.

最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的保护范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.

Claims (15)

1.一种管理实例的方法,其特征在于,包括:1. A method for managing an instance, comprising: 实例管理平台监控伸缩组中运行任务的实例,以得到所述实例的资源利用率,所述资源利用率包括以下的一种或多种:中央处理器CPU使用率、内存使用率;The instance management platform monitors the instances running tasks in the scaling group to obtain resource utilization of the instances, where the resource utilization includes one or more of the following: CPU utilization, memory utilization; 所述实例管理平台基于所述资源利用率生成历史任务画像,所述历史任务画像包括以下的一种或多种:历史任务时序特征、历史任务资源特征,其中,所述历史任务时序特征用于指示历史任务在所述伸缩组运行时的时序上的特性,所述历史任务资源特征用于指示所述伸缩组的资源类型,所述资源类型包括如下的一种或多种:计算密集型、内存型、输入输出密集型、图形密集型;The instance management platform generates a historical task profile based on the resource utilization, where the historical task profile includes one or more of the following: a historical task timing feature and a historical task resource feature, where the historical task timing feature is used to indicate a characteristic of a historical task in terms of timing when the scaling group is running, and the historical task resource feature is used to indicate a resource type of the scaling group, where the resource type includes one or more of the following: computing intensive, memory intensive, input/output intensive, and graphics intensive; 所述实例管理平台使用第一模型根据所述历史任务画像生成一个或多个伸缩策略,所述第一模型的输入为所述历史任务画像,所述第一模型的输出为所述一个或多个伸缩策略;The instance management platform generates one or more scaling policies according to the historical task profile using a first model, where an input of the first model is the historical task profile, and an output of the first model is the one or more scaling policies; 所述实例管理平台向用户推荐所述一个或多个伸缩策略;The instance management platform recommends the one or more scaling strategies to the user; 所述实例管理平台确定用户选择的伸缩策略;The instance management platform determines the scaling strategy selected by the user; 所述实例管理平台根据所述用户选择的伸缩策略,调整所述伸缩组中实例的数量或者实例的规格。The instance management platform adjusts the number of instances or the specifications of the instances in the scaling group according to the scaling policy selected by the user. 2.根据权利要求1所述的方法,其特征在于,所述伸缩策略包括以下一种或多种:定时伸缩策略、告警伸缩策略,2. The method according to claim 1, wherein the scaling strategy includes one or more of the following: a timing scaling strategy, an alarm scaling strategy, 所述使用第一模型根据所述历史任务画像生成一个或多个伸缩策略,包括:The using the first model to generate one or more scaling strategies according to the historical task profile includes: 使用所述第一模型根据所述历史任务画像确定所述定时伸缩策略的以下一种或多种参数:伸缩时间、定时伸缩语义、定时伸缩规模;Determine one or more of the following parameters of the scheduled scaling strategy according to the historical task profile using the first model: scaling time, scheduled scaling semantics, and scheduled scaling scale; 使用所述第一模型根据所述历史任务画像确定所述告警伸缩策略的以下一种或多种参数:告警指标、告警阈值、告警伸缩语义、告警伸缩规模。The first model is used to determine one or more of the following parameters of the alarm scaling strategy according to the historical task portrait: alarm indicator, alarm threshold, alarm scaling semantics, and alarm scaling scale. 3.根据权利要求1或2所述的方法,其特征在于,所述方法还包括:3. The method according to claim 1 or 2, characterized in that the method further comprises: 所述实例管理平台根据所述实例的数量和所述实例的规格确定所述伸缩组的剩余资源量;The instance management platform determines the remaining resources of the scaling group according to the number of instances and specifications of the instances; 所述实例管理平台根据所述伸缩组的所述资源类型和所述剩余资源量决定将任务挂起等待,或将任务在伸缩组之间进行调度。The instance management platform decides to suspend the task or schedule the task between scaling groups according to the resource type and the remaining resource amount of the scaling group. 4.根据权利要求1-3中任一项所述的方法,其特征在于,所述方法还包括:4. 
The method according to any one of claims 1 to 3, characterized in that the method further comprises: 所述实例管理平台向用户呈现所述历史任务时序特征和所述历史任务资源特征。The instance management platform presents the historical task timing characteristics and the historical task resource characteristics to the user. 5.根据权利要求1-4中任一项所述的方法,其特征在于,所述基于所述资源利用率生成历史任务画像,包括:5. The method according to any one of claims 1 to 4, characterized in that generating a historical task profile based on the resource utilization comprises: 对所述资源利用率进行预处理以得到处理数据,所述预处理包括以下一种或多种:归一化、向量化;Preprocessing the resource utilization to obtain processing data, wherein the preprocessing includes one or more of the following: normalization and vectorization; 使用第二模型根据所述处理数据生成所述历史任务画像,所述第二模型的输入为所述处理数据,所述第二模型的输出为所述历史任务画像。A second model is used to generate the historical task portrait according to the processed data, wherein the input of the second model is the processed data, and the output of the second model is the historical task portrait. 6.根据权利要求1-5中任一项所述的方法,其特征在于,所述历史任务时序特征包括以下的一种或多种:历史任务平均运行时长、历史任务量高峰时段、历史任务量低谷时段、历史任务执行周期。6. The method according to any one of claims 1-5 is characterized in that the historical task timing characteristics include one or more of the following: average running time of historical tasks, peak period of historical task volume, trough period of historical task volume, and historical task execution cycle. 7.一种实例管理平台,其特征在于,包括:7. An instance management platform, characterized by comprising: 监控模块,用于监控伸缩组中运行任务的实例,以得到所述实例的资源利用率,所述资源利用率包括以下的一种或多种:中央处理器CPU使用率、内存使用率;A monitoring module is used to monitor the instances of running tasks in the scaling group to obtain resource utilization of the instances, where the resource utilization includes one or more of the following: CPU utilization and memory utilization; 策略生成模块,用于基于所述资源利用率生成历史任务画像,所述历史任务画像包括以下的一种或多种:历史任务时序特征、历史任务资源特征,其中,所述历史任务时序特征用于指示历史任务在所述伸缩组运行时的时序上的特性,所述历史任务资源特征用于指示所述伸缩组的资源类型,所述资源类型包括如下的一种或多种:计算密集型、内存型、输入输出密集、图形密集型;A policy generation module, configured to generate a historical task profile based on the resource utilization, wherein the historical task profile includes one or more of the following: a historical task timing feature and a historical task resource feature, wherein the historical task timing feature is used to indicate a characteristic of a historical task in terms of timing when the scaling group is running, and the historical task resource feature is used to indicate a resource type of the scaling group, wherein the resource type includes one or more of the following: computing intensive, memory intensive, input/output intensive, and graphics intensive; 所述策略生成模块还用于,使用第一模型根据所述历史任务画像生成一个或多个伸缩策略,所述第一模型的输入为所述历史任务画像,所述第一模型的输出为所述一个或多个伸缩策略;The strategy generation module is further used to generate one or more scaling strategies according to the historical task portrait using a first model, where an input of the first model is the historical task portrait, and an output of the first model is the one or more scaling strategies; 执行模块,用于向用户推荐所述一个或多个伸缩策略;An execution module, configured to recommend the one or more scaling strategies to a user; 所述执行模块还用于,确定用户选择的伸缩策略;The execution module is also used to determine the scaling strategy selected by the user; 所述执行模块还用于,根据所述用户选择的伸缩策略,调整所述伸缩组中实例的数量或者实例的规格。The execution module is further configured to adjust the number of instances or the specifications of the instances in the scaling group according to the scaling policy selected by the user. 8.根据权利要求7所述的实例管理平台,其特征在于,所述伸缩策略包括以下一种或多种:定时伸缩策略、告警伸缩策略,8. 
The instance management platform according to claim 7, wherein the scaling strategy comprises one or more of the following: a timed scaling strategy, an alarm scaling strategy, 所述策略生成模块具体用于,使用所述第一模型根据所述历史任务画像确定所述定时伸缩策略的以下一种或多种参数:伸缩时间、定时伸缩语义、定时伸缩规模;The strategy generation module is specifically used to determine one or more of the following parameters of the timed scaling strategy according to the historical task portrait using the first model: scaling time, timed scaling semantics, and timed scaling scale; 使用所述第一模型根据所述历史任务画像确定所述告警伸缩策略的以下一种或多种参数:告警指标、告警阈值、告警伸缩语义、告警伸缩规模。The first model is used to determine one or more of the following parameters of the alarm scaling strategy according to the historical task portrait: alarm indicator, alarm threshold, alarm scaling semantics, and alarm scaling scale. 9.根据权利要求7或8所述的实例管理平台,其特征在于,所述执行模块还用于:9. The instance management platform according to claim 7 or 8, wherein the execution module is further used for: 根据所述实例的数量和所述实例的规格确定所述伸缩组的剩余资源量;Determining the remaining resources of the scaling group according to the number of instances and specifications of the instances; 根据所述伸缩组的所述资源类型和所述剩余资源量决定将任务挂起等待,或将任务在伸缩组之间进行调度。It is determined according to the resource type and the remaining resource amount of the scaling group whether to suspend the task and wait, or to schedule the task between scaling groups. 10.根据权利要求7-9中任一项所述的实例管理平台,其特征在于,所述执行模块还用于:10. The instance management platform according to any one of claims 7 to 9, wherein the execution module is further used for: 向用户呈现所述历史任务时序特征和所述历史任务资源特征。The historical task timing characteristics and the historical task resource characteristics are presented to the user. 11.根据权利要求7-10中任一项所述的实例管理平台,其特征在于,所述策略生成模块具体用于:11. The instance management platform according to any one of claims 7 to 10, wherein the policy generation module is specifically used to: 对所述资源利用率进行预处理以得到处理数据,所述预处理包括以下一种或多种:归一化、向量化;Preprocessing the resource utilization to obtain processing data, wherein the preprocessing includes one or more of the following: normalization and vectorization; 使用第二模型根据所述处理数据生成所述历史任务画像,所述第二模型的输入为所述处理数据,所述第二模型的输出为所述历史任务画像。A second model is used to generate the historical task portrait according to the processed data, wherein the input of the second model is the processed data, and the output of the second model is the historical task portrait. 12.根据权利要求7-11中任一项所述的实例管理平台,其特征在于,所述历史任务时序特征包括以下的一种或多种:历史任务平均运行时长、历史任务量高峰时段、历史任务量低谷时段、历史任务执行周期。12. An instance management platform according to any one of claims 7-11, characterized in that the historical task timing characteristics include one or more of the following: average running time of historical tasks, peak period of historical task volume, trough period of historical task volume, and historical task execution cycle. 13.一种计算设备集群,其特征在于,包括至少一个计算设备,每个计算设备包括处理器和存储器;13. A computing device cluster, comprising at least one computing device, each computing device comprising a processor and a memory; 所述至少一个计算设备的处理器用于执行所述至少一个计算设备的存储器中存储的指令,以使得所述计算设备集群执行如权利要求1至6中任一项所述的方法。The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method according to any one of claims 1 to 6. 14.一种计算机可读存储介质,其特征在于,包括计算机程序指令,当所述计算机指令由计算设备集群运行时,使得所述计算设备集群执行如权利要求1至6中任一项所述的方法。14. 
A computer-readable storage medium, characterized in that it comprises computer program instructions, and when the computer instructions are executed by a computing device cluster, the computing device cluster executes the method according to any one of claims 1 to 6. 15.一种包含指令的计算机程序产品,其特征在于,当所述指令被计算设备集群运行时,使得所述计算设备集群执行如权利要求1至6中任一项所述的方法。15 . A computer program product comprising instructions, wherein when the instructions are executed by a computing device cluster, the computing device cluster executes the method according to claim 1 .