CN116166465A - Cluster operation and maintenance method and device based on management plane cluster - Google Patents
Cluster operation and maintenance method and device based on management plane cluster
- Publication number
- CN116166465A (application number CN202310188510.5A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- resource
- task
- node
- deployment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computer Hardware Design (AREA)
- Hardware Redundancy (AREA)
Abstract
The disclosure provides a cluster operation and maintenance method based on a management plane cluster, relates to the technical field of cloud computing, and can be applied to the technical field of financial technology. The method comprises: in response to a change event of a cluster resource object, acquiring cluster deployment resource information; parsing the cluster deployment resource information to determine a cluster change type and the machine resources required by the change; creating a first task resource according to the cluster change type and the machine resources required by the change; and sending the first task resource to a task execution node. The disclosure also provides a cluster operation and maintenance device, storage medium and program product based on the management plane cluster.
Description
Technical Field
The disclosure relates to the technical field of cloud computing, in particular to the technical field of Kubernetes cluster resource management, and more particularly to a cluster operation and maintenance method, device, equipment, storage medium and program product based on a management plane cluster.
Background
With the development and popularization of container engines and container scheduling platforms such as Kubernetes (hereinafter referred to as K8s) and Docker, a K8s cluster provides Namespace-level isolation for users and theoretically supports no more than 5,000 Nodes and 150,000 Pods. Multiple K8s clusters solve the resource isolation and fault isolation problems of a single cluster and break the limit on the number of supportable Nodes and Pods, but they also increase the complexity of cluster management; especially in proprietary cloud scenarios, K8s engineers cannot reach the customer environment as quickly as in public clouds, so the operation and maintenance cost is further amplified. Therefore, how to manage multiple K8s clusters automatically, efficiently and at low cost has become a technical problem to be solved.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a low cost, efficient, automated management plane cluster-based cluster operation and maintenance method, apparatus, device, medium, and program product.
According to a first aspect of the present disclosure, there is provided a cluster operation and maintenance method based on a management plane cluster, the method comprising:
responding to a change event of a cluster resource object, and acquiring cluster deployment resource information, wherein the cluster deployment resource information is preconfigured through a management plane cluster;
analyzing the cluster deployment resource information to determine a cluster change type and machine resources required by the change;
creating a first task resource according to the cluster change type and the machine resources required by the change; and
sending the first task resource to a task execution node.
According to an embodiment of the present disclosure, the method further comprises:
periodically monitoring node states and cluster states;
after determining that a cluster state is abnormal, determining the machine resources defined by the cluster deployment resource of the failed cluster;
rebuilding the failed cluster according to the machine resources defined by the cluster deployment resource; or
after determining that a node state is abnormal, automatically repairing the failed node.
According to an embodiment of the disclosure, the rebuilding of the failed cluster according to the machine resources defined by the cluster deployment resource includes:
creating a second task resource according to the machine resources defined by the cluster deployment resource;
executing the second task resource to create a new cluster;
re-issuing the application resources of the failed cluster to the new cluster; and
recycling the resources of the failed cluster.
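The four rebuild steps above can be sketched as a single function. This is only an illustrative sketch: the dictionary layout, the `run_task` callable and the `fake_run_task` stand-in are hypothetical and are not defined by the disclosure.

```python
def rebuild_cluster(failed, spec_machines, run_task):
    """Rebuild a failed cluster in four steps; all names here are illustrative.

    failed:        dict with "name", "apps" and "machines" of the failed cluster
    spec_machines: machines defined by the cluster deployment resource
    run_task:      callable that executes a task resource and returns a cluster dict
    """
    # 1. Create the second task resource from the machines defined in the CR.
    task = {"action": "create-cluster", "name": failed["name"],
            "machines": list(spec_machines)}
    # 2. Execute the task resource to create the new cluster.
    new_cluster = run_task(task)
    # 3. Re-issue the application resources of the failed cluster to the new one.
    new_cluster["apps"] = list(failed["apps"])
    # 4. Recycle the failed cluster's resources so they can be reused.
    reclaimed = list(failed["machines"])
    failed["apps"], failed["machines"] = [], []
    return new_cluster, reclaimed


def fake_run_task(task):
    """Stand-in executor (assumption): pretends the job ran successfully."""
    return {"name": task["name"], "machines": task["machines"], "apps": []}
```

In a real deployment the executor would be the task execution node running the job; here it is replaced by a stub so that the ordering of the steps is the only thing being shown.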
According to an embodiment of the present disclosure, the automatically repairing the failed node includes:
executing a pull-up repair operation according to the application resources of the failed node to repair the failed node;
if the state of the cluster where the failed node is located is still abnormal, repairing the failed node again; and
after the number of repairs exceeds a preset threshold, determining, in a resource pool, an available machine resource to replace the failed node.
According to an embodiment of the present disclosure, the determining, in a resource pool, an available machine resource to replace the failed node includes:
creating a third task resource from the available machine resource; and
executing the third task resource to create a new node, thereby completing the replacement of the failed node.
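The repair-then-replace logic above can be sketched as a bounded retry loop. This is a minimal sketch under stated assumptions: the callables `pull_up`, `cluster_healthy` and `replace_from_pool` are hypothetical placeholders for the pull-up repair, the cluster state check and the third-task-resource replacement, and the default threshold value is arbitrary.

```python
def repair_node(node, pull_up, cluster_healthy, replace_from_pool, threshold=3):
    """Repair a failed node, replacing it once repairs exceed the threshold.

    Returns ("repaired", attempts) or ("replaced", attempts).
    """
    attempts = 0
    # Keep repairing until the number of repairs is greater than the threshold.
    while attempts <= threshold:
        pull_up(node)            # pull-up repair based on the node's application resources
        attempts += 1
        if cluster_healthy():    # cluster state is normal again, stop here
            return "repaired", attempts
    # Repairs exceeded the threshold: swap in an available machine from the pool.
    replace_from_pool(node)
    return "replaced", attempts
```

With `threshold=2`, a node that never recovers is repaired three times (the first count greater than the threshold) and then replaced.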
According to an embodiment of the present disclosure, the cluster deployment resource information includes cluster scale information including a cluster name, a number of nodes, and a number of master nodes, and version information including container management system version information and distributed database version information.
According to an embodiment of the present disclosure, the creating a first task resource according to the cluster change type and the machine resources required by the change includes:
if the cluster change type is determined to be capacity expansion or cluster creation, determining the Internet Protocol (IP) address and access key of the machine resources required by the change according to the cluster deployment resource information; and
mounting the IP address and the access key of the machine resources required by the change into the first task resource.
A second aspect of the present disclosure provides a management plane cluster-based cluster operation and maintenance apparatus, the apparatus comprising:
an acquisition module, configured to acquire cluster deployment resource information in response to a change event of a cluster resource object, wherein the cluster deployment resource information is preconfigured through a management plane cluster;
a determining module, configured to parse the cluster deployment resource information to determine a cluster change type and the machine resources required by the change;
a task resource creation module, configured to create a first task resource according to the cluster change type and the machine resources required by the change; and
a task sending module, configured to send the first task resource to a task execution node.
According to an embodiment of the present disclosure, the apparatus further comprises: a cluster state monitoring module, a first fault repair module and a second fault repair module.
The cluster state monitoring module is configured to periodically monitor node states and cluster states;
the first fault repair module is configured to, after determining that a cluster state is abnormal, determine the machine resources defined by the cluster deployment resource of the failed cluster, and to rebuild the failed cluster according to those machine resources; and
the second fault repair module is configured to automatically repair the failed node after determining that a node state is abnormal.
According to an embodiment of the disclosure, the first fault repair module includes a creation sub-module, an execution sub-module, an issuing sub-module and a recycling sub-module.
The creation sub-module is configured to create a second task resource according to the machine resources defined by the cluster deployment resource;
the execution sub-module is configured to execute the second task resource to create a new cluster;
the issuing sub-module is configured to re-issue the application resources of the failed cluster to the new cluster; and
the recycling sub-module is configured to recycle the resources of the failed cluster.
According to an embodiment of the present disclosure, the second fault repair module includes:
a first fault repair sub-module, configured to execute a pull-up repair operation according to the application resources of the failed node to repair the failed node;
a second fault repair sub-module, configured to repair the failed node again if the state of the cluster where the failed node is located is determined to be still abnormal; and
a replacement sub-module, configured to determine, in a resource pool, an available machine resource to replace the failed node after the number of repairs exceeds a preset threshold.
According to an embodiment of the present disclosure, the replacement sub-module includes a creation unit and a replacement unit.
The creation unit is configured to create a third task resource according to the available machine resource; and
the replacement unit is configured to execute the third task resource to create a new node, thereby completing the replacement of the failed node.
According to an embodiment of the present disclosure, the task resource creation module includes a first determining sub-module and a mounting sub-module.
The first determining sub-module is configured to determine the IP address and access key of the machine resources required by the change according to the cluster deployment resource information if the cluster change type is determined to be capacity expansion or cluster creation; and
the mounting sub-module is configured to mount the IP address and the access key of the machine resources required by the change into the first task resource.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the cluster operation and maintenance method based on the management plane cluster.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described management plane cluster-based cluster operation and maintenance method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described cluster operation and maintenance method based on management plane clusters.
According to the cluster operation and maintenance method based on the management plane cluster provided by the embodiments of the present disclosure, a K8s-on-K8s design is adopted: a user only needs to define clusters of each version and type as cluster deployment resource information and submit it to the management plane cluster. The controllers of the management plane cluster then control four major processes, namely orchestration construction, task triggering, task execution state checking and cluster state checking, and create or update clusters according to the cluster deployment resource information, thereby achieving efficient cluster change operation and maintenance. Compared with the related art, the method manages clusters automatically, efficiently and at low cost, and realizes automated operation and maintenance of multiple clusters.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a management plane cluster-based cluster operation and maintenance method, apparatus, device, medium and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a system architecture diagram of a management plane cluster in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flowchart of a cluster operation and maintenance method based on management plane clusters, provided according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of another cluster operation and maintenance method based on management plane clusters provided in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a cluster failure operation and maintenance method provided in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a cluster node operation and maintenance method provided in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a cluster operation and maintenance device based on management plane clusters according to an embodiment of the disclosure; and
FIG. 8 schematically illustrates a block diagram of an electronic device adapted to implement a management plane cluster-based cluster operation and maintenance method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C, etc." is used, the expression should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
The terms appearing in the embodiments of the present disclosure are explained first:
K8s cluster: Kubernetes (commonly known as K8s) is an open-source container cluster management system for automatically deploying, scaling and managing containerized applications. The system builds a container scheduling service on top of Docker.
Master node: the management node of K8s, which receives management traffic, manages resources and schedules containers. This node is typically used by a system administrator.
Node: a node on which user services run; application containers are deployed on these nodes to provide services externally.
Kube-apiserver: the gateway of the K8s cluster and the entry point for external traffic, deployed on the master node.
Etcd: a distributed key-value storage system, which can be simply understood as a highly available, non-relational database.
CustomResourceDefinitions (CRD): a cluster-level resource definition that defines a new resource type. Once defined, the new resource type is added to and available across the whole cluster. Everything in Kubernetes can be regarded as a resource; since Kubernetes 1.7, CRDs have provided a secondary development capability for extending Kubernetes with custom resources.
Operator: a controller for a specific application that extends the K8s API to create, configure and manage instances of complex applications on behalf of K8s users.
Ansible job: mainly consists of a K8s Job, ConfigMap and Secret, together with a self-developed job controller. The Job is primarily used to execute Ansible scripts.
In the related art, operation and maintenance of a large-scale cluster management platform involves a huge number of clusters of various types, thousands of nodes and complex components, and it is very difficult for operation and maintenance personnel to modify and deploy the component configuration of each machine. Currently, cluster operation and maintenance requires orchestrating tasks through Ansible, executing operation and maintenance scripts, and specifying IP addresses to operate the machines in the cluster.
This mode of operation has mainly the following problems:
1) Operation and maintenance must be performed manually, so operational and configuration errors are likely to occur.
2) Deployment scripts and tools have no version control, and managing the configuration of machines across a large number of production clusters is very cumbersome.
3) Faults that occur in production (such as a lost configuration file) cannot be handled in time, and a separate operation and maintenance script has to be written to repair each one, which is time-consuming and labor-intensive.
Based on the above technical problems, an embodiment of the present disclosure provides a cluster operation and maintenance method based on a management plane cluster, where the method includes: in response to a change event of a cluster resource object, acquiring cluster deployment resource information, wherein the cluster deployment resource information is preconfigured through a management plane cluster; parsing the cluster deployment resource information to determine a cluster change type and the machine resources required by the change; creating a first task resource according to the cluster change type and the machine resources required by the change; and executing the first task resource by the task execution node to complete the change of the cluster resource.
Fig. 1 schematically illustrates an application scenario diagram of a cluster operation and maintenance method, apparatus, device, medium and program product based on management plane clusters according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a multi-cluster operation-maintenance scenario. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a K8s management plane cluster server. It selects the corresponding machine resources from a resource pool according to the cluster deployment resource information created by a user, and the controllers of the management plane control four major processes: Ansible construction, job triggering, job state checking and cluster state checking, thereby creating or updating the cluster required by the administrator.
It should be noted that, the cluster operation and maintenance method based on the management plane cluster provided in the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the cluster operation and maintenance device based on the management plane cluster provided in the embodiments of the present disclosure may be generally disposed in the server 105. The cluster operation and maintenance method based on the management plane cluster provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the cluster operation and maintenance device based on the management plane cluster provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that, the cluster operation and maintenance method and device based on the management plane cluster determined in the embodiments of the present disclosure may be used in the technical field of cloud computing, or may also be used in the technical field of finance, or may be used in any field other than the financial field, and the application field of the cluster operation and maintenance method and device based on the management plane cluster determined in the embodiments of the present disclosure is not limited.
Fig. 2 schematically illustrates a system architecture diagram of a management plane cluster according to an embodiment of the present disclosure. The cluster operation and maintenance process of the embodiments of the present disclosure is described with reference to fig. 2. As shown in fig. 2, the management plane cluster deploys a plurality of controllers, including a Cluster Deployment controller, a Cluster Manager controller, a Cluster Health controller and an Ansible Job controller. The cluster deployment controller is responsible for centralized management of the cluster CR resources defined by users. The cluster management controller is responsible for managing the resource pool attached to the whole platform and for classifying and managing different machine types; when certain machine resources are needed, it mounts the relevant IP addresses and access keys to an Ansible job. The cluster health controller is responsible for monitoring cluster states and reporting the state of every service cluster to the management plane in a timely manner, so that the states can be visualized in the front end. The task orchestration controller (the Ansible Job controller) is responsible for monitoring the completion of Ansible jobs, in a manner similar to the cluster health controller.
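As a rough illustration of the resource pool handled by the cluster management controller, the sketch below groups machines by type and hands out the IP address and access key of free machines on request. The `Machine` and `ResourcePool` names, the field names and the error type are assumptions made for this example; the disclosure does not specify a data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    ip: str            # IP address later mounted into the job
    access_key: str    # access key later mounted into the job
    machine_type: str  # classification used by the pool, e.g. "cpu" or "gpu"


class ResourcePool:
    """Hypothetical pool that classifies free machines by type."""

    def __init__(self, machines):
        self._free = defaultdict(list)
        for machine in machines:
            self._free[machine.machine_type].append(machine)

    def available(self, machine_type):
        """Number of free machines of the given type."""
        return len(self._free[machine_type])

    def allocate(self, machine_type, count):
        """Take `count` free machines of one type out of the pool."""
        free = self._free[machine_type]
        if len(free) < count:
            raise LookupError(f"only {len(free)} free {machine_type} machines")
        taken, self._free[machine_type] = free[:count], free[count:]
        return taken
```

Failing loudly when the pool is short, instead of returning a partial allocation, keeps a change from starting with fewer machines than the CR asks for; whether the actual system behaves this way is not stated in the source.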
The cluster operation and maintenance method based on the management plane cluster according to the embodiment of the present disclosure will be described in detail below based on the scenario described in fig. 1 and the system architecture of fig. 2 through fig. 3 to 6.
Fig. 3 schematically illustrates a flowchart of a cluster operation and maintenance method based on management plane clusters according to an embodiment of the disclosure. As shown in fig. 3, the cluster operation and maintenance method based on the management plane cluster of this embodiment includes operations S210 to S240, and the method may be performed by a server or other computing device.
In operation S210, cluster deployment resource information is acquired in response to a change event of a cluster resource object.
According to an embodiment of the present disclosure, the cluster deployment resource information includes cluster scale information including a cluster name, a number of nodes, and a number of master nodes, and version information including container management system version information and distributed database version information.
According to an embodiment of the present disclosure, the cluster deployment resource information is preconfigured by a management plane cluster.
In one example, the management plane cluster is deployed with a cluster deployment controller and a task orchestration controller. When a user needs to perform an operation and maintenance operation on a cluster, taking the creation of a new cluster as an example, a Cluster Deployment CR resource, namely the cluster deployment resource information, is first created in the management plane cluster. This information defines the cluster scale, including the number of Node nodes, the specific system configuration (CPU/MEMORY/DISK/GPU) of each node, and so on. The CR of each cluster may be made up of the following fields:
CR name | Cluster-1 |
Node number | 100 |
K8S version | 1.17.5 |
ETCD version | 3.5.0 |
Master node number | 3 |
The CR resource of a cluster defines the overall state of that cluster. In subsequent operation and maintenance, the running cluster only needs to be kept consistent with the defined CR resource. For example, for update operations such as scaling the cluster out or in or upgrading a version, the user only needs to modify the CR resource in the K8s management plane. After the cluster deployment controller of the K8s management plane detects the CR resource change, it automatically performs the operation and maintenance operation according to the changed CR resource.
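To make the CR fields concrete, the sketch below mirrors the example values in a small Python structure and renders them as a CR-shaped dictionary. The class name, field names, `apiVersion` and `kind` are all hypothetical, and reading the value 100 as the node count is an assumption based on the cluster scale fields listed earlier.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ClusterDeploymentSpec:
    """Hypothetical in-memory mirror of the cluster deployment CR fields."""
    name: str          # CR name
    node_count: int    # number of nodes (assumed meaning of the value 100)
    k8s_version: str   # container management system version
    etcd_version: str  # distributed database version
    master_count: int  # number of master nodes


def to_custom_resource(spec):
    """Render the spec as a CR-shaped dict; apiVersion and kind are made up."""
    return {
        "apiVersion": "ops.example.com/v1",  # assumed group/version
        "kind": "ClusterDeployment",         # assumed kind name
        "metadata": {"name": spec.name},
        "spec": {k: v for k, v in asdict(spec).items() if k != "name"},
    }


cluster_1 = ClusterDeploymentSpec("Cluster-1", 100, "1.17.5", "3.5.0", 3)
cr = to_custom_resource(cluster_1)
```

Under this reading, scaling or upgrading a cluster amounts to editing one field of the spec and letting the controller reconcile the running cluster against it.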
In operation S220, the cluster deployment resource information is parsed to determine a cluster change type and machine resources required for the change.
In one example, after the cluster deployment controller detects a change in the cluster deployment resource information, it parses the information and compares it with the current state of the cluster to determine the cluster change type. If new machine resources are needed, for example for capacity expansion or cluster creation, the cluster management controller is triggered; it finds the machines required by the CR in the resource pool it manages and prepares the IP addresses and access keys of these machines. If no new machine resources are needed, for example for capacity reduction or a version update, the cluster management controller is not triggered.
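One way to picture this comparison is a small function that takes the current cluster state and the desired CR and returns the change type together with the number of new machines to request. The change-type labels and dictionary keys below are hypothetical; the source names only expansion, cluster creation, reduction and version update as examples.

```python
from typing import Optional, Tuple


def classify_change(current: Optional[dict], desired: dict) -> Tuple[str, int]:
    """Return (change_type, new_machines_needed) for a CR change.

    `current` is None when no cluster exists yet for this CR.
    """
    if current is None:
        # Cluster creation: every node needs a machine from the pool.
        return "create", desired["nodes"]
    delta = desired["nodes"] - current["nodes"]
    if delta > 0:
        # Capacity expansion: only the additional nodes need machines.
        return "scale-out", delta
    if delta < 0:
        # Capacity reduction: no new machines are needed.
        return "scale-in", 0
    versions = ("k8s_version", "etcd_version")
    if any(desired[v] != current[v] for v in versions):
        # Version update: no new machines are needed.
        return "upgrade", 0
    return "none", 0
```

Only the first two outcomes report a non-zero machine count, which corresponds to the cases in which the cluster management controller would be triggered.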
In operation S230, a first task resource is created according to the cluster change type and the machine resources required for the change.
According to the embodiment of the disclosure, if the cluster change type is determined to be capacity expansion or cluster creation, the IP address and access key of the machine resources required by the change are determined according to the cluster deployment resource information, and the IP address and access key are mounted into the first task resource.
In operation S240, the first task resource is transmitted to a task execution node.
In one example, a first task resource, namely a Job resource, is created, and the machine resource information determined in operation S220 is mounted to the Job. The cluster management controller sends the prepared first task resource to the kube-apiserver. The kube-scheduler, which watches the kube-apiserver, receives the request for the Job resource and schedules the Job to a task execution node. The task execution node executes the first task resource to complete the change of the cluster resource.
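A Job resource of this kind could look roughly like the manifest built below, with the target IP addresses passed as an environment variable and the access key mounted from a Secret. This is a sketch only: the image name, playbook naming, environment variable and mount path are hypothetical, and the source does not say how the IP addresses and keys are actually mounted. The structural fields (`batch/v1`, `backoffLimit`, `restartPolicy`, Secret volumes) are standard Kubernetes Job fields.

```python
def build_ansible_job(cluster, change_type, hosts, key_secret):
    """Build a Kubernetes Job manifest (as a dict) for one cluster change."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{cluster.lower()}-{change_type}"},
        "spec": {
            "backoffLimit": 0,  # do not retry automatically in this sketch
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "ansible",
                        # Hypothetical image that bundles Ansible and the playbooks.
                        "image": "example.local/ansible-runner:latest",
                        "command": ["ansible-playbook", f"{change_type}.yml"],
                        # IP addresses of the machines involved in the change.
                        "env": [{"name": "TARGET_HOSTS", "value": ",".join(hosts)}],
                        # Access key mounted read-only from a Secret.
                        "volumeMounts": [{"name": "access-key",
                                          "mountPath": "/keys",
                                          "readOnly": True}],
                    }],
                    "volumes": [{"name": "access-key",
                                 "secret": {"secretName": key_secret}}],
                }
            },
        },
    }


job = build_ansible_job("Cluster-1", "scale-out",
                        ["10.0.0.1", "10.0.0.2"], "cluster-1-ssh-key")
```

Submitting the resulting dictionary to the kube-apiserver (for example through a Kubernetes client library) is left out here, since the source does not specify the client used.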
According to the cluster operation and maintenance method based on the management plane cluster provided by the embodiments of the present disclosure, a K8s-on-K8s design is adopted: a user only needs to define clusters of each version and type as cluster deployment resource information and submit it to the management plane cluster. The controllers of the management plane then control four major processes, namely orchestration construction, task triggering, task execution state checking and cluster state checking, and the cluster deployment controller creates or updates a cluster according to the cluster deployment resource information, thereby achieving efficient cluster change operation and maintenance. Compared with the related art, the method manages clusters automatically, efficiently and at low cost, and realizes automated operation and maintenance of multiple clusters.
When a cluster or a node fails, the operation and maintenance method provided by the embodiments of the disclosure can also respond quickly and repair the fault automatically. Fig. 4 schematically illustrates a flowchart of another cluster operation and maintenance method based on management plane clusters according to an embodiment of the present disclosure. As shown in Fig. 4, the method includes operations S310 to S340.
In operation S310, node states and cluster states are monitored periodically.
In operation S320, upon determining that the cluster state is abnormal, the machine resources defined by the cluster deployment resource of the failed cluster are determined.
In operation S330, the failed cluster is rebuilt according to the machine resources defined by the cluster deployment resource.
In operation S340, upon determining that the node state is abnormal, the failed node is automatically repaired.
In one example, after the task execution node executes the task resource to complete the cluster change, the cluster health controller periodically monitors the node and cluster states in the CR and checks the real-time state of each key component; if all checks succeed, the state of the new cluster is defined as OK. In addition, the task orchestration controller periodically checks the execution information of the job and promptly feeds the execution status back to the management plane. When the cluster state is determined to be abnormal, the system administrator is notified, the machine resources defined by the cluster deployment resource of the failed cluster are determined in the resource pool, and the failed cluster is rebuilt according to those machine resources. When the node state is determined to be abnormal, the failed node is repaired automatically.
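The dispatch implied by operations S310 to S340 can be sketched as a small decision function. This is only an illustration: the names `Action` and `decide_action` are hypothetical, and giving a cluster-level fault priority over node-level faults is an assumption made here, not something the disclosure states.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Action(Enum):
    NONE = "none"
    REBUILD_CLUSTER = "rebuild_cluster"   # corresponds to S320 and S330
    REPAIR_NODES = "repair_nodes"         # corresponds to S340


def decide_action(cluster_healthy: bool,
                  node_health: Dict[str, bool]) -> Tuple[Action, List[str]]:
    """Map one round of monitoring results (S310) to a follow-up action."""
    failed_nodes = sorted(name for name, ok in node_health.items() if not ok)
    if not cluster_healthy:
        # assumption: a cluster-level fault takes priority over node repair
        return Action.REBUILD_CLUSTER, failed_nodes
    if failed_nodes:
        return Action.REPAIR_NODES, failed_nodes
    return Action.NONE, []


action, nodes = decide_action(True, {"node-a": True, "node-b": False})
```

A periodic loop in the health controller could call such a function after each check and hand the result to the rebuild or repair flow.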
Fig. 5 schematically illustrates a flowchart of a cluster failure operation and maintenance method provided according to an embodiment of the present disclosure. Fig. 6 schematically illustrates a flowchart of a cluster node operation and maintenance method provided according to an embodiment of the present disclosure. As shown in fig. 5, operation S330 includes operations S331 to S334.
In operation S331, a second task resource is newly created according to the machine resource defined by the cluster deployment resource. In operation S332, the second task resource is executed to create a new cluster. In operation S333, the application resources in the failed cluster are re-issued to the new cluster. In operation S334, the resources of the failed cluster are reclaimed.
In one example, all machine resources defined by the CR of the failed cluster are found in the resource pool managed by the cluster management controller, the cluster installation is performed according to operations S220 to S240 described above, and the controller waits for the installation to complete. All application resources in the failed cluster are then re-issued to the new cluster. After all application containers have been pulled up, all resources of the original cluster are reclaimed.
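The ordering of operations S331 to S334 can be illustrated with a short Python sketch. The function `rebuild_cluster` and its callback parameters are hypothetical placeholders for the real controller logic; the point of the sketch is only that the old cluster's resources are reclaimed last, after the applications have been re-issued to the new cluster.

```python
from typing import Any, Callable, List


def rebuild_cluster(machines: List[str],
                    create_task: Callable[[List[str]], Any],
                    run_task: Callable[[Any], str],
                    reissue_apps: Callable[[str], None],
                    reclaim_old: Callable[[], None]) -> str:
    """Run the rebuild steps in the order given by S331 to S334."""
    task = create_task(machines)   # S331: create the second task resource
    new_cluster = run_task(task)   # S332: execute it to create a new cluster
    reissue_apps(new_cluster)      # S333: re-issue application resources
    reclaim_old()                  # S334: reclaim the failed cluster's resources
    return new_cluster


# Usage with stub callbacks that only record the call order.
log: List[str] = []
result = rebuild_cluster(
    ["10.0.0.1", "10.0.0.2"],
    create_task=lambda m: log.append("create") or {"machines": m},
    run_task=lambda t: log.append("run") or "cluster-new",
    reissue_apps=lambda c: log.append(f"reissue:{c}"),
    reclaim_old=lambda: log.append("reclaim"),
)
```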
As shown in fig. 6, operation S340 includes operations S341 to S343.
In operation S341, a repair job is pulled up according to the application resource of the failed node to repair the failed node. In operation S342, if the state of the cluster where the failed node is located is determined to be still abnormal, the failed node is repaired again. In operation S343, after the number of repair attempts exceeds a preset threshold, an available machine resource is determined in the resource pool to replace the failed node.
In one example, after determining that a cluster node has failed, the cluster management controller re-executes operations S220 to S240 to reinstall the failed node, and invokes kube-scheduler to pull up a repair job. After the repair job completes, the state of the cluster where the failed node is located is checked again; if the cluster state is normal, the repair flow ends. If the cluster is still abnormal after 3 repair attempts, the fault can be judged to be caused by the operating system or the network; the failed node is removed from the cluster, another available machine is found in the resource pool, and the installation is carried out again according to operations S220 to S240. If this attempt still fails, the system administrator is notified. If it succeeds, the original failed machine is reclaimed.
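The retry-then-replace flow of operations S341 to S343 can be sketched as follows. The function `repair_or_replace`, its callbacks, and the returned outcome strings are hypothetical; the limit of 3 attempts follows the example above, whereas the claims refer more generally to a preset threshold.

```python
from typing import Callable, List, Optional, Tuple


def repair_or_replace(node: str, pool: List[str],
                      try_repair: Callable[[str], None],
                      cluster_ok: Callable[[], bool],
                      install_on: Callable[[str], bool],
                      max_attempts: int = 3) -> Tuple[str, Optional[str]]:
    """Repair a failed node; fall back to a replacement from the pool."""
    for _ in range(max_attempts):
        try_repair(node)              # S341 / S342: pull up a repair job
        if cluster_ok():              # re-check the cluster state
            return "repaired", node
    if not pool:
        return "notify_admin", None   # no machine available for replacement
    replacement = pool.pop(0)         # S343: pick an available machine
    if install_on(replacement):
        return "replaced", replacement
    return "notify_admin", None       # replacement failed as well


# Usage: the cluster never recovers, so the node is replaced.
outcome = repair_or_replace("node-1", ["spare-1"],
                            try_repair=lambda n: None,
                            cluster_ok=lambda: False,
                            install_on=lambda m: True)
```

Reclaiming the original failed machine after a successful replacement is left to the caller in this sketch.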
Based on the above cluster operation and maintenance method based on the management plane cluster, the disclosure also provides a cluster operation and maintenance device based on the management plane cluster. The device is described in detail below with reference to Fig. 7.
Fig. 7 schematically illustrates a block diagram of a cluster operation and maintenance device based on a management plane cluster according to an embodiment of the disclosure.
As shown in fig. 7, the cluster operation and maintenance device 700 based on the management plane cluster in this embodiment includes an acquisition module 710, a determination module 720, a task resource creation module 730, and a task transmission module 740.
The acquisition module 710 is configured to acquire cluster deployment resource information in response to a change event of a cluster resource object. In an embodiment, the acquisition module 710 may be configured to perform operation S210 described above, which is not repeated here.
The determination module 720 is configured to parse, by the cluster deployment controller, the cluster deployment resource information to determine a cluster change type and the machine resources required for the change. In an embodiment, the determination module 720 may be configured to perform operation S220 described above, which is not repeated here.
The task resource creation module 730 is configured to create, by the task orchestration controller, a first task resource according to the cluster change type and the machine resources required for the change. In an embodiment, the task resource creation module 730 may be configured to perform operation S230 described above, which is not repeated here.
The task transmission module 740 is configured to send the first task resource to a task execution node. In an embodiment, the task transmission module 740 may be configured to perform operation S240 described above, which is not repeated here.
According to an embodiment of the present disclosure, the device further includes a cluster state monitoring module, a first fault repair module, and a second fault repair module.
The cluster state monitoring module is configured to periodically monitor the node state and the cluster state.
The first fault repair module is configured to, after determining that the cluster state is abnormal, determine the machine resources defined by the cluster deployment resource of the failed cluster, and rebuild the failed cluster according to the machine resources defined by the cluster deployment resource.
The second fault repair module is configured to automatically repair the failed node after determining that the node state is abnormal.
According to an embodiment of the disclosure, the first fault repair module includes a creation sub-module, an execution sub-module, an issuing sub-module, and a reclaiming sub-module.
The creation sub-module is configured to create a second task resource according to the machine resources defined by the cluster deployment resource.
The execution sub-module is configured to execute the second task resource to create a new cluster.
The issuing sub-module is configured to re-issue the application resources in the failed cluster to the new cluster.
The reclaiming sub-module is configured to reclaim the resources of the failed cluster.
According to an embodiment of the present disclosure, the second fault repair module includes a first fault repair sub-module, a second fault repair sub-module, and a replacement sub-module.
The first fault repair sub-module is configured to pull up a repair job according to the application resource of the failed node to repair the failed node.
The second fault repair sub-module is configured to repair the failed node again if the state of the cluster where the failed node is located is still abnormal.
The replacement sub-module is configured to determine, in the resource pool, an available machine resource to replace the failed node after the number of repair attempts exceeds a preset threshold.
According to an embodiment of the present disclosure, the replacement sub-module includes a creation unit and a replacement unit.
The creation unit is configured to create a third task resource according to the available machine resource.
The replacement unit is configured to execute the third task resource to create a new node, so as to complete the replacement of the failed node.
According to an embodiment of the present disclosure, the task resource creation module includes a first determination sub-module and a mounting sub-module.
The first determination sub-module is configured to, if the cluster change type is determined to be capacity expansion or creation of a new cluster, determine the Internet Protocol (IP) address and the access key of each machine resource required for the change according to the cluster deployment resource information.
The mounting sub-module is configured to mount the IP address and the access key of each machine resource required for the change into the first task resource.
Any of the acquisition module 710, the determination module 720, the task resource creation module 730, and the task transmission module 740 may be combined in one module to be implemented, or any of the modules may be split into a plurality of modules according to an embodiment of the present disclosure. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 710, the determination module 720, the task resource creation module 730, and the task transmission module 740 may be implemented, at least in part, as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging circuitry, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, at least one of the acquisition module 710, the determination module 720, the task resource creation module 730, and the task transmission module 740 may be at least partially implemented as a computer program module that, when executed, may perform the corresponding functions.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement a management plane cluster-based cluster operation and maintenance method according to an embodiment of the present disclosure.
As shown in fig. 8, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 901 may also include on-board memory for caching purposes. Processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, the input/output (I/O) interface 905 also being connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 910 so that a computer program read out therefrom is installed into the storage section 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs that, when executed, implement a cluster operation and maintenance method based on management plane clusters according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. When the computer program product runs in a computer system, the program code is used for enabling the computer system to realize the cluster operation and maintenance method based on the management plane clusters provided by the embodiment of the disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, via communication portion 909, and/or installed from removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined in a variety of ways, even if such combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined in various ways without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.
Claims (11)
1. A cluster operation and maintenance method based on a management plane cluster, the method comprising:
acquiring cluster deployment resource information in response to a change event of a cluster resource object, wherein the cluster deployment resource information is preconfigured through a management plane cluster;
parsing the cluster deployment resource information to determine a cluster change type and machine resources required for the change;
creating a first task resource according to the cluster change type and the machine resources required for the change; and
sending the first task resource to a task execution node.
2. The method according to claim 1, wherein the method further comprises:
periodically monitoring node states and cluster states;
after determining that the cluster state is abnormal, determining machine resources defined by the cluster deployment resource of the failed cluster;
rebuilding the failed cluster according to the machine resources defined by the cluster deployment resource; and
after determining that the node state is abnormal, automatically repairing the failed node.
3. The method of claim 2, wherein the rebuilding the failed cluster according to the machine resources defined by the cluster deployment resource comprises:
creating a second task resource according to the machine resources defined by the cluster deployment resource;
executing the second task resource to create a new cluster;
re-issuing the application resources in the failed cluster to the new cluster; and
reclaiming the resources of the failed cluster.
4. The method of claim 2, wherein the automatically repairing the failed node comprises:
pulling up a repair job according to the application resource of the failed node to repair the failed node;
if the state of the cluster where the failed node is located is still abnormal, repairing the failed node again; and
after the number of repair attempts exceeds a preset threshold, determining, in a resource pool, an available machine resource to replace the failed node.
5. The method of claim 4, wherein the determining, in a resource pool, an available machine resource to replace the failed node comprises:
creating a third task resource according to the available machine resource; and
executing the third task resource to create a new node, so as to complete the replacement of the failed node.
6. The method of any of claims 1-5, wherein the cluster deployment resource information comprises cluster size information including a cluster name, a number of nodes, and a number of master nodes, and version information including container management system version information and distributed database version information.
7. The method of claim 6, wherein the creating the first task resource according to the cluster change type and the machine resources required for the change comprises:
if the cluster change type is determined to be capacity expansion or creation of a new cluster, determining an Internet Protocol address and an access key of each machine resource required for the change according to the cluster deployment resource information; and
mounting the Internet Protocol address and the access key of each machine resource required for the change into the first task resource.
8. A cluster operation and maintenance device based on a management plane cluster, the device comprising:
an acquisition module, configured to acquire cluster deployment resource information in response to a change event of a cluster resource object, wherein the cluster deployment resource information is preconfigured through a management plane cluster;
a determination module, configured to parse the cluster deployment resource information to determine a cluster change type and machine resources required for the change;
a task resource creation module, configured to create a first task resource according to the cluster change type and the machine resources required for the change; and
a task transmission module, configured to send the first task resource to a task execution node.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the management plane cluster-based cluster operation and maintenance method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to perform the management plane cluster-based cluster operation and maintenance method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program which, when executed by a processor, implements the management plane cluster-based cluster operation and maintenance method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310188510.5A CN116166465A (en) | 2023-02-27 | 2023-02-27 | Cluster operation and maintenance method and device based on management plane cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116166465A true CN116166465A (en) | 2023-05-26 |
Family
ID=86419852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310188510.5A Pending CN116166465A (en) | 2023-02-27 | 2023-02-27 | Cluster operation and maintenance method and device based on management plane cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116166465A (en) |
- 2023-02-27: CN application CN202310188510.5A (patent/CN116166465A/en), active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||