KR20240085328A

KR20240085328A - Failure node eliminator and control method thereof

Info

Publication number: KR20240085328A
Application number: KR1020220169799A
Authority: KR
Inventors: 허지현; 이어형; 손준호; 권대경; 권진철; 한정수; 김유하; 허진
Original assignee: 주식회사 카카오엔터프라이즈
Priority date: 2022-12-07
Filing date: 2022-12-07
Publication date: 2024-06-17

Abstract

The present invention relates to a failed node removal device that allows a cluster to handle a compute node failure by itself, without user intervention. More specifically, the present invention relates to a technology that monitors the current state of at least one node, stores a node CR (Custom Resource) representing the desired state of a node, identifies a failed node among the at least one node based on the monitoring result and the node CR, and performs processing on the identified failed node.

Description

FAILURE NODE ELIMINATOR AND CONTROL METHOD THEREOF

The present invention relates to operating the hypervisor resources of an IaaS service offered by a cloud service provider by applying the controller pattern used in container orchestration services. More specifically, it relates to a technology that allows the cluster itself to handle a compute node failure without user intervention.

From the standpoint of managing cloud infrastructure, IT resources (computing, networking, storage) can be virtualized and managed efficiently. For example, computing resources are mainly virtualized using virtual machines (VMs).

OpenStack is open-source software that lets an enterprise build a cloud service internally, for private use, in order to provide an IaaS service.

Recently, as most applications have moved from a monolithic architecture to a microservices architecture, using agile containers instead of virtual machines has become the prevailing approach.

Containerization refers to a virtualization method based on operating-system-level virtualization. Containers are lightweight, portable execution elements for applications, isolated from one another and from the host.

In line with this trend, cloud-native computing, in which all elements of a service are turned into microservices and managed as containers under container orchestration, is on the rise.

Accordingly, research is needed on building cloud services based on cloud-native computing.

An object of the present invention is to provide a node removal device that can manage the state of resources declaratively by using the controller pattern when managing the IT resources of a cloud service such as the open-source OpenStack.

Another object of the present invention is to provide a node removal device that can effectively manage and remove node failures while minimizing administrator involvement.

Another object of the present invention is to provide a system that can make use of cloud services on a container orchestration platform.

Another object of the present invention is to provide a system that reliably migrates customer resources to a healthy node with minimal downtime.

The technical objects of the present invention are not limited to those mentioned above, and other technical objects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above or other objects, one aspect of the present invention provides a node removal device comprising: a monitoring unit that monitors the current state of at least one node; a node CR storage unit that stores a node CR (Custom Resource) representing the desired state of a node; and a node removal controller that identifies a failed node among the at least one node based on the monitoring result and the node CR and performs processing on the identified failed node.

The failed node may be a node whose monitored current state differs from the stored desired state.

The node removal controller may perform the processing on the identified failed node in a declarative manner.

The node removal controller may perform reconciliation on the identified failed node and remove it.

The device may further include an access authority storage unit that stores access authority information for accessing the at least one node.

The monitoring unit may obtain a client for the at least one node using the stored access authority information.

The device may further include an AZ (Availability Zone) cluster controller that defines the node CR for the at least one node.

The node CR stored in the node CR storage unit may be the node CR defined by the AZ cluster controller.

The device may further include a label input unit that receives a label change input from a user, and the monitoring unit may identify, as a failed node, a node whose label has been changed based on the label change input.

To achieve the above or other objects, another aspect of the present invention provides a control method of a node removal device, comprising: monitoring, by a monitoring unit, the current state of at least one node; storing, by a node CR storage unit, a node CR (Custom Resource) representing the desired state of a node; and identifying, by a node removal controller, a failed node among the at least one node as a result of the monitoring, and performing processing on the identified failed node.

In the performing of the processing, reconciliation may be performed on the identified failed node and the node may be removed.

The method may further include storing, by an access authority storage unit, access authority information for accessing the at least one node.

In the monitoring, a client for the at least one node may be obtained using the stored access authority information.

The method may further include defining, by an AZ cluster controller, the node CR for the at least one node.

The node CR stored in the node CR storage unit may be the node CR defined by the AZ cluster controller.

The method may further include receiving, by a label input unit, a label change input from a user, and in the monitoring, a node whose label has been changed based on the label change input may be identified as a failed node.
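
As an illustration only, label-based identification can be sketched in Python by comparing two snapshots of node labels. The specification does not name a label key or a data format; the key `example.com/eliminate`, the function name, and the dictionary layout below are hypothetical.

```python
from typing import Dict, List

Labels = Dict[str, str]


def nodes_with_changed_labels(before: Dict[str, Labels],
                              after: Dict[str, Labels]) -> List[str]:
    """Return the nodes present in both snapshots whose labels differ."""
    return sorted(
        name for name, labels in after.items()
        if name in before and before[name] != labels
    )


# Hypothetical label key; an operator flips it to mark a node for removal.
LABEL = "example.com/eliminate"
before = {"hv-01": {LABEL: "false"}, "hv-02": {LABEL: "false"}}
after = {"hv-01": {LABEL: "true"}, "hv-02": {LABEL: "false"}}
flagged = nodes_with_changed_labels(before, after)
print(flagged)
```

In this sketch a node that appears only in the later snapshot is not flagged, since there is no earlier label set to compare against; that choice is an assumption, not something the specification states.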

The effects of the failed node removal device and its control method according to the present invention are as follows.

According to at least one embodiment of the present invention, a node removal device can be provided that manages the state of resources declaratively by using the controller pattern when managing the IT resources of a cloud service such as the open-source OpenStack.

In addition, according to at least one embodiment of the present invention, a node removal device can be provided that effectively manages and removes node failures while minimizing administrator involvement.

In addition, according to at least one embodiment of the present invention, a system can be provided that a cloud service provider can use on a container orchestration platform.

In addition, according to at least one embodiment of the present invention, when a failure occurs in a hypervisor constituting the IaaS service provided by a cloud service provider, the failure can be detected quickly and customer resources can be reliably migrated to a healthy node with minimal downtime, thereby improving the SLA (service level agreement) of the service.

Further scope of applicability of the present invention will become apparent from the detailed description below. However, since various changes and modifications within the spirit and scope of the present invention will be clearly understood by those skilled in the art, the detailed description and specific embodiments, such as the preferred embodiments of the present invention, should be understood as being given by way of example only.

FIG. 1 is a block diagram of a failed node removal device 100 according to an embodiment of the present invention.
FIG. 2 is a block diagram of a node removal pod 116 according to an embodiment of the present invention.
FIG. 3 is a diagram showing a node removal procedure using the open-source OpenStack according to an embodiment of the present invention.
FIG. 4 is a control flowchart in which the AZ cluster controller 202 defines a CR according to an embodiment of the present invention.
FIG. 5 is a node removal flowchart of the node removal pod 116 according to an embodiment of the present invention.
FIG. 6 is a conceptual diagram showing how the node removal pod 116 removes a node according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the configuration of the failed node removal device 100 according to an embodiment.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not in themselves have distinct meanings or roles.

In describing the embodiments disclosed in this specification, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed in this specification; they do not limit the technical idea disclosed herein, which should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms containing ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The Kubernetes (k8s) platform, which manages services declaratively in pursuit of simple and cost-effective operation in complex IT environments, is a representative example of a container orchestration service.

In a container orchestration architecture, a cluster is a set of nodes, in a physical or virtual environment, that host containerized applications. That is, in a data center environment based on container orchestration, the resources configured in the host environment are abstracted and managed in units of clusters. Each cluster has a master node that serves as the control plane controlling the elements inside the cluster, and the administrator controls the entire cluster through this master node.

In the following, Kubernetes is used as a representative example of a container orchestration platform, but the present invention is not necessarily limited to it, and various kinds of container orchestration platforms may be applied to the present invention.

Likewise, terms that are widely used in Kubernetes appear below, but the use of these terms does not limit the present invention to the Kubernetes container orchestration platform.

Because containers are not tightly coupled to the host hardware computing environment, an application can be bound to a container image and run as a single lightweight package on any host or virtual host that supports the underlying container architecture. For this reason, containers can solve many of the problems that arise when software is run in different computing environments.

Moreover, because a container is an environment that virtualizes the operating system, it is lighter than a virtual machine. Under the same environment and conditions, more container instances can be supported than conventional virtual machines, and containers can be created and removed quickly.

A hypervisor, also called a virtual machine monitor, is a process that creates and runs virtual machines (VMs). A hypervisor virtually shares the resources of a single host computer, such as memory and processing, so that the host computer can support multiple guest virtual machines.

Containers, which are often short-lived, can be created and moved more efficiently than virtual machines. Containers can also be managed as groups of logically related elements. In some orchestration platforms, such as Kubernetes, these elements are called "Pods", and a group of nodes that provides such Pods is called a "cluster".

The terms pod and cluster are used in the present invention below, but they serve only to refer to such elements and their groups, and the present invention is not limited to the Kubernetes platform.

In a container orchestration architecture, a cluster is a set of nodes, in a physical or virtual environment, that host containerized applications. That is, in a data center environment based on container orchestration, the resources configured in the host environment are abstracted and managed in units of clusters.

Each cluster has a master node that serves as the control plane controlling the elements inside the cluster, and the administrator controls the entire cluster through this master node. The control plane role means deploying and controlling the major components of the whole cluster, such as workload resources.

In the following, OpenStack is described as a representative example of a service for providing cloud services, but the present invention is not limited to OpenStack.

When the nodes that make up a cloud service are classified by role, they may include three types, a controller node, a compute node, and a network node, as follows.

Controller node: hosts the services that manage the cloud service as a whole. RESTful API servers are installed for management services such as the authentication service (for example, keystone in OpenStack), the image service (for example, glance in OpenStack), and the compute service (for example, nova in OpenStack).

Compute node: a node used to run instances of the compute service (Nova-based). For HA (High Availability), compute nodes are divided by AZ (Availability Zone) and managed as hypervisors.

DHCP (Dynamic Host Configuration Protocol) node: automatically provides hosts with an IP address and other related configuration information such as the subnet mask and default gateway.

The present invention described below proposes using a declarative approach, rather than a procedural one, when extending IT resources or managing the state of resources on the basis of a container orchestration system.

IT resources are the components used to provide IT services and solutions, such as hardware, software, and networking, and they come in a wide variety of types.

If diverse IT resources such as PMs (Physical Machines), VMs (Virtual Machines), DBs (Databases), networks, k8s (Kubernetes), IaaS (Infrastructure as a Service), and PaaS (Platform as a Service) could be managed in one unified way, convenience would increase considerably.

The declarative approach according to an embodiment of the present invention defines the desired state of a target IT resource (referred to in this specification as the ideal state) and repeatedly performs an operation (loop process) so that the current state of the IT resource becomes identical to the desired state.

The declarative approach can be carried out through the built-in controller pattern provided by the open-source Kubernetes. Kubernetes raises an event whenever the state of a resource changes, compares that state with the requested (desired) state of the IT resource, and retries the operation until the two match. This repeated operation is called reconciliation.
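
The reconciliation described above can be illustrated with a minimal Python sketch. It does not use the Kubernetes API; the names `drift`, `reconcile`, `observe`, and `apply_step`, the dictionary-based state, and the `max_rounds` limit are all assumptions chosen only for illustration.

```python
from typing import Callable, Dict

State = Dict[str, str]


def drift(desired: State, current: State) -> State:
    """Return the desired fields whose current value does not match."""
    return {k: v for k, v in desired.items() if current.get(k) != v}


def reconcile(desired: State,
              observe: Callable[[], State],
              apply_step: Callable[[str, str], None],
              max_rounds: int = 10) -> bool:
    """Repeat corrective steps until the observed state matches `desired`.

    Returns True if the states converged within `max_rounds`, else False.
    """
    for _ in range(max_rounds):
        pending = drift(desired, observe())
        if not pending:
            return True  # current state already equals the desired state
        for key, value in pending.items():
            apply_step(key, value)  # one corrective action per drifted field
    return not drift(desired, observe())


# Usage with an in-memory dictionary standing in for a real resource.
actual: State = {"phase": "Degraded", "replicas": "1"}
wanted: State = {"phase": "Ready", "replicas": "3"}
converged = reconcile(wanted, lambda: dict(actual), actual.__setitem__)
print(converged, actual)
```

A real controller would typically be driven by change events rather than a bounded loop; the bound here only keeps the sketch from running forever when a corrective step has no effect.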

In an embodiment of the present invention, a component that uses the controller pattern to extend IT resources and manage their state declaratively is called an operator.

Defining a custom description of the IT resource to be managed is called a "CR (Custom Resource) definition", and the desired state defined in this way is called a CR. This is described in more detail below.

FIG. 1 is a block diagram of a failed node removal device 100 according to an embodiment of the present invention.

The failed node removal device 100 may be configured to include an AZ cluster (Availability Zone cluster, 111), a controller cluster 112, an undercloud cluster 113, a shared cluster 114, and a Ring 0 cluster 115.

The components shown in FIG. 1 are not essential to implementing the failed node removal device 100, so the failed node removal device 100 described in this specification may have more or fewer components than those listed above.

The AZ cluster 111, also called a compute cluster, maps to an availability zone in the case of the OpenStack service. A plurality of AZ clusters 111 may be mapped to each controller cluster 112 described below. In the case of the OpenStack service, the AZ cluster 111 runs Nova-based VMs through OpenStack hypervisors.

The AZ cluster 111 may be divided by region into first to third cells 110-1 to 110-3, and each of the cells 110-1 to 110-3 may include at least one node.

The controller cluster 112 provides the services used to manage the cloud service. For example, in the case of an open-source service such as OpenStack, the API servers for managing the authentication service (keystone), the image service (glance), and the compute service (nova) are installed and run there.

The undercloud cluster 113 manages the metadata (networks, subnets, ports, and so on) of the undercloud network. In an embodiment of the present invention, it is proposed that the undercloud cluster 113 include a node removal pod 116 (node eliminate pod).

The node removal pod 116 is a pod that monitors at least one node included in the AZ cluster 111 and, when a failed node occurs, removes that node.

The internal configuration of the node removal pod 116 is described in more detail below with reference to the drawings.

The shared cluster 114 is a cluster that manages the common resources of each region.

The Ring 0 cluster 115 is a cluster that provisions the other clusters 111, 112, 113, and 114. It manages the resources that exist in the IDC, such as the undercloud and overcloud clusters, together with all metadata. The Ring 0 cluster 115 stores access authority information (AZ node kubeconfig) for at least one node included in the AZ cluster 111. That is, because the Ring 0 cluster 115 provisioned the AZ cluster 111, it also holds the access authority information for it.
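
How stored access authority information might be turned into a per-cluster client can be sketched as follows. This is not the specification's implementation: `AccessInfoStore` and `ClusterClient` are hypothetical stand-ins, and a real system would build an actual API client from the kubeconfig instead of the placeholder object used here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ClusterClient:
    """Stand-in for an API client bound to one AZ cluster (hypothetical)."""
    cluster: str
    kubeconfig: str


class AccessInfoStore:
    """Keeps the access authority information (kubeconfig text) per AZ cluster."""

    def __init__(self) -> None:
        self._kubeconfigs: Dict[str, str] = {}

    def save(self, cluster: str, kubeconfig: str) -> None:
        """Record the kubeconfig produced when the cluster was provisioned."""
        self._kubeconfigs[cluster] = kubeconfig

    def client_for(self, cluster: str) -> ClusterClient:
        """Build a client for `cluster`, or fail if nothing was stored for it."""
        if cluster not in self._kubeconfigs:
            raise LookupError(f"no access information stored for {cluster!r}")
        return ClusterClient(cluster=cluster, kubeconfig=self._kubeconfigs[cluster])


# Usage: the provisioning side saves, the monitoring side looks up.
store = AccessInfoStore()
store.save("az-1", "apiVersion: v1\nkind: Config\n")
client = store.client_for("az-1")
print(client.cluster)
```

Keeping the lookup behind a single `client_for` call means the monitoring side never handles raw credentials directly, which is one plausible reason to separate storage from monitoring.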

The specific configuration of the node removal pod 116 is described below with reference to FIG. 2.

FIG. 2 is a block diagram of the node removal pod 116 according to an embodiment of the present invention.

The node removal pod 116 may be configured to include a node removal controller 201, an AZ cluster controller 202, a node CR storage unit 203, a cluster CR storage unit 204, an access authority information storage unit 205, and a monitoring unit 206.

The components shown in FIG. 2 are not essential to implementing the node removal pod 116, so the node removal pod 116 described in this specification may have more or fewer components than those listed above.

The AZ cluster controller 202 collects information about the cluster or clusters to be managed and, based on the collected information, defines the desired state for them. The desired state is the state in which a given target (a cluster, a node, and so on) can operate normally; when a target deviates from its desired state, that target may be judged to have failed.

The desired state defined in this way is called a CR (Custom Resource). CRs may be divided into cluster CRs for at least one cluster and node CRs for at least one node.

That is, the AZ cluster controller 202 is a controller defined for the CRD (Custom Resource Definition) of the AZ cluster and is responsible for creating and managing CRs.

The cluster CRD defines the unique information and desired-state information of the AZ cluster (111) as a cluster specification, and defines the information indicating the current state of the AZ cluster (111) and the information used in reconciliation as status information.

The cluster specification may include at least one of access information for the Ring 0 cluster (115), access information for the AZ cluster (111), and information unique to the control plane (ControlPlane).

The cluster status information may include at least one of the cluster ready (Cluster Ready) state, the total number of hypervisors, and information on deleted hypervisors.
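
The split between cluster specification and cluster status information can be pictured with the minimal Python sketch below. The description lists only the categories of information, so every field name here (`ring0_access`, `az_cluster_access`, `control_plane_id`, and so on) is a hypothetical choice for illustration, not a schema taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterSpec:
    """Desired-state side of a cluster CR (field names are assumptions)."""
    ring0_access: str        # access information for the Ring 0 cluster
    az_cluster_access: str   # access information for the AZ cluster
    control_plane_id: str    # information unique to the control plane


@dataclass
class ClusterStatus:
    """Observed-state side of a cluster CR (field names are assumptions)."""
    cluster_ready: bool = False
    hypervisor_count: int = 0
    deleted_hypervisors: List[str] = field(default_factory=list)


@dataclass
class ClusterCR:
    name: str
    spec: ClusterSpec
    status: ClusterStatus = field(default_factory=ClusterStatus)


# Example: one CR for one AZ cluster, with the status filled in later.
cr = ClusterCR("az-1", ClusterSpec("ring0-kubeconfig", "az-kubeconfig", "cp-1"))
cr.status.hypervisor_count = 3
cr.status.deleted_hypervisors.append("hv-2")
```

Keeping the specification immutable in spirit and writing only to the status mirrors the usual controller convention in which the user owns the desired state and the controller owns the observed state.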

A cluster CR is information that defines the ideal state a specific cluster should have, and may be configured as an instance of the CRD defined above that holds the actual information of each cluster.

A cluster CR may be provided for each of the at least one cluster. That is, N cluster CRs may be defined for N clusters.

A node CR is information that defines the ideal state a specific node should have, and may be provided for each of the at least one node. That is, N node CRs may be defined for N nodes.

Reconciliation according to an embodiment of the present invention may be performed per CR.
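
As a rough illustration of one CR per node and of judging failure by deviation from the ideal state, the following Python sketch may help. The state names `nodeHealthy` and `nodeError` are the ones the description uses for the node removal controller; the dictionary layout and the function names are assumptions made only for this example.

```python
def build_node_crs(node_names):
    """Create one node CR (desired state) per node: N nodes give N CRs."""
    return {name: {"desired": "nodeHealthy"} for name in node_names}


def find_failed(node_crs, observed):
    """Return the nodes whose observed state deviates from the desired state.

    A node with no observation at all is also treated as deviating,
    which is a design choice of this sketch.
    """
    return sorted(
        name
        for name, cr in node_crs.items()
        if observed.get(name) != cr["desired"]
    )


crs = build_node_crs(["hv-1", "hv-2", "hv-3"])
failed = find_failed(
    crs,
    {"hv-1": "nodeHealthy", "hv-2": "nodeError", "hv-3": "nodeHealthy"},
)
```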

The node removal controller (201) is the component that removes failed nodes from among the nodes included in the AZ cluster (111).

The node removal controller (201) according to an embodiment of the present invention applies the Kubernetes controller pattern to an OpenStack hypervisor cluster.

The node removal controller (201) according to an embodiment of the present invention proposes defining each node to be managed as a CR (Custom Resource). When a failure occurs in that resource and it must be removed (desired state set by the user: nodeHealthy → actual state: nodeError), an operator is used to migrate the resources managed by OpenStack services (VMs, IPs, storage, networks, and so on) to other healthy nodes in the cluster, according to the node state. After the migration is complete, the node removal controller (201) can automatically remove the failed node.

That is, the node removal controller (201) uses the controller pattern and the contents defined in the CR to detect a node failure, migrate the virtualized resources that use the node (VM, IP, storage, and so on) to other nodes in the cluster, and remove the node.
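
The migrate-then-remove order described above can be sketched in Python as follows. This is a simplified in-memory model, not the embodiment's implementation: the `nodes` mapping, the function name `eliminate_node`, and the round-robin placement are all assumptions chosen for illustration.

```python
def eliminate_node(failed, nodes):
    """Move every resource off `failed` onto healthy nodes, then remove it.

    `nodes` maps a node name to {"state": str, "resources": list}.
    """
    healthy = [
        name
        for name, info in nodes.items()
        if name != failed and info["state"] == "nodeHealthy"
    ]
    if not healthy:
        raise RuntimeError("no healthy node available for migration")

    # Round-robin placement is an arbitrary choice for this sketch.
    for i, resource in enumerate(nodes[failed]["resources"]):
        nodes[healthy[i % len(healthy)]]["resources"].append(resource)

    # The node is removed only after all of its resources have been migrated.
    del nodes[failed]
    return nodes


cluster = {
    "hv-1": {"state": "nodeHealthy", "resources": ["vm-a"]},
    "hv-2": {"state": "nodeError", "resources": ["vm-b", "ip-1", "vol-1"]},
    "hv-3": {"state": "nodeHealthy", "resources": []},
}
eliminate_node("hv-2", cluster)
```

Refusing to proceed when no healthy node exists keeps the failed node's resources in place rather than dropping them, which is in line with the stated aim of minimizing downtime.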

The node removal controller (201) according to an embodiment of the present invention migrates virtualized resources in a way that minimizes downtime, which has the advantage of improving the service level achievable under an SLA (Service-Level Agreement).

In general, when open-source OpenStack is used, nodes are removed by a procedural method.

This procedural method is described in more detail with reference to the flowchart of FIG. 3.

FIG. 3 is a diagram illustrating a node removal procedure using open-source OpenStack according to an embodiment of the present invention.

Unlike the declarative method described above, the procedural method is carried out by a user (administrator) who passes script-based commands (through an API or the like) to each of the various microservices provided by OpenStack. The user must not only enter the commands by hand but also carry out a multi-step sequence of commands in the correct order. Moreover, if the procedure fails partway through, it is difficult to determine how to roll back or re-run it.

First, the user prepares the access authority information for the failed node (S301). This access authority information may be obtained from the Ring 0 cluster (115) described above.

The user then accesses the compute node on which the failure occurred (S302).

The failed node is unscheduled so that no further VMs (Virtual Machines) or ports are added to it (S303).

Next, the migration procedure for the VMs and ports is performed (S304), and the Kubernetes label of the failed node is removed (S305).

The OpenStack compute service and network agent are removed (S306), the pods are drained (S307), and finally the failed node can be removed (S308).

If the order of the steps above is changed or any step is omitted, the failed node may not be removed completely, or, even if it is removed, other kinds of errors may occur.
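
The order dependence of steps S301 to S308 can be made concrete with a small Python sketch that runs the steps strictly in sequence and reports where the procedure stopped. The step identifiers come from FIG. 3 as described above; the runner itself, including the name `run_procedure` and the convention that each action returns `True` on success, is an assumption for illustration.

```python
STEPS = [
    ("S301", "prepare access authority information"),
    ("S302", "access the compute node"),
    ("S303", "unschedule the node"),
    ("S304", "migrate VMs and ports"),
    ("S305", "remove the Kubernetes label"),
    ("S306", "remove the compute service and network agent"),
    ("S307", "drain pods"),
    ("S308", "remove the node"),
]


def run_procedure(actions):
    """Run the steps strictly in order and stop at the first failure.

    `actions` maps a step id to a callable returning True on success.
    A missing action counts as a failure, because no step may be skipped.
    Returns (completed_step_ids, failed_step_id_or_None).
    """
    done = []
    for step_id, _description in STEPS:
        action = actions.get(step_id)
        if action is None or not action():
            return done, step_id
        done.append(step_id)
    return done, None


ok = {step_id: (lambda: True) for step_id, _ in STEPS}
completed, failed_at = run_procedure(ok)
partial, stopped_at = run_procedure({**ok, "S304": lambda: False})
```

Even with such a runner, deciding how to roll back after `stopped_at` is left to the operator, which is the difficulty the description attributes to the procedural method.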

Moreover, because a person must always handle errors and other issues when they arise, considerable resources are required, and the administrators in charge must understand the entire process above. Because handling depends on the user, failures are difficult to classify into patterns, and the overall failure-handling process is difficult to automate and manage declaratively.

To solve these problems, the node removal pod (116) according to an embodiment of the present invention includes the node removal controller (201), which enables automated and declarative management of OpenStack hypervisors.

With this automated and declarative management, the administrator's work can be reduced to the following procedure.

(1) Deploy the node removal pod (116) to the shared cluster (114)

(2) Change the label of the failed node

(3) The operator removes the node automatically

That is, according to an embodiment of the present invention, applying an operator to OpenStack hypervisors enables cloud-native management. Node failure issues can be classified into patterns, and a variety of error issues can be collected and responded to.

The hypervisors that an embodiment of the present invention can handle include not only the compute nodes described above but also various other targets such as DHCP nodes.

Referring again to FIG. 2, the cluster CR storage unit (204) stores at least one cluster CR defined by the AZ cluster controller (202).

The node CR storage unit (203) stores at least one node CR defined by the AZ cluster controller (202).

The access authority information storage unit (205) requests from the Ring 0 cluster (115) the access authority information (AZ node kubeconfig) for at least one node included in the AZ cluster (111), and stores the access authority information returned in response. In view of the security of the undercloud cluster (113), the access authority information storage unit (205) may store this information using a stronger security scheme. One example of such a scheme is the "Sealed-secret" method from the Kubernetes open-source ecosystem.

If the access information is not encrypted with the "Sealed-secret" method and is leaked through an external intrusion, an outside party may be able to access the cluster directly using the leaked access authority information.

When the access authority information is encrypted as described above, however, it cannot be decrypted without the key used for encryption. Accordingly, even if the access authority information is leaked through an external intrusion, it cannot be used to access the cluster directly.

As a result, even if authority information is leaked through an external intrusion, access to the other clusters can still be prevented.

The monitoring unit (206) continuously monitors the IT resources of the AZ cluster (111).

FIG. 4 is a control flowchart in which the AZ cluster controller (202) defines CRs according to an embodiment of the present invention.

First, the AZ cluster controller (202) obtains a cluster CR based on a predefined CRD (Custom Resource Definition) (S401). It then performs the reconciliation described above based on the obtained cluster CR (S402).

The AZ cluster controller (202) obtains access authority information from the Ring 0 cluster (115) (S403) and stores it (S404). The storing may use the "Sealed-secret" method described above.

Next, the AZ cluster controller (202) obtains information on at least one node (S405), and creates and stores a node CR based on the obtained node information (S406).

In step S407, the AZ cluster controller (202) monitors (tracks) the at least one node and determines whether an event occurs (S408). If no event occurs, step S407 may be repeated.

An event may mean a new node being added or an existing node being deleted. In particular, an event according to an embodiment of the present invention is an event that can be defined in open-source Kubernetes, and various kinds of operations may be performed depending on whether the event is triggered.

If it is determined that an event has occurred, the process proceeds to step S409 and reconciliation is performed.

The CRs (node CRs and cluster CRs) may then be updated to reflect the changes made by the reconciliation (S410). Here, updating may include creating a new CR for a newly added node, removing the CR of a deleted node, or reflecting changed cluster information in the cluster CR.
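
The update in steps S409 and S410 might look like the following Python sketch, which applies a node-added or node-deleted event to an in-memory CR store. The event shape (`{"type": ..., "node": ...}`), the event type names `ADDED` and `DELETED`, and the cluster CR keys are assumptions for this example; the description does not fix a concrete format.

```python
def apply_node_event(node_crs, cluster_cr, event):
    """Update node CRs and the cluster CR for one node event (S409 to S410)."""
    kind, name = event["type"], event["node"]
    if kind == "ADDED":
        # A newly added node gets a fresh node CR with a healthy desired state.
        node_crs.setdefault(name, {"desired": "nodeHealthy"})
    elif kind == "DELETED":
        # A deleted node loses its CR and is recorded in the cluster status.
        if node_crs.pop(name, None) is not None:
            cluster_cr["deleted_hypervisors"].append(name)
    else:
        raise ValueError(f"unsupported event type: {kind}")
    # Reflect the changed cluster information in the cluster CR.
    cluster_cr["hypervisor_count"] = len(node_crs)
    return node_crs, cluster_cr


node_crs = {"hv-1": {"desired": "nodeHealthy"}}
cluster_cr = {"hypervisor_count": 1, "deleted_hypervisors": []}
apply_node_event(node_crs, cluster_cr, {"type": "ADDED", "node": "hv-2"})
apply_node_event(node_crs, cluster_cr, {"type": "DELETED", "node": "hv-1"})
```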

The specific sequence in which a node is removed by the node removal pod (116) is described below with reference to FIG. 5.

FIG. 5 is a flowchart of node removal by the node removal pod (116) according to an embodiment of the present invention.

FIG. 6 is a conceptual diagram of the node removal pod (116) removing a node according to an embodiment of the present invention.

The following description refers to FIG. 5 and FIG. 6 together.

As described above, the node removal controller (201) is a controller for controlling the resources of the node CRD (or node elimination CRD) created by the AZ cluster controller; it migrates OpenStack-based IT resources without service interruption and removes the failed node.

The node CRD according to an embodiment of the present invention may be configured to include a node elimination specification and node elimination status information.

The node elimination specification may be configured to include at least one of the name of the cluster that contains the node (Cluster name), access authority information for OpenStack Nova and Neutron, and information on the node itself (UID, InternalIP).

The node elimination status information may be configured to include at least one of the elimination phase (EliminatePhase), information on errors that occurred during the operation, information on the removed VMs (Virtual Machines), ports, and pods, and node information that can change (Node Label, Annotation, Conditions, Ready).

Based on the node CRD described above, as many node CRs may be created as there are nodes.
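
A possible in-memory shape for the node elimination specification and status is sketched below in Python. Only the categories of information come from the description; the Python field names, the initial phase name `"Pending"`, and the helper `node_crs_for` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NodeEliminateSpec:
    cluster_name: str     # name of the cluster that contains the node
    nova_access: str      # access authority information for OpenStack Nova
    neutron_access: str   # access authority information for OpenStack Neutron
    uid: str              # node UID
    internal_ip: str      # node InternalIP


@dataclass
class NodeEliminateStatus:
    eliminate_phase: str = "Pending"   # hypothetical initial phase name
    errors: List[str] = field(default_factory=list)
    removed_vms: List[str] = field(default_factory=list)
    removed_ports: List[str] = field(default_factory=list)
    removed_pods: List[str] = field(default_factory=list)
    labels: Dict[str, str] = field(default_factory=dict)
    ready: bool = True


@dataclass
class NodeCR:
    spec: NodeEliminateSpec
    status: NodeEliminateStatus = field(default_factory=NodeEliminateStatus)


def node_crs_for(cluster_name: str, nodes: List[Tuple[str, str]],
                 nova_access: str, neutron_access: str) -> List[NodeCR]:
    """Create one node CR per (uid, internal_ip) pair."""
    return [
        NodeCR(NodeEliminateSpec(cluster_name, nova_access, neutron_access, uid, ip))
        for uid, ip in nodes
    ]


node_cr_list = node_crs_for(
    "az-1", [("uid-1", "10.0.0.1"), ("uid-2", "10.0.0.2")], "nova-cred", "neutron-cred"
)
node_cr_list[0].status.removed_vms.append("vm-a")
```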

First, in step S501, the node removal controller (201) obtains the node CR stored in the node CR storage unit (203). It then performs the reconciliation described above using the obtained node CR (S502). If the AZ cluster controller (202) has not yet created the node CR, the request may be requeued until the node CR has been created.
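
The requeue behaviour in S501 and S502 can be sketched in Python as a reconcile function that asks to be called again while the node CR is missing. The `Result` type and its `requeue` flag are assumptions loosely modelled on the common controller convention; they are not names taken from the embodiment.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    requeue: bool
    node_cr: Optional[dict] = None


def reconcile(node_name, node_cr_store):
    """Fetch the node CR, or ask to be requeued if it does not exist yet."""
    cr = node_cr_store.get(node_name)
    if cr is None:
        # The AZ cluster controller has not created the node CR yet.
        return Result(requeue=True)
    return Result(requeue=False, node_cr=cr)


store = {}
first = reconcile("hv-1", store)          # CR missing: requeue requested
store["hv-1"] = {"desired": "nodeHealthy"}
second = reconcile("hv-1", store)         # CR present: reconciliation proceeds
```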

Next, the node removal controller (201) obtains the access authority information stored in the access authority information storage unit (205) (S503).

Using the obtained access authority information, the monitoring unit (206) obtains a client for the AZ cluster (111) and then monitors (tracks) the labels of the nodes (S504). Here, a client may mean the authority to handle resources within the AZ cluster (111). Monitoring according to an embodiment of the present invention may use the informer of open-source Kubernetes.

It is then determined whether a monitored node carries the label "node-eliminate=enabled" (S505); if the label has been changed to this value, the operation to remove the node may be performed (S506). That is, a node carrying this label is judged to be a failed node. The determination in step S505 according to an embodiment of the present invention may be made according to whether the label is changed to "node-eliminate=enabled".
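
The determination in S505 can be illustrated with a short Python sketch that compares a node's labels before and after an update. The label `node-eliminate=enabled` is taken from the description; the function name and the idea of receiving old and new label dictionaries (as an update handler typically would) are assumptions for this example.

```python
ELIMINATE_KEY = "node-eliminate"
ELIMINATE_VALUE = "enabled"


def should_eliminate(old_labels, new_labels):
    """Return True only when the label has just changed to node-eliminate=enabled."""
    was_set = old_labels.get(ELIMINATE_KEY) == ELIMINATE_VALUE
    is_set = new_labels.get(ELIMINATE_KEY) == ELIMINATE_VALUE
    return is_set and not was_set


newly_marked = should_eliminate({"role": "compute"},
                                {"role": "compute", "node-eliminate": "enabled"})
already_marked = should_eliminate({"node-eliminate": "enabled"},
                                  {"node-eliminate": "enabled"})
```

Reacting to the transition rather than to the mere presence of the label avoids starting the removal operation a second time for a node that is already being eliminated.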

Meanwhile, it is proposed that the monitoring unit (206) according to an embodiment of the present invention further include a node problem detector for identifying node failures at the log level. The node problem detector may be provided in open-source form; by identifying failures from log information, it can identify failed nodes quickly and accurately without user involvement.
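
Log-based failure identification can be pictured with the toy Python sketch below. It is not the open-source detector itself: the rule names and regular expressions are hypothetical examples, and a real deployment would use rules tuned to the kernel and hypervisor logs of its environment.

```python
import re

# Hypothetical rules mapping a problem name to a log pattern.
PROBLEM_PATTERNS = {
    "KernelHang": re.compile(r"task \S+ blocked for more than \d+ seconds"),
    "ReadonlyFilesystem": re.compile(r"Remounting filesystem read-only"),
}


def detect_problems(log_lines):
    """Return the names of the problems whose pattern appears in the logs."""
    found = []
    for line in log_lines:
        for name, pattern in PROBLEM_PATTERNS.items():
            if name not in found and pattern.search(line):
                found.append(name)
    return found


problems = detect_problems([
    "INFO: task nova-compute:1234 blocked for more than 120 seconds.",
    "EXT4-fs (sda1): Remounting filesystem read-only",
    "systemd[1]: Started Session 12 of user root.",
])
```

A non-empty result could then be used to change the node's label automatically instead of waiting for an administrator to do so.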

Such a label change may be made by a label change input from an administrator (user) received through a label input unit (not shown), but the node removal device (100) may also detect the failure and change the label automatically. The present invention is not limited by the way in which the label is changed.

At this time, the failed node removal device (100) according to an embodiment of the present invention can move the OpenStack resources residing on the failed node to another healthy node. That is, the role of the failed node is transferred to another healthy node so that the service can continue to be provided.

A compute node may be removed following the flowchart of FIG. 3 described above, but for other kinds of nodes (for example, the DHCP node mentioned above) the order or procedure of removal may differ.

When the removal is complete, the node CR may be updated with the current state reflecting the removed failed node (S507).

FIG. 7 is a diagram illustrating the configuration of the failed node removal device (100) according to an embodiment.

Referring to FIG. 7, the failed node removal device (100) includes a processor (701) and a memory (702). The memory (702) stores one or more instructions executable by the processor (701). The processor (701) executes the one or more instructions stored in the memory (702).

By executing the instructions, the processor (701) may perform one or more of the operations described above with reference to FIGS. 1 to 6. The components of the present invention described above with reference to FIG. 1 may also be implemented by instructions executed by the processor (701).

An embodiment of the failed node removal device (100) according to the present invention has been described above. It is described as at least one embodiment and does not limit the technical idea, configuration, or operation of the present invention; the scope of the technical idea of the present invention is not limited by the drawings or by the description that refers to them.

In addition, the concepts and embodiments presented here may be used by a person of ordinary skill in the art to which the present invention pertains as a basis for modifying or designing other structures that serve the same purpose. Equivalent structures so modified or changed remain within the technical scope of the present invention as described in the claims, and various changes, substitutions, and alterations are possible without departing from the spirit or scope of the invention described in the claims.

Claims

1. A node removal device comprising:
a monitoring unit that monitors a current state of at least one node;
a node CR storage unit that stores a node CR (Custom Resource), which is an ideal state (desired state) of the node; and
a node removal controller that identifies a failed node among the at least one node based on a result of the monitoring and the node CR, and performs processing on the identified failed node.

2. The node removal device of claim 1,
wherein the failed node is a node whose monitored current state differs from the stored ideal state.

3. The node removal device of claim 1,
wherein the node removal controller performs the processing on the identified failed node in a declarative manner.

4. The node removal device of claim 3,
wherein the node removal controller performs reconciliation and removes the identified failed node.

5. The node removal device of claim 1,
further comprising an access authority storage unit that stores access authority information for accessing the at least one node.

6. The node removal device of claim 5,
wherein the monitoring unit obtains a client for the at least one node using the stored access authority information.

7. The node removal device of claim 1,
further comprising an AZ (Availability Zone) cluster controller that defines the node CR for the at least one node.

8. The node removal device of claim 7,
wherein the node CR stored in the node CR storage unit is the node CR defined by the AZ cluster controller.

9. The node removal device of claim 1,
further comprising a label input unit that receives a label change input from a user,
wherein the monitoring unit identifies, as the failed node, a node whose label has been changed based on the label change input.

10. A control method of a node removal device, the method comprising:
monitoring a current state of at least one node;
storing a node CR (Custom Resource), which is an ideal state (desired state) of the node; and
identifying a failed node among the at least one node based on a result of the monitoring and the node CR, and performing processing on the identified failed node.

11. The control method of claim 10,
wherein the failed node is a node whose monitored current state differs from the stored ideal state.

12. The control method of claim 10,
wherein a node removal controller performs the processing on the identified failed node in a declarative manner.

13. The control method of claim 12,
wherein performing the processing comprises performing reconciliation and removing the identified failed node.

14. The control method of claim 10,
further comprising storing, by an access authority storage unit, access authority information for accessing the at least one node.

15. The control method of claim 14,
wherein the monitoring comprises obtaining a client for the at least one node using the stored access authority information.

16. The control method of claim 10,
further comprising defining, by an AZ (Availability Zone) cluster controller, the node CR for the at least one node.

17. The control method of claim 16,
wherein the node CR stored in a node CR storage unit is the node CR defined by the AZ cluster controller.

18. The control method of claim 10,
further comprising receiving, by a label input unit, a label change input from a user,
wherein the monitoring comprises identifying, as the failed node, a node whose label has been changed based on the label change input.