KR101512647B1

KR101512647B1 - Method For Choosing Query Processing Engine

Info

Publication number: KR101512647B1
Application number: KR1020130124835A
Authority: KR
Inventors: 이재영; 박성열; 안성화; 박근태; 최승운
Original assignee: 에스케이 텔레콤주식회사
Priority date: 2013-10-18
Filing date: 2013-10-18
Publication date: 2015-04-16
Anticipated expiration: 2033-10-18

Abstract

The present invention discloses a method for selecting a query processing engine, used in a data processing system that includes multiple query processing engines. The method includes: a query reception step; a query processing engine evaluation step of evaluating the query processing engines that are to process the query; a query transmission step of transmitting the query to the query processing engine selected on the basis of the evaluation result obtained in the evaluation step; and a query processing step in which the selected query processing engine processes the query.

Description

A method for selecting a query processing engine {Method For Choosing Query Processing Engine}

The present invention relates to a method for selecting a query processing engine to process a query in a data processing system that includes a plurality of query processing engines.

The following description merely provides background information related to the present embodiment and does not constitute prior art.

As the use of personal computers (PCs), mobile devices, and the Internet has become routine, the amount of data that IT service providers must process is growing exponentially. User-created content (UCC) and social network service (SNS) data differ from earlier data not only in growth rate but also in form and quality. Such diverse and vast data can therefore serve as an important factor determining the future competitiveness of companies and nations. Attempts to analyze large-scale data and find meaningful information were made in the past as well, but today's big data environment is beyond comparison with the past in terms of data volume and diversity.

Hadoop, a recently introduced big data processing system, has been developed on the basis of Google's GFS (Google File System) to process a wide variety of large-scale unstructured data, such as HTML and text, in the Internet environment. Hadoop includes HDFS (Hadoop Distributed File System) and engines that process, on HDFS, queries similar to the SQL (Structured Query Language) used in relational databases. It is common for a query submitted to a big data processing system to take several days or more to return a result. Various engines for processing queries on HDFS have been developed, such as Tajo, Impala, Hive, and MapReduce, but each query processing engine has its own characteristics, so even for the same query the execution time and the resources required to obtain a result differ greatly from engine to engine. How to select a suitable query processing engine for a given query is therefore a problem.

The main object of the present embodiment is to provide a method of selecting a query processing engine when a data processing system uses a plurality of query processing engines.

According to one aspect of the present embodiment, there is provided a query processing engine selection method in a data processing system including a plurality of query processing engines, the method comprising: receiving a query; a query processing engine evaluation process of evaluating the plurality of query processing engines that are to perform the query; transmitting the query to the query processing engine selected as a result of the evaluation in the query processing engine evaluation process; and a query execution process in which the selected query processing engine performs the query.

According to another aspect of the present embodiment, there is provided a query processing engine selection apparatus comprising: a query processing engine selection unit that receives a query and selects one of a plurality of query processing engines; a query history log that stores history information on the query; a dynamic resource allocation unit that allocates resources for performing the query; and a data storage unit including a plurality of data nodes.

As described above, according to the present embodiment, it is possible to provide a method of selecting the most efficient engine among a plurality of query processing engines for processing big data by evaluating (i) history information on previously executed queries that are the same as or similar to the received query, or that are evaluated to be the same as the received query when a plurality of queries are combined, (ii) the system status at the time the query request is received, and (iii) each query processing engine's execution plan for the query. Because the time required to process big data varies greatly from one query processing engine to another, selecting an engine suited to the query is an important problem; according to the embodiment of the present invention, an optimal query processing engine can be selected.

If the received query is identical to a previously executed query remaining in the query history log and the available system status is also the same, a history-based evaluation is performed. If there is no history information at all, an execution plan (explain plan) for the query is obtained from each query processing engine and the execution plan is evaluated. If the system status at the time the query is received has changed, or if it is difficult to determine the optimal engine by history-based evaluation alone, a hybrid evaluation that weights the history-based evaluation and the execution-plan-based evaluation can be performed. Selecting the optimal query processing engine through such evaluation can shorten the time needed to perform the query as much as possible.

In addition, according to the present embodiment, the dynamic resource allocation unit (dynamic resource allocator) allocates the maximum available resources at the time resources are requested, but if a query request with a higher priority than the running query is received while that query is being performed, it can readjust the resources used to perform the query. The dynamic resource allocation unit can reclaim resources when their utilization is low and can specify a maximum or minimum range of resources allocated to performing a query.

FIG. 1 is a configuration diagram of HDFS (Hadoop Distributed File System).
FIG. 2 is a configuration diagram of a big data processing system including a plurality of query processing engines according to the present embodiment.
FIG. 3 is a flowchart of a process of selecting a query processing engine according to the present embodiment.
FIG. 4 is a flowchart of a process of dynamically allocating resources according to the present embodiment.

Hereinafter, the present embodiment will be described in detail with reference to the accompanying drawings.

Although the present embodiment is described on the basis of Hadoop and HDFS (Hadoop Distributed File System), the structure for storing data is not limited thereto. Various systems for processing big data exist, such as GFS (Google File System) and MapReduce, and the technical idea of the present invention is not limited to any particular big data processing system.

FIG. 1 is a configuration diagram of HDFS (Hadoop Distributed File System).

HDFS, the Hadoop Distributed File System, is a technology that divides the large volume of data collected for big data processing and stores it across multiple servers. HDFS consists of a name node (NameNode) 110 and data nodes 120. The name node 110 stores the metadata of the actual files stored on the data nodes (DataNode); it is not where the actual data is stored. The name node 110 consists of a name node (master) 112 and a name node (secondary) 114. The name node (secondary) 114 is used in place of the name node (master) 112 when the name node (master) 112 fails, or is used to recover the name node (master) 112.

Data node 01 (121), data node 02 (122), data node 03 (123), data node 04 (124), and data node 05 (125), which make up the data nodes 120, are the spaces in which the actual data is stored, and are servers or storage devices connected by a network. The name node 110 holds information on the files stored on the data nodes 120 and on which data node 120 each file is actually stored. When an application or a user wants to access a file, the name node 110 is consulted to find the data node 120 on which the file is stored, and the file is then accessed there.
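The division of labor between the name node and the data nodes can be illustrated with a small sketch. It is not HDFS code: the classes `NameNode` and `DataNode` and the function `read_file` are hypothetical names chosen for illustration, and block splitting and replication policy are left out.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy name node: holds only file-to-data-node metadata, never file contents."""
    file_locations: dict[str, list[str]] = field(default_factory=dict)

    def register(self, path: str, data_nodes: list[str]) -> None:
        # Record which data nodes hold the file.
        self.file_locations[path] = list(data_nodes)

    def locate(self, path: str) -> list[str]:
        # Raises KeyError if the file is unknown to the name node.
        return self.file_locations[path]


@dataclass
class DataNode:
    """Toy data node: the place where the actual bytes live."""
    name: str
    files: dict[str, bytes] = field(default_factory=dict)


def read_file(name_node: NameNode, data_nodes: dict[str, DataNode], path: str) -> bytes:
    """Ask the name node where the file is, then read it from a data node."""
    for node_name in name_node.locate(path):
        node = data_nodes.get(node_name)
        if node is not None and path in node.files:
            return node.files[path]
    raise FileNotFoundError(path)
```

A client never reads file contents from the name node; it only learns where to look, which is the access pattern described above.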

FIG. 2 is a configuration diagram of a big data processing system including a plurality of query processing engines according to the present embodiment.

When a client such as a user or an application submits a query request to the big data processing system 200, the query processing engine selection unit 210 of the big data processing system 200 receives the query request, requests and receives from the dynamic resource allocation unit 220 resource information including the memory usage and CPU utilization of the Hadoop cluster, and obtains history information on the received query from the query history log 230. The resource information is not, however, limited to memory usage and CPU utilization. The query history log 230 holds the query, the query execution time, and resource information including the amount of memory used and the CPU utilization during query execution. The query processing engine selection unit 210 evaluates the query history information, the resource status of the big data processing system, and each query processing engine's execution plan for the query, and selects the optimal query processing engine from the query processing engine unit 240. How the query processing engine selection unit 210 selects the optimal engine among the plurality of query processing engines included in the query processing engine unit 240 is described in detail with reference to FIG. 3.

Query processing engines based on a Hadoop cluster are varied and include Tajo 242, Impala 244, Hive 246, MapReduce, HBase, and Pig. However, each engine has its own distinctive characteristics, so the execution time of a particular query differs greatly depending on the engine. In a big data processing system, it commonly takes several days to process data in batch and produce a result. Selecting a query processing engine suited to the query is therefore an important problem.

The dynamic resource allocation unit 220 monitors the system resource information of the Hadoop cluster and provides it when requested by the query processing engine selection unit 210, and it allocates system resources to the query processing engine that is to perform the query so that the query can be performed. It can also reallocate resources if a new query request arrives while a query is being performed. The method of allocating system resources is described in detail with reference to FIG. 4.

The query processing engine selected to perform the query is allocated resources by the dynamic resource allocation unit 220, processes the data of the Hadoop cluster 250, which stores the actual data, to generate a query result, and passes the result to the query processing engine selection unit 210, which in turn passes it to the client. While the query is being executed, the dynamic resource allocation unit 220 receives reports of the system's resource information from the Hadoop cluster 250. The Hadoop cluster 250 may consist of the name node 110 and the data nodes 120 that make up the HDFS of FIG. 1.

When the query processing engine finishes performing the query, it stores the query, the query execution time, and resource information including the memory usage and CPU utilization during query execution in the query history log 230.
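One way to picture a history entry is as a small record appended to an in-memory log. This is only a sketch: `QueryHistoryRecord` and `QueryHistoryLog` are hypothetical names, the `engine` field is an assumption added so that records from different engines can be compared (the text lists only the query, execution time, memory, and CPU figures), and exact string matching stands in for whatever query-equality test a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryHistoryRecord:
    query: str                # query text as submitted
    engine: str               # assumed field: which engine ran the query
    execution_seconds: float  # query execution time
    memory_bytes: int         # memory used while executing the query
    cpu_utilization: float    # CPU utilization as a fraction, 0.0 to 1.0


class QueryHistoryLog:
    """Append-only, in-memory stand-in for the query history log (230)."""

    def __init__(self) -> None:
        self._records: list[QueryHistoryRecord] = []

    def append(self, record: QueryHistoryRecord) -> None:
        self._records.append(record)

    def find_exact(self, query: str) -> list[QueryHistoryRecord]:
        # Exact text match only; similarity matching is out of scope here.
        return [r for r in self._records if r.query == query]
```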

The big data processing system 200 according to the embodiment of the present invention may include a user terminal such as a personal computer (PC), notebook computer, tablet, personal digital assistant (PDA), game console, portable multimedia player (PMP), PlayStation Portable (PSP), wireless communication terminal, smartphone, TV, or media player, and the user terminal may be a part of the big data processing system 200. The big data processing system 200 according to the embodiment of the present invention may be a server terminal such as an application server or a service server. The big data processing system 200 according to the embodiment of the present invention may refer to various devices each having (i) a communication device such as a communication modem for communicating with various devices or wired/wireless communication networks, (ii) a memory for storing data for executing programs, and (iii) a microprocessor for executing programs to perform computation and control. According to at least one embodiment, the memory may be a computer-readable recording/storage medium such as random access memory (RAM), read-only memory (ROM), flash memory, an optical disk, a magnetic disk, or a solid state disk (SSD). According to at least one embodiment, the microprocessor may be programmed to selectively perform one or more of the operations and functions described in the specification. According to at least one embodiment, the microprocessor may be implemented, in whole or in part, as hardware such as an application-specific integrated circuit (ASIC) of a particular configuration.

FIG. 3 is a flowchart of a process of selecting a query processing engine according to the present embodiment.

FIG. 3 illustrates the process by which, when a client submits a query to the big data processing system 200, the big data processing system 200 that received the query selects a query processing engine suited to processing it. The query processing engine selection unit 210 receives the query (S310) and obtains history information from the query history log (S320): if a query identical to the received query has been performed in the past, the history information for that query; otherwise, history information for a similar query that can be evaluated as the same as the received query, or for a combination of queries that is the same as, or evaluated to be the same as, the received query. The history information is information on the query, the query execution time, and resources including the amount of memory used and the CPU utilization during query execution, but is not limited to the listed items. The query processing engine selection unit 210 receives from the dynamic resource allocation unit 220 information on the resources available for performing the query.

If no query that is identical to the received query, similar to it, or evaluated to be the same as it when a plurality of queries are combined has been performed in the past, a history-based evaluation cannot be performed; it is therefore determined whether history exists (S340).

If no history exists and a history-based evaluation therefore cannot be performed, the query processing engine selection unit 210 passes the query to each query processing engine of the query processing engine unit 240 and requests and receives an execution plan (explain plan or execution plan) for the query (S350). The query processing engine selection unit 210 performs an execution-plan-based evaluation that assesses, in each engine's execution plan, the number of execution steps needed to perform the query and the number of operations that can be processed in a distributed manner (S355), and selects the query processing engine with few execution steps and many distributed operations.
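The execution-plan-based evaluation (S355) can be sketched as follows. The text says only that fewer execution steps and more distributable operations are preferred; the ratio used in `plan_score` is an assumption made for illustration, and `ExecutionPlan`, `plan_score`, and `select_by_plan` are hypothetical names. Parsing a real engine's explain output into these two counts is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionPlan:
    engine: str
    num_steps: int              # execution steps needed to perform the query
    num_distributable_ops: int  # operations that can be processed in a distributed manner


def plan_score(plan: ExecutionPlan) -> float:
    """Higher is better. Assumed formula: distributable operations per step."""
    return plan.num_distributable_ops / max(plan.num_steps, 1)


def select_by_plan(plans: list[ExecutionPlan]) -> str:
    """Return the engine whose plan has few steps and many distributable operations."""
    return max(plans, key=plan_score).engine
```

With plans of (4 steps, 3 distributable), (9 steps, 3 distributable), and (5 steps, 2 distributable), this sketch prefers the first engine, since it offers the most parallelism per step.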

If the query history log contains history information on the received query so that a history-based evaluation is possible, it is determined whether the optimal query processing engine can be selected by history-based evaluation alone or whether that is insufficient and a hybrid evaluation that also includes execution plan evaluation should be performed (S360). If a query identical to the received query has already been performed and remains in the query history log 230, and the available resource information is also the same or can be evaluated as the same, there is no need to evaluate the execution plan again, because in the initial state, when there was no history log for the query, the evaluation was made on the basis of the execution plan. In this case, therefore, a history-based evaluation is performed (S370).

However, even if a log record exists for the same query, history-based evaluation alone is insufficient for selecting the optimal query processing engine when the state of the available resources has changed, or when there is no record for the same query but there is a record of a similar query or a history log for a combination of queries that can be evaluated as the same as the received query. In such cases a hybrid evaluation is performed (S380). The hybrid evaluation (S380) weights and evaluates the query execution time from the history-based evaluation together with the number of query execution steps and the number of operations in the query that can be processed in a distributed manner from the execution-plan-based evaluation.
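A weighted hybrid evaluation (S380) might look like the sketch below. The text does not give the weights or how the three quantities are normalized, so the max-normalization, the default weights, and the names `EngineCandidate`, `hybrid_cost`, and `hybrid_select` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineCandidate:
    engine: str
    past_seconds: float         # execution time from the history-based evaluation
    num_steps: int              # execution steps from the execution plan
    num_distributable_ops: int  # distributable operations from the execution plan


def hybrid_cost(c: EngineCandidate, max_time: float, max_steps: int, max_dist: int,
                w_time: float, w_steps: float, w_dist: float) -> float:
    """Lower is better: time and steps add cost, distributable operations reduce it."""
    return (w_time * c.past_seconds / max_time
            + w_steps * c.num_steps / max_steps
            - w_dist * c.num_distributable_ops / max_dist)


def hybrid_select(candidates: list[EngineCandidate],
                  w_time: float = 0.5, w_steps: float = 0.25, w_dist: float = 0.25) -> str:
    # Normalize each quantity by its maximum over the candidates (assumed scheme);
    # "or 1" avoids division by zero when every candidate reports zero.
    max_time = max(c.past_seconds for c in candidates) or 1.0
    max_steps = max(c.num_steps for c in candidates) or 1
    max_dist = max(c.num_distributable_ops for c in candidates) or 1
    best = min(candidates, key=lambda c: hybrid_cost(
        c, max_time, max_steps, max_dist, w_time, w_steps, w_dist))
    return best.engine
```

Setting `w_steps` and `w_dist` to zero reduces this to a purely history-based ranking by past execution time, and setting `w_time` to zero reduces it to a plan-only ranking, which is one way to see the hybrid evaluation as a blend of the other two.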

The optimal query processing engine among the plurality of query processing engines is selected through the history-based evaluation, the execution-plan-based evaluation, or the hybrid evaluation (S390).
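The branching of FIG. 3 between the three evaluation types can be condensed into a small decision function. This is a sketch of the flow as described in the text; the three boolean inputs and the names `EvaluationMode` and `choose_mode` are hypothetical, and how a system decides that a query is "similar" or that resources are "unchanged" is not specified here.

```python
from enum import Enum


class EvaluationMode(Enum):
    HISTORY = "history"  # S370
    PLAN = "plan"        # S350, S355
    HYBRID = "hybrid"    # S380


def choose_mode(has_exact_match: bool, has_similar_match: bool,
                resources_unchanged: bool) -> EvaluationMode:
    """Pick the evaluation type following the S340 and S360 decisions."""
    if not has_exact_match and not has_similar_match:
        # No usable history at all: fall back to execution plans.
        return EvaluationMode.PLAN
    if has_exact_match and resources_unchanged:
        # Same query, same available resources: history alone suffices.
        return EvaluationMode.HISTORY
    # Same query but changed resources, or only similar/combined queries in the log.
    return EvaluationMode.HYBRID
```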

FIG. 4 is a flowchart of a process of dynamically allocating resources according to the present embodiment.

When the query processing engine selection unit 210 selects, from the query processing engine unit 240, the query processing engine that is to perform the query, that query processing engine receives the query (S410) and requests resources for performing the query from the dynamic resource allocation unit 220 (S420).

Meanwhile, among the data nodes 120 of FIG. 1, each data node (data node 01 (121), ..., data node 05 (125)) includes as many preemption processes (preemptive resource occupiers) as the number of CPUs it has. A preemption process is a process for occupying resources in advance. On receiving a resource allocation request, the dynamic resource allocation unit 220 assigns as many preemption processes as possible, thereby allocating the maximum available resources to the query processing engine so that it can perform the query.
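The idea of one preemption process per CPU, all handed out on request, can be sketched as a slot table. `DynamicResourceAllocator` and `allocate_max` are hypothetical names, and each preemption process is modeled simply as a slot that records its current owner; the actual process mechanics are not described in the text and are not modeled here.

```python
class DynamicResourceAllocator:
    """Toy allocator: one preemption-process slot per CPU on each data node."""

    def __init__(self, cpu_counts: dict[str, int]) -> None:
        # slots[node][i] is the engine that owns the slot, or None if it is free.
        self.slots: dict[str, list[str | None]] = {
            node: [None] * cpus for node, cpus in cpu_counts.items()
        }

    def allocate_max(self, engine: str) -> int:
        """Give every free slot to the requesting engine; return how many were granted."""
        granted = 0
        for owners in self.slots.values():
            for i, owner in enumerate(owners):
                if owner is None:
                    owners[i] = engine
                    granted += 1
        return granted
```

Because the first requester takes everything that is free, a second engine asking immediately afterwards receives nothing until slots are reclaimed, which is why the reallocation step of FIG. 4 matters.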

The query processing engine that has been allocated resources performs the query (S430). If, while the query processing engine is performing the query, the query processing engine selection unit 210 receives a new query and selects a query processing engine to process it, and resources are then requested from the dynamic resource allocation unit 220, it becomes necessary to readjust the resources used to perform queries according to factors such as the priority of the query to be performed. The dynamic resource allocation unit 220 receives reports of the resources in use from the preemption processes of the data nodes 120 and, when utilization is low, reclaims preemption processes and readjusts the resource allocation (S432). The query processing engine determines whether the query has finished (S434) and, when it has, stores the query execution history information in the query history log 230 (S440). The history information includes, but is not limited to, the query text, the query execution time, the amount of memory used, and the CPU utilization during query execution. The query processing engine completes the query, generates the result, and passes the result to the query processing engine selection unit (S460).
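Reclaiming under-used preemption processes (S432) can be sketched as a filter over reported utilization. The 20% threshold, the slot identifiers, and the function name `reclaim_idle` are assumptions made for illustration; the text says only that preemption processes are reclaimed when utilization is low.

```python
def reclaim_idle(assignments: dict[str, str], utilization: dict[str, float],
                 threshold: float = 0.2) -> list[str]:
    """Remove and return slots whose reported utilization is below the threshold.

    assignments maps a slot id (for example "dn01/0") to the engine holding it;
    utilization maps a slot id to the fraction of the slot in use (0.0 to 1.0).
    Slots with no report are treated as busy and left alone.
    """
    reclaimed = [slot for slot in assignments
                 if utilization.get(slot, 1.0) < threshold]
    for slot in reclaimed:
        del assignments[slot]
    return reclaimed
```

The returned slots are then free to be handed to a newly arrived, higher-priority query.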

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment belongs will be able to make various modifications and variations without departing from its essential characteristics. The present embodiments are therefore intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiment.

110 Name node 120 Data node
200 Big data processing system 210 Query processing engine selection unit
220 Dynamic resource allocation unit 230 Query history log
240 Query processing engine unit 250 Hadoop cluster

Claims

A query processing engine selection method in a data processing system including a plurality of query processing engines, the method comprising:
receiving a query;
a query processing engine evaluation process of evaluating the plurality of query processing engines that are to perform the query;
transmitting the query to the query processing engine selected as a result of the evaluation in the query processing engine evaluation process; and
a query execution process in which the selected query processing engine performs the query,
wherein the query processing engine evaluation process comprises: receiving an execution plan from each of the plurality of query processing engines; extracting, from the execution plan, the number of execution steps of the query and the number of operations of the query that can be executed in a distributed manner; and selecting one of the plurality of query processing engines by comparing the number of execution steps and the number of operations that can be executed in a distributed manner.

delete

The method according to claim 1,
wherein the query processing engine evaluation process includes an execution-plan-based evaluation process of performing an evaluation based on the execution plan.

delete

5. The method of claim 4,
wherein the execution-plan-based evaluation process is performed when there is no history information on a query that is identical to the query, similar to the query, or evaluated to be the same as the query when a plurality of queries are combined.

A query processing engine selection method in a data processing system including a plurality of query processing engines, the method comprising:
receiving a query;
a query processing engine evaluation process of evaluating the plurality of query processing engines that are to perform the query;
transmitting the query to the query processing engine selected as a result of the evaluation in the query processing engine evaluation process; and
a query execution process in which the selected query processing engine performs the query,
wherein the query processing engine evaluation process includes a hybrid evaluation process of weighting and evaluating the execution time of the query from a history-based evaluation together with the number of execution steps of the query and the number of operations that can be executed in a distributed manner from an execution-plan-based evaluation.

8. The method of claim 7,
wherein the hybrid evaluation process is performed when the history information relates to a query similar to the query or to a query evaluated to be the same as the query when a plurality of queries are combined, or when the resource information is not similar even though the query is the same.

8. The method of claim 7,
Wherein the step of performing the query includes a step of allocating a resource from a dynamic resource allocator.

10. The method of claim 9,
Wherein the dynamic resource assignment unit allocates a preemption process occupying resources in a plurality of data nodes (DataNode) storing data.

11. The method of claim 10,
Wherein the number of preempting processes is equal to the number of CPUs of the data node.

10. The method of claim 9,
wherein the dynamic resource allocation unit is capable of reclaiming a preemption process allocated to the query processing engine.

10. The method of claim 9,
wherein the dynamic resource allocation unit holds information on the minimum and maximum number of preemption processes to be allocated to the query processing engine and on the minimum and maximum time for which the query processing engine can use the preemption processes.

A query processing engine selection method in a data processing system including a plurality of query processing engines, the method comprising:
receiving a query;
a query processing engine evaluation process of evaluating the plurality of query processing engines that are to perform the query;
transmitting the query to the query processing engine selected as a result of the evaluation in the query processing engine evaluation process; and
a query execution process in which the selected query processing engine performs the query,
wherein the query execution process includes storing, in a query history log, history information of the query including the execution time of the query and the memory information and CPU information used in executing the query.

delete

A query processing engine selection apparatus comprising:
a query processing engine selection unit that receives a query and selects one of a plurality of query processing engines;
a query history log that stores history information on the query;
a dynamic resource allocation unit that allocates resources for performing the query; and
a data storage unit including a plurality of data nodes,
wherein the query processing engine selection unit includes an execution-plan-based evaluation unit that receives an execution plan from each of the plurality of query processing engines, extracts from the execution plan the number of execution steps of the query and the number of operations of the query that can be executed in a distributed manner, and selects one of the plurality of query processing engines by comparing the number of execution steps and the number of operations that can be executed in a distributed manner.

A query processing engine selection apparatus comprising:
a query processing engine selection unit that receives a query and selects one of a plurality of query processing engines;
a query history log that stores history information on the query;
a dynamic resource allocation unit that allocates resources for performing the query; and
a data storage unit including a plurality of data nodes,
wherein the query processing engine selection unit includes a hybrid evaluation unit that weights and evaluates the execution time of the query from a history-based evaluation together with the number of execution steps of the query and the number of operations that can be executed in a distributed manner from an execution-plan-based evaluation.