KR100511164B1

KR100511164B1 - Method for caching index data for and operations based on real-time analysis of user's queries in a web search engine

Info

Publication number: KR100511164B1
Application number: KR10-2002-0078185A
Authority: KR
Inventors: 임성채
Original assignee: 정보통신연구진흥원; 주식회사 코리아 와이즈넛
Priority date: 2002-12-10
Filing date: 2002-12-10
Publication date: 2005-08-31
Also published as: KR20040050370A

Abstract

본 발명에 따른 웹 검색 엔진에서의 AND 질의 처리를 위한 색인 데이터의 캐슁 방법은 캐슁하는 메모리 영역을 응용프로그램의 메모리 공간에 국한하지 않고 OS(Operating system) 커널의 I/O 버퍼링 영역까지 포함시킨다. 또한, 디스크에 저장된 색인 데이터 중에서 캐슁할 데이터를 보다 정확하게 결정하기 위해서, 입력된 AND 질의어에 포함된 검색 대상 단어들에 대한 중요도를 실시간으로 분석하여 캐슁할 색인 데이터를 결정한다. 또한, 본 발명에 따른 색인 데이터의 캐슁 방법은, OS 커널과 응용프로그램이 사용할 수 있는 메모리 공간을 최대로 하기 위해서, 캐슁하지 않을 데이터에 대해서는 OS 커널의 버퍼링 공간에 저장되지 않도록 한다. 본 발명에 의하면, 웹 검색 엔진을 사용하는 사용자의 AND 질의에 포함된 단어들의 사용빈도 등에 기초하여 앞으로 가장 사용될 가능성이 큰 단어들에 해당하는 색인 데이터를 메모리에 캐슁함으로써, 웹 검색 엔진의 잦은 디스크 I/O(Input/Output operation)에 의한 성능 저하를 방지할 수 있다. 또한, 본 발명에 의하면, AOP의 내부 메모리 영역 뿐만아니라 OS의 I/O 버퍼링 영역를 함께 캐슁 메모리 영역으로 사용함으로써 색인 데이터의 캐슁을 효율적으로 수행할 수 있다.The caching method of index data for AND query processing in the web search engine according to the present invention includes not only the memory area to be cached but also the I / O buffering area of an operating system (OS) kernel. Also, in order to more accurately determine the data to be cached among the index data stored in the disk, the importance of the search target words included in the input AND query is analyzed in real time to determine the index data to be cached. In addition, the index data caching method according to the present invention, in order to maximize the memory space available to the OS kernel and the application program, so that data not to be cached is not stored in the buffering space of the OS kernel. According to the present invention, by frequently indexing the index data corresponding to the words most likely to be used in the future based on the frequency of use of the words included in the AND query of the user using the web search engine, frequent disks of the web search engine Performance degradation due to input / output operations (I / O) can be prevented. In addition, according to the present invention, by using not only the internal memory area of the AOP but also the I / O buffering area of the OS as a cache memory area, caching of index data can be efficiently performed.

Description

METHOD FOR CACHING INDEX DATA FOR AND OPERATIONS BASED ON REAL-TIME ANALYSIS OF USER'S QUERIES IN A WEB SEARCH ENGINE}

본 발명은 인터넷 상에서의 웹 검색 서비스에 관한 것으로, 특히, 웹 검색 서비스를 위한 웹 검색 엔진을 구현함에 있어서 디스크로부터 데이터를 읽을 때 발생하는 병목현상에 의한 성능 저하를 개선할 수 있는 AND 질의 처리를 위한 색인 데이터의 캐슁 방법에 관한 것이다.BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a web search service on the Internet. In particular, in implementing a web search engine for a web search service, an AND query process that can improve performance degradation due to a bottleneck caused when reading data from a disk is provided. It relates to a method of caching index data.

웹 검색 엔진이라 함은, 예를 들어, 엠파스(EMPAS), 네이버(NAVER), 구글(GOOGLE)과 같이, 인터넷 상에서 접근가능한 방대한 양의 웹 페이지를 모아서 저장하고, 이들 웹 페이지를 그 안에 포함된 중요 단어(key word)를 기준으로 하여 색인(indexing)하며, 웹 검색 엔진을 사용하는 사용자의 질의(query)에 대해 그 질의와 연관성이 높다고 평가되는 웹 페이지를 검색하여 사용자에게 보여주는 서비스에 사용되는 응용 프로그램을 의미한다.Web search engines collect and store huge amounts of web pages accessible on the Internet, such as, for example, EMPAS, NAVER, and GOOGLE, and store these web pages Indexing is based on key words, and is used for services that search and display web pages that are highly related to the query of users who use the web search engine. Means application.

웹 검색 서비스의 사용자는 이러한 웹 검색 엔진에 몇 개의 단어로 이루어진 질의(예를 들어, "특허청 위치")를 입력한다. 이와 같이, 두 개 이상의 단어를 포함하는 사용자 질의를 "AND 질의(AND query)"라고 칭하며, 이러한 AND 질의를 웹 검색 엔진에서 처리하는 것을 "AND 질의 처리(AND query processing)"라고 칭한다. 사용자가 AND 질의를 입력하면, 웹 검색 엔진에서는 두 개의 단어(예를 들어, "특허청"과 "위치")를 모두 포함하는 웹 문서를 검색하여 웹 검색 결과를 생성한다. 즉, 입력된 질의에 대해, 웹 검색 엔진은 그 질의에 포함된 두 개 이상의 단어를 색인어(indexed word)로 갖고 있는 웹 페이지(또는 웹 문서)들을 검색하고, 검색된 웹 문서들의 중요도(rank)를 계산하여, 중요도가 높은 순서대로 정렬된 웹 문서들을 웹 검색 결과로서 출력한다. 본 발명에서는, 이러한 방식으로 두 개 이상의 단어로 구성된 사용자 질의를 처리하는 응용 프로그램을 "AND 연산 처리기(AND-Operation Processor: AOP)"라고 칭한다. The user of the web search service enters a query consisting of several words (eg, "patent office location") into this web search engine. As such, a user query that includes two or more words is called an "AND query", and processing such AND queries in a web search engine is called "AND query processing." When the user enters an AND query, the web search engine searches for a web document containing both words (eg, "Patent Office" and "Location") to generate a web search result. That is, for an entered query, the web search engine searches for web pages (or web documents) that have two or more words included in the query as indexed words, and ranks the retrieved web documents. The web documents sorted in order of high importance are output as the web search results. In the present invention, an application program that processes a user query consisting of two or more words in this manner is referred to as an "AND-Operation Processor (AOP)".

도 1은 종래기술에 따른 웹 검색 엔진의 동작을 보여주는 흐름도이다. 먼저, 사용자는 웹 검색 엔진의 입력 창에 n개의 단어(w₁, w₂, ..., w_n)를 포함하는 AND 질의를 입력한다(단계 101). 여기서, n은 양의 정수이며, 이들 n개의 단어는 서로 논리곱(AND)의 관계를 갖고 있다고 가정한다. 사용자의 AND 질의를 입력받은 AOP는 n개의 단어(w₁, w₂, ..., w_n)에 해당되는 색인 데이터를 저장매체(예를 들어, 하드 디스크)로부터 읽는다(단계 102). AOP는 읽어들인 색인 데이터에 기초하여 n개의 단어(w₁, w₂, ..., w_n)가 모두 포함되어 있는 웹 문서를 검색한다 (단계 103). 그리고 나서, AOP는 검색된 웹 문서들에 대해 사전결정된 중요도 결정 알고리즘(ranking algorithm)을 실행하여, 각 웹 문서에 그것의 중요도에 따라 순위(rank)를 부여한다(단계 104). 이와 같이, 검색된 웹 문서들 각각의 중요도가 결정되면, AOP는 중요도에 따라 결정된 순위에 따라 웹 문서들을 정렬하고, 정렬된 웹 문서들의 ID를 웹 화면 생성기에 전송한다(단계 105).1 is a flowchart illustrating the operation of a web search engine according to the prior art. First, a user enters an AND query including _n words (w ₁ , w ₂ , ..., w _n ) in an input window of a web search engine (step 101). Here, n is a positive integer, and these n words are assumed to have a logical AND relationship with each other. The AOP receiving the AND query of the user reads index data corresponding to _n words (w ₁ , w ₂ ,..., W _n ) from the storage medium (eg, a hard disk) (step 102). The AOP retrieves a web document containing all _n words (w ₁ , w ₂ ,..., W _n ) based on the index data read (step 103). The AOP then executes a predetermined ranking algorithm for the retrieved web documents, assigning each web document a rank according to its importance (step 104). As such, when the importance of each of the retrieved web documents is determined, the AOP sorts the web documents according to the ranking determined according to the importance, and transmits the IDs of the sorted web documents to the web screen generator (step 105).

웹 화면 생성기는 AOP로부터 전송 받은 웹 문서들의 ID들 각각에 해당하는 웹 문서에 관한 정보를 이용하여, 해당 웹 문서의 인터넷 상의 위치를 나타내는 URL(universal resource locator) 및 해당 웹 문서의 내용에 대한 요약 정보를 생성한다(단계 106). 그리고, 이렇게 생성된 URL 및 요약 정보를 사용자의 컴퓨터 화면에 디스플레이한다(단계 107).The web screen generator uses information about the web document corresponding to each of the IDs of the web documents received from the AOP, and summarizes the contents of the web document and the URL (universal resource locator) indicating the location of the web document on the Internet. Generate information (step 106). The generated URL and the summary information are then displayed on the user's computer screen (step 107).

일반적으로, AOP는 초당 수십 개씩 입력되는 사용자 질의를 처리하며, AOP는 다양한 단어들을 포함하는 AND 질의를 처리하기 위해서 각 단어에 대해 미리 생성해둔 색인 데이터를 디스크로부터 읽어서 AND 질의의 모든 단어가 포함된 웹 문서들을 검색해야 한다. 이 경우, AOP가 디스크로부터 읽어야 하는 색인 데이터는 일반적으로 수백 기가바이트(Giga-Byte) 이상의 크기를 가지기 때문에, AOP가 AND 질의를 처리하기 위해서는 많은 디스크 I/O(Input/Output operation) 시간을 필요로 하며, 따라서, AOP의 AND 질의 처리 중에 I/O 병목현상(bottleneck)이 발생되기 쉽다. 따라서, AOP가 AND 질의를 효율적으로 처리하기 위해서는 이처럼 큰 크기의 색인 데이터를 하드디스크에서 메모리로 효율적으로 읽어들이는 방법이 필요하며, 특히, 효과적인 색인 데이터의 캐슁 기법을 통하여 디스크로부터 읽혀지는 데이터의 크기 및 데이터의 액세스 횟수를 최소로 하는 것이 매우 중요하다.In general, AOP processes user queries that are input several tens per second, and AOP reads the index data generated for each word from disk to process an AND query that includes various words. You have to search the web documents. In this case, the index data that AOP needs to read from disk typically has a size of more than a few hundred gigabytes, so AOP requires a lot of disk input / output operation (I / O) time to process the AND query. Therefore, an I / O bottleneck is likely to occur during the AND query processing of the AOP. Therefore, in order for AOP to efficiently process AND queries, AOP needs a method of efficiently reading such large sized index data from the hard disk into memory. It is very important to minimize the size and the number of times of data access.

색인 데이터의 디스크로부터 메모리로의 I/O의 횟수를 줄기 위해서는, 디스크에 저장된 색인 데이터 중에서 앞으로 반복적으로 사용될 확률이 높은 것을 AOP가 실행되는 서버의 메모리 영역으로 미리 읽어들이는 것이 바람직하다. 종래의 색인 데이터의 캐슁 방법으로는 LRU(Least Recently Used) 방식이 많이 사용되고 있으며, LRU 방식에서는 디스크로부터 새로운 색인 데이터를 메모리에 읽어들일 때, 최근에 가장 사용 빈도가 작은 색인 데이터가 저장된 메모리 영역에 그 새로운 색인 데이터를 저장하는 방식을 사용한다. LRU 방식은, 일반적으로 최근에 가장 오랫동안 사용되지 않았던 데이터는 앞으로도 사용되지 않을 확률이 가장 크다는 시간적 집약성(temporal locality)에 기반을 두고 있다. In order to reduce the number of I / Os from the disk to the memory of the index data, it is preferable to read in advance into the memory area of the server where the AOP is executed a high probability that the index data stored in the disk will be used repeatedly in the future. Conventionally, the least recently used (LRU) method is used as a caching method of index data. When the new index data is read from the disk into the memory, the LRU method is used to store the most recently used index data in the memory area. Use the way to store the new index data. The LRU scheme is generally based on the temporal locality that data that has not been used for the longest time is most likely not to be used in the future.

그러나, 인터넷 환경에서는 많은 수의 사용자들이 동시에 웹 검색 엔진을 사용하며, 일반적으로 이러한 환경에서 웹 검색 엔진이 처리해야 할 사용자 질의에 포함될 수 있는 단어의 수는 보통 천만 개 이상이다. 또한, 그 사용자 질의에는 매우 다양한 내용의 단어가 포함되어, 앞으로 가장 사용가능성이 높은 색인 데이터를 결정하는 것은 매우 어려운 작업이 된다. 게다가, 사용자가 입력하는 질의에 포함된 단어들의 분포는 시간에 따라 자주 변하기 때문에, 그 단어에 해당하는 캐슁할 데이터 역시 이에 따라 적절히 변화해야 하지만, 종래 기술의 캐슁 기법은 이러한 검색 대상 단어들의 분포의 잦은 변화를 반영하지 못하는 문제점이 있다.However, in an Internet environment, a large number of users use a web search engine at the same time, and generally, in such an environment, the number of words that can be included in a user query to be processed by the web search engine is usually 10 million or more. In addition, the user query includes a wide variety of words, which makes it very difficult to determine the most usable index data in the future. In addition, since the distribution of words included in a query that a user enters frequently changes over time, the data to be cached corresponding to that word must also change accordingly, but the prior art caching technique is based on the distribution of these searched words. There is a problem that does not reflect frequent changes.

한편, 종래 기술의 캐슁 기법은 AOP가 사용하는 내부 메모리 공간을 이용하여 색인 데이터를 캐슁한다. 하지만, AOP가 검색해야할 색인 데이터의 크기가 그 내부 메모리 공간을 초과하는 경우에는, 내부 메모리 공간에 저장되지 않는 색인 데이터의 일부분을 검색하기 위해 디스크 I/O를 별도로 실행해야 하므로 AOP의 전체 성능을 저하시키는 문제점이 발생된다. 따라서, AOP가 실행되는 서버의 메모리 공간을 좀 더 효율적으로 활용하는 기법이 필요하다.Meanwhile, the conventional caching technique uses the internal memory space used by AOP to cache index data. However, if the size of the index data that AOP needs to retrieve exceeds its internal memory space, you must run separate disk I / O to retrieve the portion of the index data that is not stored in the internal memory space. The problem of deterioration arises. Therefore, a technique for more efficiently utilizing the memory space of the server on which AOP is executed is needed.

본 발명은 상기와 같은 종래 기술의 문제점을 해결하기 위한 것으로서, 웹 검색 엔진에서의 AND 질의 처리에 필요한 디스크 I/O의 횟수 및 시간을 최소화 할 수 있는 색인 데이터 캐슁 방법 및 색인 데이터의 구조를 제공한다. 또한, 본 발명은 웹 검색 엔진이 실행되는 서버의 메모리 공간을 좀 더 효율적으로 활용할 수 있는 색인 데이터의 캐슁 방법 및 색인 데이터의 구조를 제공한다. The present invention is to solve the problems of the prior art as described above, to provide an index data caching method and structure of the index data that can minimize the number and time of disk I / O required for AND query processing in the web search engine do. In addition, the present invention provides a method of caching index data and a structure of index data that can more efficiently utilize a memory space of a server running a web search engine.

상기한 목적을 달성하기 위해, 본 발명은 웹 검색엔진에서의 실시간 사용자 질의 분석에 기반한 AND 질의에 대한 색인데이터의 캐슁 방법에 있어서, (a) 상기 AND 질의에 포함된 단어가 제 1 단어 리스트 및 상기 제 2 단어 리스트에 포함되어 있는지 결정하는 단계 - 여기서, 제 1 단어 리스트는 일정기간 동안에 사용자가 상기 검색 엔진에 입력한 AND 질의에 포함된 단어들의 사용 빈도에 기초하여 작성되며, 제 2 단어 리스트는 제 1 단어 리스트의 부분 집합임 -, (b) 상기 (a) 단계에서 상기 AND 질의에 포함된 단어가 상기 제 1 단어 리스트 및 상기 제 2 단어 리스트에 포함되어 있으면, 상기 단어에 해당하는 색인 데이터를 제 1 캐슁 메모리 영역에서 상기 웹 검색 엔진의 메모리 영역으로 읽는 단계, (c) 상기 (a) 단계에서 상기 단어가 상기 제 1 단어 리스트에 포함되어 있으나 상기 제 2 단어 리스트에 포함되어 있지 않으면, 상기 단어에 해당하는 색인 데이터를 제 2 캐슁 메모리 영역에서 상기 웹 검색 엔진의 메모리 영역으로 읽는 단계를 포함하는 방법을 제공한다.In order to achieve the above object, the present invention provides a method of caching index data for an AND query based on real-time user query analysis in a web search engine, wherein (a) a word included in the AND query is a first word list and Determining whether the second word list is included in the second word list, wherein the first word list is generated based on a frequency of use of words included in an AND query input to the search engine by a user for a predetermined period of time; Is a subset of the first word list, (b) if the word included in the AND query in step (a) is included in the first word list and the second word list, an index corresponding to the word Reading data from a first caching memory area into a memory area of the web search engine, (c) in step (a) the word is included in the first word list Is, but provides a method comprising a step to read the memory area of the web search engine to the indexed data is not included in the second word list, corresponding to the word in the second memory area caching.

본 발명에서는 종래 기술에서 주로 사용하였던 캐슁 알고리즘인 LRU(Least Recently Used) 방식을 사용하지 않는다. 또한, 캐슁하는 메모리 영역을 종래기술과 같이 응용프로그램의 메모리 공간에 국한하지 않고 OS(Operating system) 커널의 I/O 버퍼링 영역까지 확장시킨다. 또한, 디스크에 저장된 색인 데이터 중에서 캐슁할 데이터를 보다 정확하게 결정하기 위해서, 입력된 질의에 포함된 단어들에 대한 중요도를 실시간으로 분석하여 캐슁할 색인 데이터를 결정한다. In the present invention, the LRU (Least Recently Used) method, which is a caching algorithm mainly used in the prior art, is not used. In addition, the memory area to be cached is extended to the I / O buffering area of an operating system (OS) kernel without being limited to the memory space of an application program as in the prior art. Also, in order to more accurately determine the data to be cached among the index data stored in the disk, the importance of the words included in the input query is analyzed in real time to determine the index data to be cached.

또한, 본 발명에 따른 색인 데이터의 캐슁 방법은, OS 커널과 응용프로그램이 사용할 수 있는 메모리 공간을 최대로 하기 위해서, 캐슁하지 않을 색인 데이터는 OS 커널의 버퍼링 공간을 거치지 않도록 디스크에서 직접 읽어 사용한다. 이러한 방식으로, 본 발명에 따른 색인 데이터의 캐슁 방법은 메모리에 캐슁된 데이터가 OS의 페이징 동작에 의해 메모리에서 디스크로 스워핑(swapping-out)되는 것을 막을 수 있다. 즉, 캐슁될 전체 색인 데이터의 크기를 서버 내의 사용 가능한 메모리 크기보다 작게 유지하여, 메모리에 일단 캐슁된 데이터가 AOP의 요구가 없이 메모리에서 삭제되는 것을 막을 수 있다.In addition, in the index data caching method according to the present invention, in order to maximize the memory space available to the OS kernel and the application program, the index data not to be cached is read directly from the disk so as not to pass through the OS kernel buffering space. . In this manner, the caching method of the index data according to the present invention can prevent the data cached in the memory from being swapped out from the memory to the disk by the paging operation of the OS. That is, the size of the entire index data to be cached can be kept smaller than the available memory size in the server, thereby preventing data once cached from memory from being deleted from the memory without requiring an AOP.

이하, 본 발명에 따른 색인 데이터의 구조 및 색인 데이터의 캐슁 방법을 첨부된 도면을 참조하여 상세히 설명한다.Hereinafter, the structure of the index data and the caching method of the index data according to the present invention will be described in detail with reference to the accompanying drawings.

발명에 사용되는 색인 데이터의 구조Structure of Index Data Used in the Invention

본 발명에 따른 AOP가 사용하는 색인 데이터는 MCF(Main Catalog File)과 RIF(Rank Information File)로 나뉜다. 도 2는 본 발명에 따른 색인 데이터의 구조, 즉, MCF와 RIF의 구조를 도시하고 있다. 설명의 편의를 위해, 도 2에서는 두 개의 단어(w₁, w₂)에 대한 색인 데이터만을 도시하였으나, 본 발명에 따른 색인 데이터의 구조는 두 개 이상의 단어에 대한 데이터 구조를 포함할 수도 있다. 도 2에 도시된 바와 같이, MCF(210)에는 각 단어(w₁또는 w₂) 별로 해당 단어를 포함하는 웹 문서의 ID(Document ID: DID)(211, 213)와 그 웹 문서의 중요도(rank) 계산에 이용되는 중요도 정보의 위치(212, 214)가 포함된다. 또한, RIF(220)에는 각 단어(w₁또는 w₂)와 관련된 웹 문서의 중요도 계산에 필요한 여러 가지 정보(221, 222)가 포함된다. 이러한 웹 문서의 중요도 계산에 필요한 정보(221, 222)는 MCF(210)에 포함된 중요도 정보의 위치(212, 214)를 이용하여 접근할 수 있다.Index data used by the AOP according to the present invention is divided into MCF (Main Catalog File) and RIF (Rank Information File). 2 illustrates a structure of index data according to the present invention, that is, a structure of MCF and RIF. For convenience of description, FIG. 2 illustrates only index data for two words w ₁ and w ₂ , but the structure of the index data according to the present invention may include a data structure for two or more words. As shown in FIG. 2, the MCF 210 includes IDs (Document IDs) (DIDs) 211 and 213 of web documents including the corresponding words for each word (w ₁ or w ₂ ) and the importance of the web documents ( The positions 212 and 214 of the importance information used for the rank calculation are included. In addition, the RIF 220 includes various pieces of information 221 and 222 necessary for calculating the importance of the web document associated with each word w ₁ or w ₂ . The information 221 and 222 necessary for calculating the importance of such a web document may be accessed using the positions 212 and 214 of the importance information included in the MCF 210.

상기한 본 발명에 따른 색인 데이터의 구조를 이용하여, 두 개의 단어(w₁, w₂)에 대한 AND 연산은 다음과 같은 방식으로 행해진다. 먼저, 단어(w₁)의 색인 데이터의 위치 정보(215)를 이용하여 MCF(210)에서 단어(w₁)에 해당되는 정보들(211, 212)을 메모리로 읽어들이고, 이어서 단어(w₂)의 색인 데이터의 위치 정보(216)를 이용하여 MCF(210)에서 단어(w₂)에 해당되는 정보들(213, 214)을 메모리로 읽어들인다. 두 개의 단어(w₁, w₂)가 동시에 포함된 웹 문서를 찾기 위해서, 두 개의 단어(w₁, w₂)에 해당하는 DID들을 포함하는 정보(211, 213)에서 공통으로 포함된 DID를 찾는다. 이렇게 두 개의 단어(w₁, w₂)가 모두 포함된 웹 문서들에 해당하는 DID를 결정한 후, 결정된 DID와 연관된 중요도 정보의 위치(212, 214)를 이용하여 RIF(220)로부터 두 개의 단어(w₁, w₂)에 대한 중요도 정보(221, 222)를 메모리로 읽어들인다. 그리고 나서, 중요도 정보(221, 222)로부터 두 개의 단어(w1 및 w2)가 포함된 각 웹 문서의 중요도를 계산한다. 웹 문서의 중요도를 결정하는 방법은 본 발명의 범위에 속하지 아니하므로 여기서 더 자세히 기술하지 않는다.Using the structure of the index data according to the present invention described above, the AND operation on two words (w ₁ , w ₂ ) is performed in the following manner. First, read the the information by using the position information 215 of the index data of the word (w ₁₎ for the word (w ₁₎ in the MCF (210) (211, 212 ) into memory, then the word (w ₂ The information 213 and 214 corresponding to the word w ₂ is read from the MCF 210 into the memory by using the position information 216 of the index data of FIG. In order to find a web document containing two words (w ₁ , w ₂ ) at the same time, the DIDs commonly included in the information (211, 213) including the DIDs corresponding to the two words (w ₁ , w ₂ ) Find. After determining the DIDs corresponding to the web documents including the two words (w ₁ , w ₂ ), two words from the RIF 220 using the positions 212 and 214 of the importance information associated with the determined DIDs. The importance information 221, 222 for (w ₁ , w ₂ ) is read into the memory. Then, the importance of each web document including two words w1 and w2 is calculated from the importance information 221, 222. The method of determining the importance of the web document is not included in the scope of the present invention and thus will not be described in more detail here.

도 2에 도시된 RIF(220)에서 빗금친 부분들은 단어(w₁, w₂)의 AND 질의 처리를 수행할 때 중요도 계산을 위해 읽어 들여야 할 RIF 데이터를 나타낸다. 도 2에 도시된 바와 같이, 단어(w₁, w₂)에 연관된 웹 문서들의 중요도를 계산하기 위해 메모리로 읽어야 하는 RIF 데이터는 RIF(220) 상에 산재하여 분포하며, RIF(220) 상에서의 RIF 데이터의 위치 또한 단어(w₁, w₂)의 조합에 따라 매우 유동적으로 변한다. 또한, RIF의 크기는 MCF의 크기에 비해 매우 크기 때문에, 한번 사용한 RIF 데이터를 메모리에 캐슁한다고 해도, 이 데이터가 다시 이용될 확률은 매우 낮다. 이러한 RIF 데이터의 특성 때문에, 본 발명에 따른 캐슁 방법은 바람직하게는 RIF 데이터를 캐슁하지 않으며, 대신에 RIF 데이터는 디스크로부터 AOP의 메모리 영역으로 읽어서 웹 문서의 중요도 계산에 사용된 후 곧바로 메모리에서 삭제된다.The hatched portions of the RIF 220 shown in FIG. 2 represent RIF data to be read for importance calculation when performing AND query processing of the words w ₁ and w ₂ . As shown in FIG. 2, the RIF data that should be read into memory to calculate the importance of web documents associated with the words w ₁ , w ₂ are distributed scattered on the RIF 220 and distributed on the RIF 220. The position of the RIF data also varies very fluidly with the combination of words w ₁ , w ₂ . In addition, since the size of the RIF is very large compared to the size of the MCF, even if the RIF data used once is cached in the memory, the probability of the data being used again is very low. Because of this nature of RIF data, the caching method according to the present invention preferably does not cache RIF data, but instead RIF data is read from disk into the memory area of the AOP and deleted from memory immediately after it is used in calculating the importance of web documents. do.

색인 데이터의 캐슁에 사용되는 메모리의 논리적 구성Logical Organization of Memory Used to Cache Index Data

AOP가 동작하는 서버에 장착된 전체 메모리의 크기를 M_server라고 하자. 또한, 서버에서 동작 중인 다른 프로그램(즉, AOP를 제외한 다른 응용 프로그램)과 OS가 사용하는 메모리의 크기를 M_other라고 할 때, 본 발명에 따른 색인 데이터의 캐슁 방법이 이용할 수 있는 실질적인 메모리 크기는 M_available = M_server - M_other 가 된다. 따라서, 캐슁이 되는 데이터의 크기는 M_available 보다 작게 유지되어야 한다.Let M _server be the total amount of memory installed in the server running AOP. In addition, when the size of the memory used by the OS and other programs running on the server (that is, other applications except AOP) is M _other , the actual memory size that can be used by the caching method of index data according to the present invention is M _available = M _server -M _other Therefore, the size of data to be cached should be kept smaller than M _available .

M_available 이하의 크기를 갖는 캐슁 메모리는 다시 두 개의 메모리 영역으로 분리된다. 즉, 캐슁 메모리는 AOP의 내부 메모리 공간(MS_process)과 OS의 I/O 버퍼링에 사용될 커널 메모리 공간(MS_kernel)으로 나뉜다. 본 발명에 따른 캐슁 방법은 두 메모리 공간(MS_process,MS_kernel)의 크기를 M_available 보다 작게 유지하며, 캐슁할 데이터의 중요도에 따라서 캐슁할 데이터를 MS_process에 위치시킬 것인지 아니면 MS_kernel에 위치시킬 것인지를 결정한다.Cache memory having a size of M _available or smaller is divided into two memory areas. That is, the cache memory is divided into an internal memory space (MS _process ) of the AOP and a kernel memory space (MS _kernel ) to be used for I / O buffering of the OS. The caching method according to the present invention provides two memory spaces (MS _process , Keep the size of the MS _kernel smaller than M _available and decide whether to place the data to be cached in the MS _process or the MS _kernel , depending on the importance of the data to be cached.

도 3은 본 발명에 따른 색인 데이터의 캐슁에 사용할 메모리의 구조를 도시한다. 도 3에 도시된 메모리 영역 중에서 두 메모리 공간(MS_process,MS_kernel)(340, 350)이 본 발명에 따른 색인 데이터의 캐슁 방법이 사용할 캐슁 메모리 영역이다. 본 발명에 따른 색인 데이터의 캐슁 방법의 중요한 특징 중의 하나는 캐슁되지 않을 데이터는 OS에 의해 관리되는 I/O 버퍼링 공간(350)에 저장되지 않도록 한다는 점이다. 일반적으로 응용프로그램이 디스크 I/O를 통해 데이터를 디스크에서 메모리로 읽어 들일 때는, 해당 데이터는 디스크에서 읽혀져 일단 OS의 I/O 버퍼링 공간(350)에 읽혀진 후, 다시 응용프로그램의 메모리 공간(330)으로 복사된다. 즉, 원래 의도된 것은 아니지만, 응용프로그램이 디스크로부터 읽는 데이터는 항상 OS의 I/O 버퍼링 공간(350)에 캐슁되는 것이다.3 illustrates a structure of a memory to be used for caching index data according to the present invention. Among the memory areas shown in FIG. 3, two memory spaces (MS _process , MS _kernels 340 and 350 are cache memory areas to be used by the caching method of index data according to the present invention. One important feature of the method of caching index data according to the present invention is that data that is not to be cached is not stored in the I / O buffering space 350 managed by the OS. In general, when an application reads data from disk to memory via disk I / O, the data is read from disk, once read into the OS's I / O buffering space 350, and then back to the application's memory space (330). Is copied to). That is, although not originally intended, data read from disk by an application is always cached in the OS's I / O buffering space 350.

이렇게 원하지 않은 데이터가 OS의 I/O 버퍼링 공간(350)에 저장된다면, 일반적인 OS의 버퍼링 관리 기법인 LRU 알고리즘에 의해 OS의 I/O 버퍼링 공간(350)에 이미 캐슁된 데이터가 스와핑(swapping-out)되는 현상이 발생되며, 따라서, 자주 불필요한 디스크 I/O가 발생된다. 이러한 문제점을 해결하기 위해, 본 발명에 따른 캐슁 방법은 특별히 고안된 시스템 함수를 이용하여 캐슁하려고 하는 데이터 이외의 데이터는, 디스크에서 읽어서 OS의 I/O 버퍼링 공간(350)에 저장하지 않고, 곧바로 AOP의 메모리 영역(330)에 저장하도록 한다. 이렇게 AOP의 메모리 영역(330)에 저장된 데이터는 AOP의 AND 질의 처리 후에 곧바로 삭제되며, 해당 메모리 영역은 다음의 AND 질의 처리를 위해 이용된다. 즉, AND 질의 처리에 사용된 데이터는 캐슁 공간인 MS_process 및 MS_kernel(340, 350)을 절대로 점유하지 않게 된다.If such unwanted data is stored in the I / O buffering space 350 of the OS, data already cached in the I / O buffering space 350 of the OS is swapped by the LRU algorithm, which is a general OS buffering management technique. out), and thus, unnecessary disk I / O is frequently generated. In order to solve this problem, the caching method according to the present invention does not read data to be cached using a specially designed system function, without reading from disk and storing it in the I / O buffering space 350 of the OS. In the memory area 330. The data stored in the memory area 330 of the AOP is immediately deleted after the AND query processing of the AOP, and the corresponding memory area is used for the next AND query processing. In other words, the data used for AND query processing never occupies the MS _process and MS _kernel (340, 350) the cache space.

색인 데이터의 캐슁 방법How to Cache Index Data

본 발명에 따른 색인 데이터의 캐슁 방법은 캐슁할 색인 데이터에 해당하는 단어들의 목록(LIST_cache)을 이용하여 캐슁을 수행한다. 도 4는 캐슁할 색인 데이터에 해당하는 단어들의 목록(LIST_cache)을 생성하는 과정을 나타낸 흐름도를 도시한다.Caching method of index data in accordance with the present invention performs the caching using the list (LIST _cache) corresponding to the index of data to be caching the word. FIG. 4 is a flowchart illustrating a process of generating a list _{cache of} words corresponding to index data to be cached.

먼저, 웹 검색 엔진을 사용하는 사용자들이 입력한 AND 질의가 기록되어 있는 웹 로그 데이터(web log data)를 분석하여(단계 410), 최근 일정 기간 내에 AND 질의에 자주 포함되는 단어들을 추출한다(단계 420). 그리고 나서, 이렇게 추출된 단어들 중에서 이에 해당하는 색인 데이터의 크기가 일정 크기 이상인 단어들을 다시 추출한다(단계 430). 여기서, 그 크기가 서로 비교되는 색인 데이터는 바람직하게는 MCF(210)이지만, MCF(210) 및 RIF(220) 중의 하나 또는 모두가 포함될 수 있다. 단계 430에서 추출된 단어들을 색인 데이터의 크기 순으로 정렬하여 파일로 저장한다(단계 440). 이렇게 저장된 파일은 LIST_cache로 사용되며, AOP의 AND 질의 처리가 시작될 때 입력되어 캐슁할 색인 데이터를 선택하는데 사용된다.First, web log data in which an AND query input by a user using a web search engine is recorded is recorded (step 410), and words frequently included in an AND query within a recent period (step 410) are extracted. 420). Then, among the words extracted, words having a size larger than or equal to the index data are extracted again (step 430). Here, the index data whose sizes are compared with each other is preferably MCF 210, but may include one or both of MCF 210 and RIF 220. The words extracted in step 430 are sorted in order of the size of the index data and stored in a file (step 440). The saved file is used as a LIST _cache and is used to select index data to be cached when the AND query of AOP starts.

본 발명에 따른 색인 데이터의 캐슁 방법에서는, 사용자가 웹 검색 엔진에 AND 질의를 입력했을 때에, 그 AND 질의에 포함된 각 단어가 LIST_cache에 포함되어 있는지 먼저 판단한다. 일단 LIST_cache에 포함된 단어에 해당하는 색인데이터는 서버의 메모리 공간에 캐슁될 대상이 된다. 앞서 설명한 대로, 본 발명에 따른 캐슁 메모리의 구조는 두 메모리 공간(MS_kernel, MS_process)(340, 350)을 색인 데이터의 캐슁 메모리로 사용한다. 따라서, AOP는 AND 질의에 포함된 단어가 LIST_cache에 포함될 경우에는 그 단어에 해당하는 색인 데이터를 두 메모리 공간(MS_kernel, MS_process)(340, 350) 중에 어디에 저장할 것인지를 결정해야 한다.In the caching method of index data according to the present invention, when a user inputs an AND query to a web search engine, it is first determined whether each word included in the AND query is included in the LIST _cache . Once the index data corresponding to the words contained in the LIST _cache are to be cached in the server memory space. As described above, the cache memory structure according to the present invention uses two memory spaces (MS _kernel , MS _process ) 340 and 350 as caching memory for index data. Therefore, when a word included in an AND query is included in a LIST _cache , the AOP must determine which of two memory spaces (MS _kernel , MS _process ) 340 and 350 store the index data corresponding to the word.

AOP는 다음과 같은 기준에 따라서 캐슁 메모리 공간(MS_kernel, MS_process)(340, 350)에서의 색인 데이터의 저장 위치를 결정한다. 즉, AOP는 LIST_cache에 포함되는 단어 중에서도 좀 더 자주 검색될 것으로 예상되는 단어를 다시 추출하여 LIST_process에 포함시킨다. 그리고, 사용자가 입력한 AND 질의에 포함된 단어가 LIST_process에 포함되어 있는 것일 경우에는 그 단어에 해당하는 색인 데이터를 메모리 공간(MS_process)(350)에 저장하며, 그 단어가 LIST_cache에 포함되어 있으나 LIST_process에 포함되어 있지 않은 경우에는 그 단어에 해당하는 색인 데이터를 메모리 공간(MS_kernel)(340)에 저장한다. 여기서, LIST_process는 사용자가 입력한 AND 질의에 포함된 단어 중에서 가장 사용빈도가 높을 것으로 예상되는 것들의 리스트이므로, LIST_cache에 보다 더 자주 갱신될 필요가 있다. 예를 들어, LIST_cache가 한 달 단위로 갱신된다면, LIST_process는 일 단위 또는 시간 단위로 갱신될 수 있다.The AOP determines the storage location of the index data in the cache memory space (MS _kernel , MS _process ) 340, 350 according to the following criteria. That is, AOP re-extracts the words that are expected to be searched more frequently among the words in the LIST _cache and includes them in the LIST _process . When the word included in the AND query input by the user is included in the LIST _process , index data corresponding to the word is stored in the memory space (MS _process ) 350, and the word is included in the LIST _cache . If it is, but not included in the LIST _process , the index data corresponding to the word is stored in the memory space (MS _kernel ) 340. Here, the LIST _process is a list of the words most frequently used among words included in the AND query input by the user. Therefore, the LIST _process needs to be updated more frequently in the LIST _cache . For example, if the LIST _cache is updated on a monthly basis, the LIST _process can be updated on a daily or hourly basis.

본 발명에 따른 색인 데이터의 캐슁 방법은, LIST_process를 시간변화에 따라 가장 최적으로 관리하기 위해, 사용자가 입력하는 AND 질의에 포함된 단어들을 실시간으로 분석하여 LIST_process를 갱신한다. LIST_process의 갱신은 일정한 시간 간격으로, 즉, 주기적으로 실행되거나, AOP의 요청에 따라 실시간으로 실행될 수 있다. 또한, LIST_process의 갱신 작업은 AOP의 요청에 의해서, 또는 AOP로부터 독립적으로 실행되는 캐쉬 알고리즘 수행 프로세스 또는 쓰레드(thread)에 의해 실행될 수 있다. 이하에서는, LIST_process의 갱신 작업을 수행하는 프로세스를 "LIST_process 갱신 프로세스(LIST_process updating process: LUP)"라고 칭한다.Caching method of index data in accordance with the present invention, in order to manage in the most optimal according to the time variation LIST _process, by analyzing the words in the AND query that the user enters in real time and updates the LIST _process. The update of the LIST _process may be executed at regular time intervals, i.e., periodically, or in real time at the request of the AOP. In addition, the update operation of the LIST _process may be executed by a request of the AOP or by a cache algorithm execution process or a thread executed independently from the AOP. Hereinafter, a process for performing the update operation of the _process LIST "LIST _process update process _{(LIST process updating process: LUP)} " as referred to.

도 5는 본 발명에 따른 LIST_process를 생성하는 방법을 도시한 흐름도이다. AOP가 LUP에게 LIST_process의 갱신을 요청하면(단계 501), LUP는 최근의 LIST_process의갱신 시점으로부터 현재까지 입력된 AND 질의를 수집한다(단계 502). 도 5에는 AOP의 요청에 의하여 LIST_process의 갱신 작업이 시작되는 것으로 기술되어 있으나, LUP는 AOP의 요청과 무관하게 LIST_process의 갱신 작업을 주기적으로 수행할 수도 있다. 다음으로, LUP는 수집된 AND 질의에 포함된 단어들 중에서 LIST_cache에 포함된 단어들을 추출하고(단계 503), 추출된 각각의 단어(w_i)에 대해 F_i와 S_i를 계산한다(단계 504). 여기서, F_i는 단어(w_i)에 대한 사용자의 검색 요청의 횟수를 나타내며, S_i는 단어(w_i)에 해당하는 색인 데이터의 크기를 나타낸다. 다음으로, LUP는 계산된 S_i와 F_i를 이용하여 단어(w_i)의 중요도 I_i를 계산한다(단계 505). 이때, 중요도 I_i는 S_i와 F_i의 값이 클수록 큰 값을 갖게된다. LUP는 이와 같이 계산된 중요도(I_i)를 이용하여 LIST_process의 갱신 작업 및 캐슁 메모리 공간(MS_process)(350)의 갱신 작업을 수행한다.5 is a flowchart illustrating a method of generating a LIST _process according to the present invention. If the AOP requests the LUP to update the LIST _process (step 501), the LUP collects AND queries entered from the time of updating the latest LIST _process to the present (step 502). Although it is described in FIG. 5 that an update operation of a LIST _process is started by a request of an AOP, the LUP may periodically perform an update operation of a LIST _process regardless of an AOP request. Next, the LUP extracts words included in the LIST _cache among the words included in the collected AND query (step 503), and calculates F _i and S _i for each extracted word w _i (step 504). Here, F _i represents the number of times a user requests a search for the word w _i , and S _i represents the size of the index data corresponding to the word w _i . Next, the LUP calculates the importance I _i of the word w _i using the calculated S _i and F _i (step 505). At this time, the importance I _i has a larger value as the values of S _i and F _i are larger. The LUP performs an update operation of the LIST _{process and} an update operation of the cache memory space (MS _process ) 350 using the calculated importance I _i .

즉, LUP는 단어(w_i)에 대한 중요도(I_i) 보다 작은 값의 중요도를 갖는 단어가 LIST_process에 포함되어 있는지를 검사한다(단계 506). 만약, 단어(w_i)에 대한 중요도(I_i) 보다 작은 값의 중요도를 갖는 단어가 LIST_process에 포함되어 있지 않으면, LUP는 현재의 LIST_process를 AOP에 전달하고 LIST_process의 갱신작업을 종료한다(단계 509). 반대로, 만약 단어(w_i)에 대한 중요도(I_i) 보다 작은 값의 중요도를 갖는 단어가 LIST_process에 포함되어 있으면, 해당 단어를 LIST_process에서 삭제하고, 그 단어에 해당하는 색인 데이터를 캐슁 메모리 공간(MS_process)(350)에서 삭제한다(단계 507). 그리고, LUP는 단어(w_i)를 LIST_process에 새로이 포함시키고, 그 단어(w_i)에 해당하는 색인 데이터를 디스크에서 캐슁 메모리 공간(MS_process)(350)으로 읽어들인다(단계 508). 마지막으로 LUP는 갱신된 LIST_process를 AOP에 전달하고 LIST_process의 갱신작업을 종료한다(단계 509). 여기서, 색인 데이터는 바람직하게는 MCF(210)이지만, MCF(210) 및 RIF(220) 중의 하나 또는 모두가 포함될 수 있다.That is, the LUP checks whether a word having an importance of less than the importance I _i for the word w _i is included in the LIST _process (step 506). If, if the words having the importance of a value less than the significance (I _i) is not included in LIST _process for the word (w _i), LUP delivers the current LIST _process the AOP and ends the update process of the LIST _process (Step 509). Conversely, if a word with an importance less than the importance (I _i ) for a word (w _i ) is included in the LIST _process , the word is deleted from the LIST _process and the index data corresponding to that word is cached in memory. Delete in space MS _process 350 (step 507). The LUP newly includes the word w _i in the LIST _process , and reads index data corresponding to the word w _i from the disk into the cache memory space MS _process 350 (step 508). Finally, the LUP delivers the updated LIST _process to the AOP and terminates the update operation of the LIST _process (step 509). Here, the index data is preferably MCF 210, but may include one or both of MCF 210 and RIF 220.

이하에서는, 본 발명에 따라 앞서 상술한 LIST_cache 및 LIST_process를 이용하여 색인 데이터를 캐슁하는 방법을 상세히 기술한다.Hereinafter, a method of caching index data using the above-described LIST _cache and LIST _process according to the present invention will be described in detail.

도 6은 본 발명에 따른 AND 질의 처리를 위한 색인 데이터의 캐슁 방법을 나타내는 흐름도를 도시한다. 사용자가 웹 검색 엔진에 AND 질의(w₁, w₂, ..., w_n)를 입력하면(단계 601), AOP는 AND 질의에 포함된 각 단어(w_i, 여기서, i = 1, 2, ..., n)에 대해 다음 과정을 수행한다.6 is a flowchart illustrating a method of caching index data for AND query processing according to the present invention. If the user enters an AND query (w ₁ , w ₂ , ..., w _n ) into the web search engine (step 601), AOP returns each word (w _i , where i = 1, 2) included in the AND query. , ..., n)

즉, AOP는 단어(w_i)가 LIST_cache에 포함되어 있는지 검사한다(단계 602). 만약 단어(w_i)가 LIST_cache에 포함되어 있으면, 그 단어(w_i)가 LIST_process에도 포함되어 있는지 검사한다(단계 603). 단어(w_i)가 LIST_cache와 LIST_process에 모두 포함되어 있다면, 이것은 단어(w_i)에 해당하는 색인 데이터는 캐슁 메모리 공간(MS_process)(350)에 이미 캐슁되어 있는 것을 의미한다. 따라서, AOP는 캐슁 메모리 공간(MS_process)(350)에 저장된 단어(w_i)에 해당하는 색인 데이터를 찾는다(단계 604). 이 때, 캐슁 메모리 공간(MS_process)(350)에 단어(w_i)에 해당하는 색인 데이터가 저장되지 않다면, AOP는 그 색인 데이터를 디스크로부터 읽어서 캐슁 메모리 공간(MS_process)(350)에 캐슁한다. 그 후, AOP는 캐슁 메모리 공간(MS_process)(350)에 저장된 색인 데이터를 이용하여 웹 문서를 검색하고 해당 메모리 영역을 반환한다(단계 605). 여기서, 색인 데이터는 바람직하게는 MCF(210)이지만, MCF(210) 및 RIF(220) 중의 하나 또는 모두가 포함될 수 있다.That is, the AOP checks if the word w _i is included in the LIST _cache (step 602). If the word w _i is included in the LIST _cache , it is checked whether the word w _i is also included in the LIST _process (step 603). If the word w _i is included in both the LIST _cache and the LIST _process , this means that the index data corresponding to the word w _i is already cached in the cache memory space (MS _process ) 350. Accordingly, the AOP finds index data corresponding to the word w _i stored in the cache memory space (MS _process ) 350 (step 604). At this time, the caching memory space (MS _process) if the index data corresponding to a word (w _i) to (350) is not stored, AOP is caching the caching memory space (MS _process) 350 read the index data from the disk do. The AOP then retrieves the web document using the index data stored in the caching memory space (MS _process ) 350 and returns the corresponding memory area (step 605). Here, the index data is preferably MCF 210, but may include one or both of MCF 210 and RIF 220.

한편, 단계 602 및 단계 603에서의 판단 결과, 단어(w_i)가 LIST_cache에는 포함되어 있으나, LIST_process에는 포함되어 있지 않다면, AOP는 OS 버퍼링 영역(340)을 이용하는 I/O 호출함수를 이용하여 색인 데이터를 OS 버퍼링 영역(340)에서 찾는다(단계 606). 이 때, OS 버퍼링 영역(340)에 단어(w_i)에 해당하는 색인 데이터가 저장되어 있지 않다면, AOP는 그 색인 데이터를 디스크로부터 읽어서 OS 버퍼링 영역(340)에 캐슁한다. 이러한 동작은 I/O 호출함수의 동작 동안에 자동적으로 수행된다. 그후, AOP는 OS 버퍼링 영역(340)에 저장된 색인 데이터를 이용하여 웹 문서를 검색하고 해당 메모리 영역을 반환한다(단계 605).On the other hand, if it is determined in steps 602 and 603 that the word w _i is included in the LIST _cache but not in the LIST _process , the AOP uses an I / O call function using the OS buffering area 340. Index data is retrieved from the OS buffering area 340 (step 606). At this time, if the index data corresponding to the word w _i is not stored in the OS buffering area 340, the AOP reads the index data from the disk and caches the index data in the OS buffering area 340. This operation is performed automatically during the operation of the I / O call function. The AOP then retrieves the web document using the index data stored in OS buffering area 340 and returns the corresponding memory area (step 605).

한편, 단계 602에서의 판단 결과, 단어(w_i)가 LIST_cache에 포함되어 있지 않다면, AOP는 색인 데이터를 디스크로부터 OS의 버퍼링 영역(340)을 거치지 않고 AOP의 메모리 영역(330)으로 직접 읽어들인다(단계 607). 그후, AOP는 읽어들인 색인 데이터를 이용하여 웹 문서를 검색하고 해당 메모리 영역을 반환한다(단계 605).On the other hand, if it is determined in step 602 that the word w _i is not included in the LIST _cache , the AOP reads the index data directly from the disk into the memory area 330 of the AOP without passing through the OS buffering area 340. (Step 607). The AOP then retrieves the web document using the read index data and returns the corresponding memory area (step 605).

이상과 같이 본 발명은 특정 실시예에 의해 설명되었지만, 이것은 본 발명의 범위가 상기한 내용에 한정되는 것을 의도하는 것은 아니며, 아래의 특허청구범위에 정의된 본 발명의 범위를 벗어나지 않는 한도에서 다양한 수정 또는 변경이 가능할 것이다.While the invention has been described above by way of specific embodiments, it is not intended that the scope of the invention be limited to the foregoing, and may vary without departing from the scope of the invention as defined in the following claims. Modifications or changes may be possible.

이상에서 설명한 바와 같이, 본 발명에 의하면, 웹 검색 엔진에서 AND 질의에 포함된 단어들의 분포 및 사용빈도에 기초하여 앞으로 가장 사용될 가능성이 큰 단어들에 해당하는 색인 데이터를 메모리에 캐슁함으로써, 웹 검색 엔진의 잦은 디스크 I/O에 의한 성능 저하를 방지할 수 있다. 또한, 본 발명에 의하면, AOP의 내부 메모리 영역 뿐만아니라 OS의 I/O 버퍼링 영역를 함께 캐슁 메모리 영역으로 사용함으로써 색인 데이터의 캐슁을 효율적으로 수행할 수 있다.As described above, according to the present invention, a web search engine caches index data corresponding to words most likely to be used in the future, based on the distribution and frequency of use of words included in an AND query, in a memory. This can prevent performance degradation due to frequent disk I / O of the engine. In addition, according to the present invention, by using not only the internal memory area of the AOP but also the I / O buffering area of the OS as a cache memory area, caching of index data can be efficiently performed.

도 1은 종래 기술에 따른 웹 검색 엔진이 사용자 AND 질의를 처리하는 과정을 나타내는 흐름도이다.1 is a flowchart illustrating a process of processing a user AND query by a web search engine according to the related art.

도 2는 본 발명에 따른 색인 데이터의 구조를 도시한다.2 shows a structure of index data according to the present invention.

도 3은 본 발명에 따른 캐슁 영역을 포함하는 메모리의 구조를 도시한다.3 illustrates a structure of a memory including a caching area according to the present invention.

도 4는 본 발명에 따른 캐슁할 색인 데이터에 해당하는 단어들의 목록을 생성하는 방법을 나타낸 흐름도를 도시한다. 4 is a flowchart illustrating a method of generating a list of words corresponding to index data to be cached according to the present invention.

도 5는 본 발명에 따른 캐슁할 색인 데이터에 해당하는 단어들의 목록 중에서 자주 사용될 것으로 예상되는 단어들의 목록을 생성하는 방법을 도시한 흐름도이다.5 is a flowchart illustrating a method of generating a list of words that are expected to be frequently used among a list of words corresponding to index data to be cached according to the present invention.

도 6은 본 발명에 따른 AND 질의 처리를 위한 색인 데이터의 캐슁 방법을 나타내는 흐름도를 도시한다.6 is a flowchart illustrating a method of caching index data for AND query processing according to the present invention.

＜도면의 주요부분에 대한 부호의 설명＞<Description of the code | symbol about the principal part of drawing>

320: 운영체제의 기본 메모리 영역320: native memory area of the operating system

330: AOP의 메모리 영역330: Memory area of AOP

340: AOP의 내부 메모리 영역(MS_kernel)340: internal memory area of the AOP (MS _kernel )

350: 운영체제의 I/O 버퍼링 영역(MS_process)350: I / O buffering area (MS _process ) of operating system

Claims

Based on real-time user query analysis in a web search engine performed on a computer including a web search engine memory area storing a first word list and a second word list, a memory including a first memory area and a second memory area, and a disk. In the method of caching index data for an AND query,

The first word list is prepared based on a frequency of use of words included in an AND query input by the user to the web search engine for a predetermined period, and the second word list is a subset of the first word list.

The caching method,

(a) searching whether a word included in the AND query is included in the first word list and the second word list,

(b) if the word included in the AND query is included in the first word list and the second word list in step (a), the index data corresponding to the word is read from the first caching memory area. Searching for a web document that contains words,

(c) if the word is included in the first word list but not included in the second word list in step (a), the index data corresponding to the word is read from the second cache memory area and the word is read. Retrieving a web document containing

Way.

In the method of claim 1,

(d) If the word is not included in either the first word list or the second word list in step (a), the index data corresponding to the word is inputted from the disk to the input / output (I / O) of the operating system of the computer. Outputting the web document containing the word by reading directly into the web search engine memory region without going through a buffering region.

Way.

In the method of claim 1,

In the step (b), if the index data corresponding to the word is not found in the first cache memory area, the index data corresponding to the word is read from the disk and stored in the first cache memory.

Way.

In the method of claim 1,

The step (c) includes reading index data corresponding to the word from the disk and storing the index data corresponding to the word in the second cache memory when the index data corresponding to the word is not found in the second cache memory area.

Way.

In the method of claim 1,

(e) updating the first word list based on web log data storing usage history of the AND query of the user within a predetermined period of time;

Way.

The method of claim 5,

Step (e) is

(e1) analyzing the web log data to extract the most frequently used words within a recent period;

(e2) extracting words of the extracted words whose size of index data corresponding thereto is equal to or greater than a predetermined threshold value,

(e3) sorting the words extracted in the step (e2) in order of the corresponding index data size;

Way.

In the method of claim 1,

(f) updating the second word list by extracting words of high importance from words included in the first word list, wherein the importance is the number of query requests for each of the words included in the first word list; And calculating using the size of the corresponding index data.

Way.

In the method of claim 7,

Step (f)

(f1) collecting AND query terms processed by the web search engine from the time of updating the latest second word list to the present;

(f2) extracting words included in the first word list from among words included in the collected AND query word,

(f3) calculating importance using the number of query requests for each of the words w _i extracted in the step (f2) and the size of the corresponding index data;

(f4) if a word having an importance less than the importance of the word w _i is included in the second word list, the word is deleted from the second word list and the index data corresponding to the deleted word is recalled. the first caching deleted from the memory area, and wherein the insert in the second word list of the word (w _i), and comprising a read, as step index data corresponding to said word (w _i) with the first caching memory area

Way.

In the method of any one of claims 1 to 8,

The second cache memory area is an input / output operation (I / O) buffering area of the operating system.

Way.

In the method of any one of claims 1 to 8,

The index data includes IDs of web documents including words included in the AND query and location information of importance information of the web documents.

Way.