TWI778924B - Method for big data retrieval and system thereof - Google Patents
- Publication number
- TWI778924B (application TW111107107A)
- Authority
- TW
- Taiwan
- Prior art keywords
- data
- unit
- retrieval
- time
- memory
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
本發明係關於一種檢索方法及系統，尤其是一種改善執行效能的大數據檢索方法及系統。 The present invention relates to a retrieval method and system, and more particularly to a big data retrieval method and system that improve execution performance.
近年來，由於資料量的急速成長加上儲存設備成本下降、軟體技術進化與雲端環境越發成熟等種種客觀條件，導致大數據分析得以快速的發展。為了支持大數據的處理，基於Java語言所實現的雲端運算架構Hadoop或Spark，是目前常見的商業大數據平台。由於Hadoop預設使用先進先出(FIFO, First In, First Out)技術，依據工作到達的順序為其賦予優先級，這種排程雖然比較公平，但未考量各工作的所需處理時間不一，容易導致資料處理量偏低及執行時間偏長的情況。 In recent years, big data analytics has developed rapidly, driven by the explosive growth of data volume together with falling storage costs, evolving software technology, and an increasingly mature cloud environment. To support big data processing, the Java-based cloud computing frameworks Hadoop and Spark are currently the most common commercial big data platforms. By default, Hadoop schedules jobs first-in, first-out (FIFO), assigning priority in arrival order. Although this scheduling is relatively fair, it ignores that jobs require different processing times, which easily leads to low throughput and long execution times.
有鑑於此,習知的大數據檢索的方法及系統確實仍有加以改善之必要。 In view of this, it is still necessary to improve the conventional big data retrieval method and system.
為解決上述問題，本發明的目的是提供一種大數據檢索方法及系統，係可以透過預估工作時間的大小順序作為工作排程依據，以降低工作流程的平均等待時間，並提高資料處理效率。 To solve the above problems, an object of the present invention is to provide a big data retrieval method and system that schedule jobs according to the order of their estimated working times, so as to reduce the average waiting time of the workflow and improve data processing efficiency.
本發明的次一目的是提供一種大數據檢索方法及系統,係可以透過堆疊稀疏自動編碼器的應用,提升資料的分群效果。 Another object of the present invention is to provide a big data retrieval method and system, which can improve the grouping effect of data through the application of stacked sparse autoencoders.
本發明的又一目的是提供一種大數據檢索方法及系統，係可以透過基於彈性搜索技術的分散式搜尋引擎，將分群資料進行分散式索引，提高系統的查詢效能。 Another object of the present invention is to provide a big data retrieval method and system in which a distributed search engine based on Elasticsearch technology builds a distributed index over the clustered data, so as to improve the query performance of the system.
本發明全文所記載的元件及構件使用「一」或「一個」之量詞，僅是為了方便使用且提供本發明範圍的通常意義；於本發明中應被解讀為包括一個或至少一個，且單一的概念也包括複數的情況，除非其明顯意指其他意思。 The quantifier "a" or "an" used for the elements and components described throughout the present invention is merely for convenience and to convey the general sense of the scope of the invention; in the present invention it should be construed as including one or at least one, and the singular also includes the plural, unless another meaning is clearly intended.
本發明全文所述「耦接」用語，包含電性及/或訊號地直接或間接連接，係本領域中具有通常知識者可以依據使用需求予以選擇者。 The term "coupled" as used throughout the present disclosure covers direct or indirect electrical and/or signal connections, which those of ordinary skill in the art can select according to usage requirements.
本發明全文所述之「電腦(Computer)」，係指具備特定功能且以硬體或硬體與軟體實現的各式資料處理裝置，特別是具有一處理器以處理分析資訊及/或產生對應控制資訊，例如：伺服器、虛擬機器、桌上型電腦、筆記型電腦、平板電腦或智慧型手機等，係本發明所屬技術領域中具有通常知識者可以理解。 "Computer" as used throughout this disclosure refers to any data processing device with specific functions implemented in hardware, or in hardware and software, in particular one having a processor for processing and analyzing information and/or generating corresponding control information, such as a server, virtual machine, desktop computer, notebook computer, tablet computer or smartphone, as can be understood by those of ordinary skill in the art to which the present invention pertains.
本發明的大數據檢索方法，係透過一電腦執行以下各步驟，包含：透過多個稀疏自動編碼器經逐層訓練後堆疊組成的一堆疊稀疏自動編碼器的神經網路，建立一學習模型；將具有多個維度的多個原始數據輸入至該學習模型，並透過該學習模型將該數個原始數據投影至三維空間的八大象限中，且將對應各原始數據的一象限值附加至該原始數據，以形成分群資料；基於彈性搜索技術且應用分散式搜尋引擎將分群資料進行分散式索引，以對應該分群資料中之每一者的象限值產生對應的索引值；再將該等對應的索引值存入一分散式檔案系統中，以供進行一資料檢索工作；基於深度神經網路預建立一時間運算模型，該時間運算模型係根據該資料檢索工作中的資料量大小、資料列數、資料欄數、程式時間複雜度、使用的檢索引擎環境及系統剩餘記憶體大小，以獲得一預估工作時間，用於預測執行該資料檢索工作所花費的時間；在有多個資料檢索工作為待處理的一狀態中，藉由該時間運算模型獲取對應各該資料檢索工作的該預估工作時間，並依據該多個預估工作時間的大小順序決定對應的多個資料檢索工作的執行順序。 The big data retrieval method of the present invention performs the following steps on a computer: building a learning model from a stacked sparse autoencoder neural network composed of a plurality of sparse autoencoders stacked after layer-by-layer training; feeding a plurality of multi-dimensional raw data records into the learning model, which projects them into the eight octants of a three-dimensional space and appends to each record the octant value it falls in, thereby forming clustered data; building, based on Elasticsearch technology and using a distributed search engine, a distributed index over the clustered data so that each record's octant value yields a corresponding index value; storing these index values in a distributed file system for use by a data retrieval job; pre-building, based on a deep neural network, a time operation model that predicts an estimated working time for executing the data retrieval job from the data size, number of data rows, number of data columns, program time complexity, retrieval engine environment used, and remaining system memory; and, in a state where multiple data retrieval jobs are pending, obtaining each job's estimated working time from the time operation model and deciding the execution order of the corresponding jobs according to the magnitudes of these estimated working times.
本發明的大數據檢索系統，包含：一電腦，具有一處理單元、一記憶體單元及一硬碟單元，該處理單元具有一處理器，該記憶體單元與該硬碟單元耦接該處理單元；一分群索引模組，具有一分群單元及一索引單元耦接該分群單元；該分群單元透過多個稀疏自動編碼器經逐層訓練後堆疊組成的一堆疊稀疏自動編碼器的神經網路以建立一學習模型；透過該學習模型，將多個維度的多個原始數據投影至三維空間的八大象限中，且將對應各原始數據的一象限值附加至該原始數據，以形成分群資料；該索引單元接收該分群單元所獲得的分群資料，並基於彈性搜索應用分散式搜尋引擎將該等分群資料進行分散式索引，以對應該分群資料中之每一者的象限值產生對應的索引值；再將該等對應的索引值存入一分散式檔案系統中，以供進行一資料檢索工作；及一排程模組，具有一時間估算模組與一排序單元；該時間估算模組具有基於深度神經網路預建立的一時間運算模型，該時間運算模型係根據該資料檢索工作中的資料量大小、資料列數、資料欄數、程式時間複雜度、使用的檢索引擎環境及系統剩餘記憶體大小，以獲得一預估工作時間，用於預測執行該資料檢索工作所花費的時間；在有多個資料檢索工作為待處理的一狀態中，該排序單元接收對應的多個預估工作時間，並依據該多個預估工作時間的大小順序決定對應的多個資料檢索工作的執行順序。 The big data retrieval system of the present invention comprises: a computer having a processing unit, a memory unit and a hard disk unit, the processing unit having a processor, and the memory unit and hard disk unit being coupled to the processing unit; a clustering index module having a grouping unit and an index unit coupled to the grouping unit, wherein the grouping unit builds a learning model from a stacked sparse autoencoder neural network composed of multiple sparse autoencoders stacked after layer-by-layer training, and, through this learning model, projects multiple multi-dimensional raw data records into the eight octants of a three-dimensional space and appends to each record its octant value to form clustered data, while the index unit receives the clustered data obtained by the grouping unit and, applying a distributed search engine based on Elasticsearch, builds a distributed index so that each record's octant value yields a corresponding index value, these index values then being stored in a distributed file system for use by a data retrieval job; and a scheduling module having a time estimation module and a sorting unit, wherein the time estimation module holds a time operation model pre-built on a deep neural network that predicts an estimated working time for executing the data retrieval job from the data size, number of data rows, number of data columns, program time complexity, retrieval engine environment used, and remaining system memory, and, in a state where multiple data retrieval jobs are pending, the sorting unit receives the corresponding estimated working times and decides the execution order of the corresponding jobs according to their magnitudes.
據此，本發明的大數據檢索方法及系統，透過堆疊稀疏自動編碼器可提升資料的分群效果；透過彈性搜索技術將分群資料進行分散式索引，可提高系統的查詢效能；透過基於深度神經網路的一時間運算模型，以獲得一預估工作時間，並依據該多個預估工作時間的大小順序決定對應執行順序，以降低工作流程的平均等待時間，並提高資料處理效率。 Accordingly, in the big data retrieval method and system of the present invention, the stacked sparse autoencoder improves the clustering of the data; distributed indexing of the clustered data via Elasticsearch technology improves the system's query performance; and the deep-neural-network-based time operation model yields estimated working times whose magnitudes determine the execution order, thereby reducing the average waiting time of the workflow and improving data processing efficiency.
其中，該多個資料檢索工作的執行順序係可依據該多個資料檢索工作所各別對應的多個優先度與該多個預估工作時間決定；各該優先度是各資料檢索工作中一預先定義或依一預定義方式而設定的一數值。如此，可依據優先度與預估工作時間決定對應執行順序，以降低工作流程的平均等待時間，並提高資料處理效率。 The execution order of the multiple data retrieval jobs may be decided from the jobs' respective priorities together with their estimated working times; each priority is a value predefined in, or set in a predefined manner for, each data retrieval job. In this way, the execution order can be decided from priority and estimated working time, reducing the average waiting time of the workflow and improving data processing efficiency.
其中，該時間複雜度可根據該資料檢索工作所包含的迴圈指令及/或呼叫函式彼此間的一關係，計算出一總執行次數以作為該時間複雜度的一計算指標。如此，可透過時間複雜度的計算，獲取精確的預估工作時間。 The time complexity may be computed from a relationship among the loop instructions and/or function calls contained in the data retrieval job, counting a total number of executions as an index of the time complexity. In this way, an accurate estimated working time can be obtained through the time-complexity calculation.
其中，該時間複雜度可根據該資料檢索工作所包含的迴圈指令及/或呼叫函式彼此間的一關係，計算出一總執行次數，並取該總執行次數中冪次，且忽略其所有係數，並以其中冪次數值為最大者作為時間複雜度的計算指標。如此，可透過取冪次進行計算時間複雜度的方式，適度簡化時間複雜度計算，而能提升計算效率，且同時仍具有反應一程式執行時間長短的效果。 The time complexity may also be computed by counting a total number of executions from the relationships among the loop instructions and/or function calls in the data retrieval job, keeping only the power terms of that total, dropping all coefficients, and taking the largest power as the time-complexity index. Taking powers in this way moderately simplifies the time-complexity calculation, improving computational efficiency while still reflecting how long a program takes to execute.
其中，於執行該資料檢索工作時，根據該電腦的一記憶體單元的剩餘記憶體與全部記憶體間的一可用記憶體百分比值選擇檢索引擎；當該可用記憶體百分比值小於25%時，選擇Hive引擎；當該可用記憶體百分比值不小於25%且不大於50%時，選擇Impala引擎；當該可用記憶體百分比值大於50%時，選擇SparkSQL引擎；其中，該記憶體單元的容量為20 GB。如此，可透過可用記憶體百分比值的大小選擇合適的檢索引擎，以發揮檢索引擎的最佳效能，具有提升系統整體效能的功效。 When the data retrieval job is executed, the retrieval engine is selected according to the percentage of available memory, i.e. the ratio of the remaining memory of a memory unit of the computer to its total memory: the Hive engine is selected when the percentage is below 25%, the Impala engine when it is between 25% and 50% inclusive, and the SparkSQL engine when it is above 50%; here the capacity of the memory unit is 20 GB. In this way, a suitable retrieval engine is chosen from the available-memory percentage so that each engine is used where it performs best, improving overall system performance.
其中，當該電腦接收一資料檢索工作時，先自該電腦的一記憶體單元查詢是否有一對應資料，若自該記憶體單元命中該對應資料，該電腦由該記憶體單元讀取該對應資料並寫入到一輸出檔案；若自該記憶體單元未命中該對應資料，則該電腦自該硬碟單元查詢是否有該對應資料；若自該硬碟單元命中該對應資料，該電腦由該硬碟單元讀取該對應資料並寫入到該輸出檔案；若自該硬碟單元未命中該對應資料，該資料檢索工作將先透過該時間運算模型預測一對應的預估工作時間；同時考量該優先度與該預估工作時間，安排對應的工作執行順序；並於執行該資料檢索工作時，依當前可用記憶體百分比值的一關係選擇對應的檢索引擎，以自該分散式檔案系統中檢索是否有該對應資料。如此，可透過先檢索記憶體單元與硬碟單元的機制，節省重複檢索所造成的過多硬體資源消耗，具有節省檢索系統資源的功效。 When the computer receives a data retrieval job, it first queries a memory unit of the computer for the corresponding data; on a memory hit, the computer reads the data from the memory unit and writes it to an output file. On a memory miss, the computer queries the hard disk unit; on a disk hit, it reads the data from the hard disk unit and writes it to the output file. On a disk miss, the job's estimated working time is first predicted by the time operation model, the job is scheduled considering both its priority and its estimated working time, and, when the job executes, the retrieval engine is chosen from the current available-memory percentage to search the distributed file system for the corresponding data. Checking the memory unit and the hard disk unit first in this way avoids the excessive hardware resource consumption of repeated retrieval and saves retrieval system resources.
1:電腦 1: Computer
11:處理單元 11: Processing unit
12:記憶體單元 12: Memory unit
13:硬碟單元 13: Hard disk unit
2:分群索引模組 2: Clustering index module
21:分群單元 21: Grouping unit
22:索引單元 22: Index unit
3:排程模組 3: Scheduling module
31:時間估算模組 31: Time Estimation Module
32:排序單元 32: Sort unit
4:檢索引擎選擇模組 4: Search engine selection module
S1:稀疏性模型建立步驟 S1: Steps to build a sparsity model
S2:資料分群步驟 S2: Data grouping step
S3:資料索引分散化步驟 S3: Data index decentralization step
S4:時間運算模型建立與估算步驟 S4: Time operation model establishment and estimation steps
S5,S5’:工作排程步驟 S5, S5': work scheduling steps
S6:檢索引擎選擇步驟 S6: Search engine selection step
S7,S7’:資料檢索執行步驟 S7, S7': data retrieval execution steps
〔第1圖〕本發明一實施例的系統架構圖。 [FIG. 1] A system architecture diagram of an embodiment of the present invention.
〔第2圖〕根據本發明系統的方法流程圖。 [FIG. 2] A flow chart of a method according to the system of the present invention.
為讓本發明之上述及其他目的、特徵及優點能更明顯易懂,下文特舉本發明之較佳實施例,並配合所附圖式作詳細說明;此外,在不同圖式中標示相同符號者視為相同,會省略其說明。 In order to make the above-mentioned and other objects, features and advantages of the present invention more obvious and easy to understand, the preferred embodiments of the present invention are given below and described in detail with the accompanying drawings; in addition, the same symbols are marked in different drawings. are considered to be the same, and their descriptions will be omitted.
請一併參照第1圖，其係本發明大數據檢索系統的一實施例的方塊圖，係包含一電腦1、一分群索引模組2、一排程模組3及一檢索引擎選擇模組4，其中電腦1耦接該分群索引模組2、該排程模組3及該檢索引擎選擇模組4。 Please refer to FIG. 1, a block diagram of an embodiment of the big data retrieval system of the present invention, which includes a computer 1, a clustering index module 2, a scheduling module 3 and a retrieval engine selection module 4, wherein the computer 1 is coupled to the clustering index module 2, the scheduling module 3 and the retrieval engine selection module 4.
該電腦1包含一處理單元11、一記憶體單元12及一硬碟單元13。該處理單元11包含一個或多個諸如微處理器、微控制器、數位信號處理器、中央處理器、可編程邏輯控制器、類比電路、數位電路或具有前述功能的電路板等運算元件。該記憶體單元12及該硬碟單元13耦接該處理單元11，以協同該處理單元11進行訊號處理與儲存等工作。應注意的是，本發明各元件、各單元、各模組、各功能或各邏輯等執行/計算，雖未直接說明係透過該電腦之該處理單元中的處理器運算執行，惟此部分係本發明所屬技術領域中具有通常知識者可以理解，而未多加贅述。 The computer 1 includes a processing unit 11, a memory unit 12 and a hard disk unit 13. The processing unit 11 includes one or more computing elements such as a microprocessor, microcontroller, digital signal processor, central processing unit, programmable logic controller, analog circuit, digital circuit, or a circuit board with the aforementioned functions. The memory unit 12 and the hard disk unit 13 are coupled to the processing unit 11 and cooperate with it for signal processing, storage and similar work. It should be noted that, although the execution/computation of each element, unit, module, function or logic of the present invention is not always explicitly described as being carried out by the processor of the computer's processing unit, this can be understood by those of ordinary skill in the art to which the present invention pertains and is not elaborated further.
該分群索引模組2包含一分群單元/子模組21及一索引單元/子模組22，其中該索引單元22耦接該分群單元21。該分群單元21包含一堆疊稀疏自動編碼器(SSAE,Stacked Sparse Autoencoder)，堆疊稀疏自動編碼器包含多個稀疏自動編碼器(SAE,Sparse Autoencoder)，亦即，該堆疊稀疏自動編碼器係由多個稀疏自動編碼器經逐層訓練後堆疊組成的神經網路的一學習模型。 The clustering index module 2 includes a grouping unit/sub-module 21 and an index unit/sub-module 22, wherein the index unit 22 is coupled to the grouping unit 21. The grouping unit 21 includes a stacked sparse autoencoder (SSAE), which comprises multiple sparse autoencoders (SAE); that is, the stacked sparse autoencoder is a learning model of a neural network formed by stacking multiple sparse autoencoders after layer-by-layer training.
其中，習知的資料分群學習模型係採用自動編碼器(AE,Autoencoder)進行訓練，在訓練前述自動編碼器時隱藏層的節點會過於頻繁地激活，而具有太多的自由度，容易造成過度擬合的結果。為了降低隱藏層節點激活比率，本實施例針對編碼端增加了稀疏的約束函數，使每個節點僅能被特定類型的輸入信號激活，並非全部輸入信號皆能激活每個節點，以提取更相關的特徵。詳言之，添加稀疏約束的自動編碼器即為上述的稀疏自動編碼器，其損失函數定義如下公式(1)~(3)所示。 A conventional data-clustering learning model is trained with an autoencoder (AE). When training such an autoencoder, the hidden-layer nodes activate too frequently and have too many degrees of freedom, which easily causes overfitting. To lower the activation ratio of the hidden-layer nodes, this embodiment adds a sparsity constraint on the encoder side, so that each node can be activated only by a specific type of input signal rather than by every input, thereby extracting more relevant features. In detail, an autoencoder with this sparsity constraint is the sparse autoencoder described above, whose loss function is defined by formulas (1) to (3).
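For reference, a sparse autoencoder's loss is commonly written as a reconstruction error plus a KL-divergence sparsity penalty; the sketch below is the standard textbook form (the sparsity weight β, target activation ρ, and mean activation ρ̂ⱼ are conventional symbols and are not necessarily identical to the patent's own formulas (1)~(3)).

```latex
% Standard textbook form of a sparse-autoencoder loss, shown for
% reference only; the patent's formulas (1)-(3) may differ in detail.
\begin{align}
J_{\mathrm{sparse}}(W,b) &= \frac{1}{m}\sum_{i=1}^{m}\big\lVert \hat{x}^{(i)}-x^{(i)}\big\rVert^{2}
  \;+\; \beta\sum_{j=1}^{s}\mathrm{KL}\big(\rho \,\big\Vert\, \hat{\rho}_{j}\big) \\
\mathrm{KL}\big(\rho \,\big\Vert\, \hat{\rho}_{j}\big) &= \rho\log\frac{\rho}{\hat{\rho}_{j}}
  \;+\; (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_{j}} \\
\hat{\rho}_{j} &= \frac{1}{m}\sum_{i=1}^{m} a_{j}\big(x^{(i)}\big)
\end{align}
```

Here the first term is the mean-squared reconstruction error over the m training records, and the penalty pushes each hidden node's mean activation ρ̂ⱼ toward the small target ρ, which is what limits how often hidden nodes fire.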
由上述多層稀疏自動編碼器(SAE)逐層訓練後，堆疊組成的神經網路稱為堆疊稀疏自動編碼器(SSAE)。其中，所應用的訓練機制為：前一階段的編碼後輸出為下一階段的輸入。詳言之，輸入層經由第一階段稀疏自動編碼器訓練完之後得到第一隱藏層的n個節點；並由第一隱藏層的n個節點再經過第二階段稀疏自動編碼器訓練得到第二隱藏層的k個節點；再由第二隱藏層的k個節點經過第三階段稀疏自動編碼器訓練得到第三隱藏層的數個節點，並依照此等模式類推至多個稀疏階段的訓練。換言之，每層的隱藏層節點可視為由上一層產生新的一組特徵，經過這樣逐層訓練可以訓練更多層，最後再把所有訓練結果結合起來，從而得到完成訓練後的學習模型。 After the multi-layer sparse autoencoders (SAE) above are trained layer by layer, the stacked neural network is called a stacked sparse autoencoder (SSAE). The training mechanism applied is: the encoded output of one stage is the input of the next stage. In detail, after the input layer is trained by the first-stage sparse autoencoder, the n nodes of the first hidden layer are obtained; the n nodes of the first hidden layer are then trained by the second-stage sparse autoencoder to obtain the k nodes of the second hidden layer; the k nodes of the second hidden layer are trained by the third-stage sparse autoencoder to obtain the nodes of the third hidden layer, and so on for further sparse stages. In other words, the hidden nodes of each layer can be regarded as a new set of features produced from the previous layer; training layer by layer in this way allows more layers to be trained, and finally all the training results are combined to obtain the trained learning model.
藉此，可利用上述訓練後的堆疊稀疏自動編碼器的學習模型對資料進行分群工作。其中，堆疊稀疏自動編碼器的輸入層與輸出層的維度向量為資料筆數乘上資料欄位，並可利用四層的堆疊稀疏自動編碼器的編碼端(encoder)作為非監督式學習的架構。每層的輸出都連接到連續層的輸入來將資料維度降至三維，再以相反順序運行每個堆疊稀疏自動編碼器的解碼端(decoder)，將資料維度恢復到原始維度。在一範例中，堆疊稀疏自動編碼器所使用的損失函數可為MSE，激活函數可為Sigmoid，優化器可為Adam。在堆疊稀疏自動編碼器訓練完成後，資料分群方式就是取出編碼端最後一個隱藏層的輸出，其代表將每筆資料經由該編碼端映射到三維空間中的一點，並根據前述三維空間中的X、Y、Z軸自動將資料分成八群，以將對應的象限值(1~8)插入原始資料表的最後一個欄位，以形成分群資料。 In this way, the learning model of the trained stacked sparse autoencoder can be used to cluster the data. The dimensions of the input and output layers of the stacked sparse autoencoder are the number of records times the number of data fields, and the encoder side of a four-layer stacked sparse autoencoder can serve as an unsupervised learning architecture. The output of each layer feeds the input of the next to reduce the data to three dimensions, and the decoder of each stacked sparse autoencoder is then run in reverse order to restore the data to its original dimensionality. In one example, the loss function used by the stacked sparse autoencoder may be MSE, the activation function Sigmoid, and the optimizer Adam. After training completes, clustering takes the output of the encoder's last hidden layer, which maps each record to a point in three-dimensional space; according to the X, Y and Z axes of that space, the data are automatically divided into eight groups, and the corresponding octant value (1~8) is inserted as the last column of the original data table to form the clustered data.
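The octant-labeling step above can be sketched as follows. This is a minimal illustration only: the 1~8 numbering convention (treating a non-negative coordinate as a set bit) and the helper names are assumptions, since the description does not fix which octant maps to which value.

```python
def octant(x: float, y: float, z: float) -> int:
    """Map a 3D point produced by the encoder to an octant value 1..8.

    Each coordinate's sign contributes one bit; the exact numbering
    convention here is an assumption chosen for illustration.
    """
    bits = int(x >= 0) + (int(y >= 0) << 1) + (int(z >= 0) << 2)
    return bits + 1

def append_octant(row: dict, point: tuple) -> dict:
    """Append the octant value as the last field of a data row,
    mirroring the 'insert into the last column' step described above."""
    labeled = dict(row)
    labeled["cluster"] = octant(*point)
    return labeled
```

For example, a record whose encoder output is (0.5, -0.3, 0.1) lands in the octant with x and z non-negative and y negative, and the resulting value is appended as its `cluster` field.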
詳言之，該學習模型具有一編碼端，該編碼端係能夠用以對大數據進行分群。具體而言，該學習模型係由一輸入層(Input Layer)、一隱藏層(Hidden Layer)及一輸出層(Output Layer)所組成，該輸入層與該輸出層各別所具有的神經元(neuron)數量，係等同於該大數據所包含的原始數據的維度。在本實施例中，該學習模型係採用深度學習中的自動編碼器演算法建立而成。舉例而言，當該數個原始數據具有108個維度時，例如：該數個原始數據具有108欄的資料，則該輸入層與該輸出層係分別具有108個神經元。另一方面，在本實施例中，該隱藏層的數量可以設定為七個，且依序分別具有128、64、32、3、32、64及128個神經元，惟不以此為限。在該七個隱藏層中，前四個隱藏層係可以形成該學習模型的編碼端，該編碼端用以將輸入的數個原始數據進行壓縮，並分別產生一三維空間座標。該類神經網路模組1依照該三維空間座標所相對應的八大象限，將該數個原始數據各自的三維空間座標分別轉換為八大象限中所相對應的一象限值，並將該象限值儲存於相對應的原始數據內，以完成該數個原始數據的分群。換言之，該學習模型中的一隱藏層具有三個神經元，以將具有數個維度的原始數據投影至三維空間的八大象限中，並將對應各原始數據的一象限值附加至該原始數據，以形成分群資料。 In detail, the learning model has an encoder side that can be used to cluster big data. Specifically, the learning model consists of an input layer, hidden layers, and an output layer; the number of neurons in the input layer and in the output layer each equals the dimensionality of the raw data contained in the big data. In this embodiment, the learning model is built with the autoencoder algorithm of deep learning. For example, when the raw data have 108 dimensions, e.g. 108 columns of data, the input layer and the output layer each have 108 neurons. In this embodiment, the number of hidden layers may be set to seven, with 128, 64, 32, 3, 32, 64 and 128 neurons in sequence, though not limited thereto. Of the seven hidden layers, the first four form the encoder side of the learning model, which compresses the input raw data and produces a three-dimensional coordinate for each record. According to the eight octants of these three-dimensional coordinates, the neural network module 1 converts each record's coordinate into the corresponding octant value and stores that value in the corresponding record, completing the clustering of the raw data. In other words, one hidden layer of the learning model has three neurons, so that raw data with several dimensions are projected into the eight octants of three-dimensional space and each record is tagged with its octant value, forming the clustered data.
在一實驗中，比較傳統自動編碼器(AE)與本實施例堆疊稀疏自動編碼器(SSAE)各訓練50期(epochs)的結果，由於自動編碼器沒有稀疏性的限制，在進行分群時容易將部分資料映射至同一區域，導致分群不均勻，影響分群的效果。例如，可能產生有兩個資料集被映射至同一區域糾纏在一起，集群之間卻無明顯分界；然而，具有稀疏約束函數的堆疊稀疏自動編碼器可透過使得同一集群各個點變得更加緊密且各集群間的區隔更加明顯，以獲得較佳的分群效果。 In one experiment, a conventional autoencoder (AE) and the stacked sparse autoencoder (SSAE) of this embodiment were each trained for 50 epochs and compared. Because the autoencoder has no sparsity constraint, it tends to map parts of the data to the same region during clustering, producing uneven clusters and degrading the clustering result; for example, two data sets may be mapped to the same region and become entangled, with no clear boundary between clusters. The stacked sparse autoencoder with the sparsity constraint, by contrast, pulls the points of the same cluster closer together and makes the separation between clusters more distinct, achieving a better clustering result.
該索引單元22係基於一分散式搜尋引擎/系統/架構之技術為基礎的應用，特別是運用彈性搜索(Elasticsearch)技術，所述分散式搜尋引擎的技術與應用係本發明所屬技術領域中具有通常知識者可以理解，容不多加贅述。該索引單元22接收該分群單元21所獲得的分群資料，並應用分散式搜尋引擎將該等分群資料進行分散式索引。詳言之，在所述分散式搜尋引擎的架構下，包含一叢集(Cluster)，叢集中包含多個節點，每個節點在啟動時會加入集群並推選一(台)節點成為主節點(Master Node)，透過分散式搜尋將資料加以索引化，然後將對應資料存入分散式檔案系統(HDFS,Hadoop Distributed File System)。換言之，該索引單元22可包含該分散式檔案系統。詳言之，當一份完整資料集需要建立索引時，如果當前節點是主節點便可開始建立索引，否則轉發給主節點進行索引。其中，基於彈性搜索技術的分散式搜尋引擎會先判斷此資料是否曾被索引過；如果沒有，該分散式搜尋引擎便會開始建立索引並將結果寫入Lucene index，並透過hash演算法確保將索引資料均勻分散儲存到指定的primary shard和replica shard中，同時創建一個對應的版本號碼儲存至Translog。若資料曾被索引過，該分散式搜尋引擎會先比對現有的版本號碼，檢查是否出現衝突：如果無衝突即可開始建立索引；若判斷出現衝突則會回傳創建索引失敗的錯誤結果。藉由上述技術架構，經由該分群單元21分群處理後的多組分群資料(集)，可上傳至基於彈性搜索技術的分散式搜尋引擎進行分散式索引，以對應該分群資料中之每一者的象限值產生對應的索引值，並將經分散式搜尋引擎處理後的相關資料(對應的索引值)存入一分散式檔案系統中，以供進行一資料檢索工作。其中，所述資料檢索工作可以是一使用者輸入的對應指令或操作，或是由其他任何方式，例如是該電腦1執行一工作任務中所隱含的工作(較佳因此可自動化產生/輸入對應的資料檢索工作)，或是該電腦1接收來自其他系統的資料檢索工作的對應指令。如此，分群索引模組2整合堆疊稀疏自動編碼器和基於彈性搜尋技術的分散式搜尋引擎以建立SSAE-ES快速索引方式，相較於先前技術(特別是如中華民國專利公告號I696082中所揭示：透過深度自動編碼器(Deep Autoencoder)和Solr檢索技術對資料進行分群的DAE-SOLR技術)，能改善分群效果及工作執行的平均等待時間，進而使查詢效能大幅提升約40%。 The index unit 22 is an application built on distributed search engine/system/architecture technology, in particular Elasticsearch; the technology and application of such distributed search engines can be understood by those of ordinary skill in the art and are not elaborated here. The index unit 22 receives the clustered data obtained by the grouping unit 21 and applies the distributed search engine to build a distributed index over them. In detail, the distributed search engine architecture comprises a cluster containing multiple nodes; on startup, each node joins the cluster, and one node is elected master node. Data are indexed through distributed search and then stored in the Hadoop Distributed File System (HDFS); in other words, the index unit 22 may comprise the distributed file system. When a complete data set needs indexing, the current node starts building the index if it is the master node, and otherwise forwards the request to the master node. The Elasticsearch-based distributed search engine first checks whether the data have been indexed before. If not, it builds the index, writes the result into a Lucene index, uses a hash algorithm to spread the index data evenly across the designated primary and replica shards, and creates a corresponding version number stored in the translog. If the data have been indexed before, the engine first compares the existing version number to check for a conflict: if there is none, indexing proceeds; if a conflict is detected, an index-creation-failure error is returned. With this architecture, the groups of clustered data produced by the grouping unit 21 can be uploaded to the Elasticsearch-based distributed search engine for distributed indexing, so that each record's octant value yields a corresponding index value, and the data processed by the distributed search engine (the corresponding index values) are stored in a distributed file system for a data retrieval job. The data retrieval job may be a corresponding command or operation entered by a user, or may arise in any other way, for example as a job implied by a task the computer 1 executes (preferably so that the corresponding data retrieval job can be generated/input automatically), or as a corresponding command for a data retrieval job that the computer 1 receives from another system. In this way, the clustering index module 2 integrates the stacked sparse autoencoder with the Elasticsearch-based distributed search engine to establish the SSAE-ES fast indexing scheme; compared with the prior art (in particular the DAE-SOLR technique disclosed in ROC Patent Publication No. I696082, which clusters data with a deep autoencoder and Solr retrieval), it improves the clustering result and the average waiting time of job execution, raising query performance by roughly 40%.
排程模組3包含一時間估算模組31及一排序單元32。該時間估算模組31具有基於深度神經網路(DNN,Deep Neural Network)技術的一時間運算模型，以用於獲取一預估工作時間，該預估工作時間係指對一工作(例如是一指令)所需執行總時間的估算。詳言之，在應用深度神經網路的該時間運算模型中，對應的輸入層具有六個維度向量，分別為：資料量大小、資料列數、資料欄數、程式時間複雜度、使用的檢索引擎環境及系統剩餘記憶體大小；對應的輸出層的資料係為一個維度，以用於預測執行資料檢索工作所花費的時間。其中，該時間運算模型所使用的激活函數為ReLU，損失函數為MSE，優化器為Adam，但不限於此。 The scheduling module 3 includes a time estimation module 31 and a sorting unit 32. The time estimation module 31 has a time operation model based on deep neural network (DNN) technology for obtaining an estimated working time, i.e. an estimate of the total execution time a job (for example, a command) requires. In detail, the input layer of this DNN-based time operation model has six dimensions: data size, number of data rows, number of data columns, program time complexity, retrieval engine environment used, and remaining system memory; the output layer has one dimension, used to predict the time the data retrieval job will take. The activation function used by the model is ReLU, the loss function is MSE, and the optimizer is Adam, but the model is not limited thereto.
較佳地，前述程式時間複雜度的計算係以一設計的分析程式執行一時間複雜度演算流程所達成。該時間複雜度演算流程中，先判斷一資料檢索工作之程式碼中是否具有至少一迴圈指令，並透過該至少一迴圈指令中的運算子定義一迴圈的執行次數TA1(若有N個，則為TA1~TAN)；接著判斷該資料檢索工作之程式碼中的函式呼叫部分是否具有至少一重複呼叫的遞迴函式，藉以定義出一呼叫函式的執行次數TB1(若有M個，則為TB1~TBM)；最後考量各迴圈指令及/或各呼叫函式彼此間的關係，以計算出一總執行次數Ts以表示所有執行次數的總和，並以此作為時間複雜度的計算指標。例如，在一迴圈指令A1中包含一迴圈指令A2的關係中、在一迴圈指令A1中包含一呼叫函式B1的關係中、在一呼叫函式B1中包含一呼叫函式B2的關係中，對應的總執行次數Ts則分別為A1 x A2、A1 x B1、B1 x B2。在另一說明範例中，例如各迴圈指令及/或各呼叫函式呈未被其他任一者包含的關係中，則該總執行次數Ts為各分別次數的總和；舉例而言，例如一迴圈指令A1、一迴圈指令A2及一呼叫函式B1係彼此未被其他任一者包含的情形中，該總執行次數Ts則為A1+A2+B1。在又另一說明範例中，在一迴圈指令A1中包含一呼叫函式B1，該呼叫函式B1中又包含一迴圈指令A2，且另包含一呼叫函式B2未包含於該迴圈指令A1中的情形時，該總執行次數Ts為(A1 x B1 x A2)+B2。更佳地，可取該總執行次數Ts中冪次，且忽略其所有係數，並以其中冪次數值為最大者作為該時間複雜度的計算指標。以前述取冪次的方式，對應的冪次數值大小依序例如可為1、logn、n、nlogn、n^2、n^3、2^n、n^n，並分別依序給予例如是0~7的計算指標。其中，透過上述取冪次進行計算時間複雜度的方式，可適度簡化時間複雜度計算，達成計算效率的提升，並同時仍具有反應一程式執行時間長短的效能。 Preferably, the program time complexity above is computed by a purpose-built analysis program executing a time-complexity calculation flow. The flow first determines whether the code of a data retrieval job contains at least one loop instruction and, from the operators of each loop instruction, defines that loop's execution count TA1 (TA1~TAN if there are N of them); it then determines whether the function-call part of the code contains at least one repeatedly called recursive function, thereby defining a function call's execution count TB1 (TB1~TBM if there are M of them); finally it considers the relationships among the loop instructions and/or function calls to compute a total execution count Ts representing the sum of all executions, which serves as the time-complexity index. For example, when a loop instruction A1 contains a loop instruction A2, when a loop instruction A1 contains a function call B1, or when a function call B1 contains a function call B2, the corresponding total execution counts Ts are A1 x A2, A1 x B1 and B1 x B2, respectively. In another illustrative example, when no loop instruction and/or function call is contained in any other, the total execution count Ts is the sum of the individual counts; for instance, when a loop instruction A1, a loop instruction A2 and a function call B1 are mutually disjoint, Ts is A1+A2+B1. In yet another illustrative example, when a loop instruction A1 contains a function call B1, which in turn contains a loop instruction A2, and a further function call B2 lies outside loop A1, the total execution count Ts is (A1 x B1 x A2)+B2. More preferably, only the power terms of Ts are kept, all coefficients are dropped, and the largest power serves as the time-complexity index. Under this scheme, the power terms in increasing order may, for example, be 1, log n, n, n log n, n^2, n^3, 2^n and n^n, assigned indices 0 to 7 respectively. Computing time complexity by taking powers in this way moderately simplifies the calculation and improves computational efficiency, while still reflecting how long a program takes to execute.
藉由該時間估算模組31的該時間運算模型，在有多個資料檢索工作為待處理的一狀態中，可獲取對應的多個預估工作時間；該排序單元32接收該多個預估工作時間，並依據該多個預估工作時間的大小決定對應的多個資料檢索工作的執行順序。較佳地，該排序單元32以預估工作時間為較小者為優先排程。 With the time operation model of the time estimation module 31, when multiple data retrieval jobs are pending, the corresponding estimated working times can be obtained; the sorting unit 32 receives these estimated working times and decides the execution order of the corresponding data retrieval jobs according to their magnitudes. Preferably, the sorting unit 32 schedules jobs with smaller estimated working times first.
在另一範例中，在有多個資料檢索工作為待處理的一狀態中(即當有多個資料檢索工作進入佇列形成待排程狀態時)，該排序單元32先考慮執行各資料檢索工作的優先度為第一過濾條件，然後將各資料檢索工作的預估工作時間視為第二過濾條件，從而決定數個資料檢索工作的執行順序，以提升整體系統在單位時間內的工作處理量。其中，所述優先度可以是各資料檢索工作中一預先定義或依一預定義方式而設定的一數值；較佳地，該優先度可至少區分為一高優先度與一低優先度。在第一過濾條件中，具有高優先度的資料檢索工作將優先於具有低優先度的資料檢索工作被執行。在第二過濾條件中，當多個資料檢索工作中的至少二者具有相同優先度時，則以預估工作時間為較小者為優先排程。其中，若兩資料檢索工作具有相同優先度與相同的預估工作時間時，則依FIFO排程機制進行工作順序的排程。 In another example, when multiple data retrieval jobs are pending (i.e. queued awaiting scheduling), the sorting unit 32 first takes each job's priority as the first filter condition and then treats each job's estimated working time as the second filter condition to decide the execution order, improving the system's overall job throughput per unit time. The priority may be a value predefined in, or set in a predefined manner for, each data retrieval job; preferably it distinguishes at least a high priority and a low priority. Under the first filter condition, high-priority jobs are executed before low-priority jobs. Under the second filter condition, when at least two pending jobs share the same priority, the one with the smaller estimated working time is scheduled first. If two jobs have the same priority and the same estimated working time, their order follows the FIFO scheduling mechanism.
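The two-stage filtering above (priority first, estimated time second, FIFO as the final tie-break) can be expressed as one stable sort; the job representation and field names below are assumptions for illustration only.

```python
def dnnsjf_order(jobs: list) -> list:
    """Order pending retrieval jobs: higher priority first, then shorter
    estimated working time; Python's stable sort preserves the original
    (FIFO) arrival order for jobs that tie on both keys."""
    return sorted(jobs, key=lambda job: (-job["priority"], job["est_time"]))
```

For example, given jobs arriving as a, b, c, d where b and c share the higher priority, c (shortest estimate) runs first, then b, and the equal-priority, equal-estimate pair a and d keep their arrival order.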
如此，該排程模組3結合深度神經網路(DNN)預測每件工作執行所需時間，再依最短工作時間優先(SJF,Shortest-Job-First)進行排程，統稱為DNNSJF排程。相較先前技術例如FIFO排程機制而言，DNNSJF排程能有效將工作執行的平均等待時間減少約3~5%；另，相較先前技術例如MSHEFT(Memory-Sensitive Heterogeneous Early Finish Time)排程機制而言，DNNSJF排程能有效將工作執行的平均等待時間減少約1~3%。其中，基於HEFT(Heterogeneous Early Finish Time)會根據排序後的優先級列表將每項工作分配給合適的CPU，以便盡快完成工作的排程機制，所述MSHEFT機制係包含以下步驟：首先考慮工作的優先級，然後依據工作的資料大小視為第二過濾條件。 In this way, the scheduling module 3 uses a deep neural network (DNN) to predict the execution time of each job and then schedules by Shortest-Job-First (SJF), collectively called DNNSJF scheduling. Compared with prior art such as the FIFO scheduling mechanism, DNNSJF effectively reduces the average waiting time of job execution by about 3~5%; compared with prior art such as the MSHEFT (Memory-Sensitive Heterogeneous Early Finish Time) scheduling mechanism, it reduces the average waiting time by about 1~3%. HEFT (Heterogeneous Early Finish Time) is a scheduling mechanism that assigns each job to a suitable CPU according to a sorted priority list so that jobs finish as soon as possible; the MSHEFT mechanism comprises first considering a job's priority and then treating the job's data size as the second filter condition.
檢索引擎選擇模組4包含支援結構化查詢語言(SQL)指令的Hive、Impala及SparkSQL的引擎/介面/平台，並具有一檢索引擎選擇邏輯，用於根據當前可用記憶體大小來選擇合適的檢索引擎。該檢索引擎選擇邏輯係根據該電腦的剩餘記憶體與全部記憶體間的一可用記憶體百分比值由小至大而分別選擇Hive引擎、Impala引擎及SparkSQL引擎。較佳地，當該可用記憶體百分比值小於25%時，選擇Hive引擎；當該可用記憶體百分比值不小於25%且不大於50%時，選擇Impala引擎；當該可用記憶體百分比值大於50%時，選擇SparkSQL引擎。其中，經本發明進行一臨界數據實測後，上述百分比值的大小係特別適用在該電腦1的該記憶體單元12為20GB的狀態中。惟，前述分界閥值可依硬體效能及實際使用需求進行適當調整，而不以此為限。 The retrieval engine selection module 4 includes the Hive, Impala and SparkSQL engines/interfaces/platforms, all supporting Structured Query Language (SQL) commands, and has retrieval engine selection logic that picks a suitable engine from the currently available memory. The selection logic chooses the Hive, Impala or SparkSQL engine as the percentage of available memory (remaining memory over total memory of the computer) increases. Preferably, the Hive engine is selected when the available-memory percentage is below 25%, the Impala engine when it is between 25% and 50% inclusive, and the SparkSQL engine when it is above 50%. Measurements of critical data carried out for the present invention show these percentage thresholds to be particularly suitable when the memory unit 12 of the computer 1 is 20 GB; however, the thresholds may be adjusted appropriately to the hardware performance and actual usage requirements and are not limited thereto.
據此，根據本發明上述系統，在一範例中，當一資料檢索工作/指令經由該電腦1被輸入時，該資料檢索工作將先經由該排程模組3(依該預估工作時間或同時考量該優先度與該預估工作時間)安排對應的工作執行順序，並於執行該資料檢索工作時，再經由該檢索引擎選擇模組4依當前可用記憶體百分比值的一關係選擇對應的檢索引擎，以自該分群索引模組2中查詢是否有對應的資料。如此，系統能夠依據可用記憶體百分比值的大小選擇合適的檢索引擎，以發揮檢索引擎的最佳效能，係具有提升系統整體效能的功效。 Accordingly, with the above system of the present invention, in one example, when a data retrieval job/command is input via the computer 1, the job is first given an execution order by the scheduling module 3 (according to its estimated working time, or considering both its priority and its estimated working time); when the job executes, the retrieval engine selection module 4 then selects the corresponding retrieval engine according to the current available-memory percentage, so as to query the clustering index module 2 for the corresponding data. In this way, a suitable retrieval engine is chosen from the available-memory percentage so that each engine is used where it performs best, improving overall system performance.
在另一範例中，當一資料檢索工作經由該電腦1被輸入時，該資料檢索工作將先透過該處理單元11自該記憶體單元12查詢是否有對應的資料。若自該記憶體單元12命中該對應資料，該處理單元11由該記憶體單元12讀取該對應資料並寫入到一輸出檔案；若自該記憶體單元12未命中該對應資料，則該處理單元11自該硬碟單元13查詢是否有對應的資料。若自該硬碟單元13命中該對應資料，該處理單元11由該硬碟單元13讀取該對應資料並寫入到一輸出檔案；若自該硬碟單元13未命中該對應資料，則該資料檢索工作會透過該排程模組3安排對應的工作執行順序，並於執行該資料檢索工作時，再經由該檢索引擎選擇模組4選擇對應的檢索引擎，以自該分群索引模組2中查詢是否有對應的資料。如此，藉由記憶體單元12與硬碟單元13構成的兩層快取機制，節省重複檢索所造成的過多硬體資源消耗，具有節省檢索系統資源的功效。 In another example, when a data retrieval job is input via the computer 1, the processing unit 11 first queries the memory unit 12 for the corresponding data. On a hit in the memory unit 12, the processing unit 11 reads the data from the memory unit 12 and writes it to an output file; on a miss, the processing unit 11 queries the hard disk unit 13. On a hit in the hard disk unit 13, the processing unit 11 reads the data from the hard disk unit 13 and writes it to an output file; on a miss, the scheduling module 3 assigns the job its execution order, and when the job executes, the retrieval engine selection module 4 selects the corresponding retrieval engine to query the clustering index module 2 for the corresponding data. The two-level cache formed by the memory unit 12 and the hard disk unit 13 thus avoids the excessive hardware resource consumption of repeated retrieval and saves retrieval system resources.
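The memory-then-disk lookup chain above can be sketched as follows; the two dict-backed caches and the `fetch_distributed` callback are simplified stand-ins for the memory unit 12, the hard disk unit 13, and the scheduled distributed retrieval, respectively.

```python
def retrieve(key, ram_cache: dict, disk_cache: dict, fetch_distributed):
    """Two-level cache: hit in RAM first, then on disk, and only on a
    double miss fall back to the (scheduled) distributed search, so the
    expensive retrieval path is never taken for already-cached data."""
    if key in ram_cache:           # memory unit hit
        return ram_cache[key]
    if key in disk_cache:          # hard disk unit hit
        return disk_cache[key]
    return fetch_distributed(key)  # scheduled distributed retrieval
```

A real implementation would also write the result to the output file and promote fetched entries into the caches; that bookkeeping is omitted to keep the lookup order itself clear.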
請參照第2圖所示，藉由前述系統，本發明大數據檢索方法，係透過一電腦執行以下所有步驟：一稀疏性模型建立步驟S1：透過多個稀疏自動編碼器經逐層訓練後堆疊組成的一堆疊稀疏自動編碼器的神經網路，建立一學習模型。 Referring to FIG. 2, with the aforementioned system, the big data retrieval method of the present invention executes all of the following steps on a computer. A sparsity model building step S1: build a learning model from a stacked sparse autoencoder neural network composed of multiple sparse autoencoders stacked after layer-by-layer training.
一資料分群步驟S2：將具有多個維度的多個原始數據輸入至該學習模型，並透過該學習模型將該數個原始數據投影至三維空間的八大象限中，且將對應各原始數據的一象限值附加至該原始數據，以形成分群資料。 A data clustering step S2: input multiple multi-dimensional raw data records into the learning model, project them through the learning model into the eight octants of three-dimensional space, and append to each record its octant value to form the clustered data.
一資料索引分散化步驟S3：基於彈性搜索技術且應用分散式檢索引擎將分群資料進行分散式索引，以對應該分群資料中之每一者的象限值產生對應的索引值；再將該等對應的索引值存入一分散式檔案系統中，以供一使用者進行一資料檢索工作。 A data index decentralization step S3: build, based on Elasticsearch technology and using a distributed search engine, a distributed index over the clustered data so that each record's octant value yields a corresponding index value, and store these index values in a distributed file system for a user to perform a data retrieval job.
一時間運算模型建立與估算步驟S4：基於深度神經網路預建立一時間運算模型，該時間運算模型係根據該資料檢索工作中的資料量大小、資料列數、資料欄數、程式時間複雜度、使用的檢索引擎環境及系統剩餘記憶體大小，預測執行該資料檢索工作所花費的一預估工作時間。 A time operation model building and estimating step S4: pre-build, based on a deep neural network, a time operation model that predicts an estimated working time for executing the data retrieval job from the data size, number of data rows, number of data columns, program time complexity, retrieval engine environment used, and remaining system memory.
In one example, a job scheduling step S5: when multiple data retrieval jobs are pending, the time computation model provides an estimated working time for each job, and the jobs are executed in ascending order of their estimated working times.
In another example, a job scheduling step S5': when multiple data retrieval jobs are pending, the time computation model provides an estimated working time for each job, and the execution order is determined jointly from each job's priority and its estimated working time.
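Both scheduling variants reduce to a sort key. A sketch, assuming lower priority values are more urgent (the text does not fix how priority and estimated time are combined, so the tuple ordering below is one possible choice):

```python
def schedule_sjf(jobs):
    """Step S5: shortest estimated working time first."""
    return sorted(jobs, key=lambda j: j["est_time"])

def schedule_priority(jobs):
    """Step S5': order by priority first (lower value = more urgent),
    breaking ties by estimated working time."""
    return sorted(jobs, key=lambda j: (j["priority"], j["est_time"]))
```

Shortest-job-first is what lets the system beat Hadoop's default FIFO ordering on average waiting time, since short jobs no longer queue behind long ones.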
A retrieval engine selection step S6: when executing a data retrieval job, a suitable retrieval engine is selected according to the currently available memory. Specifically, the engine is chosen from the percentage of the memory unit 12 of the computer 1 that remains free relative to its total capacity. Preferably, when the available memory percentage is below 25%, the Hive engine is selected; when it is not below 25% and not above 50%, the Impala engine is selected; and when it is above 50%, the SparkSQL engine is selected. More preferably, the memory unit has a capacity of 20 GB.
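The threshold rule of step S6 maps directly onto a small selector function; only the 25% and 50% cut points come from the text, the rest is a sketch:

```python
def select_engine(free_mem_pct):
    """Pick the retrieval engine from the available-memory percentage,
    using the thresholds given for step S6."""
    if not 0 <= free_mem_pct <= 100:
        raise ValueError("percentage must be within [0, 100]")
    if free_mem_pct < 25:
        return "Hive"        # little memory free: disk-oriented engine
    if free_mem_pct <= 50:
        return "Impala"      # 25% to 50% inclusive
    return "SparkSQL"        # ample memory: in-memory engine
```

The intuition matches the engines' designs: Hive spills to disk and tolerates memory pressure, while SparkSQL benefits most when plenty of RAM is free.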
In one example, a data retrieval execution step S7: when the computer 1 receives a data retrieval job, the time computation model first predicts the estimated working time of the job; the execution order is then arranged by the estimated working time alone, or by the priority together with the estimated working time; and when the job is executed, a retrieval engine is selected according to the current available memory percentage to search the distributed file system for matching data.
In another example, a data retrieval execution step S7': when the computer 1 receives a data retrieval job, the memory unit 12 of the computer 1 is queried first for matching data. On a memory hit, the computer 1 reads the matching data from the memory unit 12 and writes it to an output file; on a memory miss, the computer 1 queries the hard disk unit 13. On a disk hit, the computer 1 reads the matching data from the hard disk unit 13 and writes it to an output file; on a disk miss, the time computation model first predicts the estimated working time of the job, the execution order is arranged by the estimated working time alone or by the priority together with the estimated working time, and when the job is executed a retrieval engine is selected according to the current available memory percentage to search the distributed file system for matching data.
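The memory-then-disk-then-engine lookup of step S7' behaves like a three-level cache probe. The cache interfaces and the write policy below are assumptions for illustration:

```python
def retrieve(query, mem_cache, disk_cache, run_engine, output):
    """Step S7' lookup order: memory cache, then disk cache, then the
    retrieval engine. Returns which level served the query; engine
    results are appended to the output (write-back into the caches,
    which the text implies for repeat queries, is not shown)."""
    if query in mem_cache:               # memory hit
        output.append(mem_cache[query])
        return "memory"
    if query in disk_cache:              # disk hit
        output.append(disk_cache[query])
        return "disk"
    result = run_engine(query)           # both caches missed: run the engine
    output.append(result)
    return "engine"
```

A repeated query within a short window is thus served from `mem_cache` or `disk_cache` without re-invoking any retrieval engine, which is the saving the summary below claims.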
In summary, the big data retrieval method and system of the present invention use the clustering index module to improve the clustering effect and build a fast index, narrowing the database search range. The scheduling module obtains estimated working times and gives jobs with shorter working times higher scheduling priority, shortening the average waiting time across all jobs. A retrieval engine suited to the currently available memory is selected to execute the scheduled work, optimizing system execution efficiency. In addition, retrieval results are cached in the memory unit or hard disk unit; when the same data must be retrieved again within a short time, it is found and read directly from the cache without re-invoking a retrieval engine, greatly saving execution time and hardware resources.
Although the present invention has been disclosed by way of the preferred embodiments above, they are not intended to limit it. Changes and modifications made to these embodiments by those skilled in the art without departing from the spirit and scope of the invention remain within its protected technical scope; the protection scope of the invention therefore covers all variations within the literal meaning and equivalents of the appended claims.
S1: sparsity model building step
S2: data clustering step
S3: data index decentralization step
S4: time computation model building and estimation step
S5, S5': job scheduling steps
S6: retrieval engine selection step
S7, S7': data retrieval execution steps
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW111107107A TWI778924B (en) | 2022-02-25 | 2022-02-25 | Method for big data retrieval and system thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
TWI778924B true TWI778924B (en) | 2022-09-21 |
TW202334838A TW202334838A (en) | 2023-09-01 |
Family
ID=84958377
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW111107107A TWI778924B (en) | 2022-02-25 | 2022-02-25 | Method for big data retrieval and system thereof |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI778924B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI846456B (en) * | 2023-05-03 | 2024-06-21 | 國立勤益科技大學 | Data analysis method and data analysis device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162556A (en) * | 2018-02-11 | 2019-08-23 | 陕西爱尚物联科技有限公司 | A kind of effective method for playing data value |
TWI696082B (en) * | 2019-02-22 | 2020-06-11 | 國立高雄大學 | Data-retrieval method and data-retrieval system for a big data |
US20200222010A1 (en) * | 2016-04-22 | 2020-07-16 | Newton Howard | System and method for deep mind analysis |
CN113641669A (en) * | 2021-06-30 | 2021-11-12 | 北京邮电大学 | Multi-dimensional data query method and device based on hybrid engine |
- 2022-02-25: TW application TW111107107A filed, now active as patent TWI778924B
Also Published As
Publication number | Publication date |
---|---|
TW202334838A (en) | 2023-09-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109478144B (en) | Data processing device and method | |
CN111684473B (en) | Improving performance of neural network arrays | |
CN110569289B (en) | Column data processing method, equipment and medium based on big data | |
CN113778646A (en) | Task level scheduling method and device based on execution time prediction | |
CN109472344A (en) | The design method of neural network system | |
TWI778924B (en) | Method for big data retrieval and system thereof | |
Yang et al. | Improving Spark performance with MPTE in heterogeneous environments | |
US11775344B1 (en) | Training task queuing cause analysis method and system, device and medium | |
CN103473368A (en) | Virtual machine real-time migration method and system based on counting rank ordering | |
US11676068B1 (en) | Method, product, and apparatus for a machine learning process leveraging input sparsity on a pixel by pixel basis | |
US20240264802A1 (en) | Vector operation acceleration with convolution computation unit | |
CN118260053A (en) | Memory scheduling method of heterogeneous computing system, heterogeneous computing system and device | |
CN112906865A (en) | Neural network architecture searching method and device, electronic equipment and storage medium | |
US20240143525A1 (en) | Transferring non-contiguous blocks of data using instruction-based direct-memory access (dma) | |
WO2022087785A1 (en) | Retrieval device and retrieval method | |
CN116737373A (en) | Load balancing method, device, computer equipment and storage medium | |
CN115080244A (en) | Cloud platform resource scheduling method and system based on intelligent load prediction | |
US11615320B1 (en) | Method, product, and apparatus for variable precision weight management for neural networks | |
Meng et al. | An Efficient CNN Accelerator Achieving High PE Utilization Using a Dense-/Sparse-Aware Redundancy Reduction Method and Data–Index Decoupling Workflow | |
CN110415162B (en) | Adaptive graph partitioning method facing heterogeneous fusion processor in big data | |
CN115309502A (en) | Container scheduling method and device | |
CN110188804B (en) | Method for searching optimal classification model parameters of support vector machine based on MapReduce framework | |
CN115686865B (en) | Super computing node resource distribution system based on multi-scene application | |
WO2021003034A1 (en) | Systems and methods for accelerating sparse neural network execution | |
CN111382835A (en) | Neural network compression method, electronic device and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
GD4A | Issue of patent certificate for granted invention patent |