TW202311994A

TW202311994A - System and method of malicious domain query behavior detection

Info

Publication number: TW202311994A
Application number: TW110133747A
Authority: TW
Inventors: 陳勝裕; 蔡天浩; 陳彥仲; 施君熹
Original assignee: 中華電信股份有限公司
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2023-03-16
Also published as: TWI777766B

Abstract

A system and a method for detecting malicious domain query behavior are provided. The system includes a transceiver, a storage medium, and a processor. The processor is coupled to the transceiver and the storage medium and is configured to execute a plurality of modules. The plurality of modules include: a domain name filtering module that filters domain name service (DNS) query records in network traffic according to a preset list to obtain non-existent domain (NXDomain) query records; a feature calculation module that computes a similarity of an Internet Protocol (IP) address to be tested based on the NXDomain query records; and a detection module that, in response to the similarity being greater than a threshold value, uses a machine learning model to detect whether the IP address to be tested is a host infected by a malicious program, generates a detection result, and outputs the detection result through the transceiver.

Description

System and method for detecting malicious domain query behavior

The present invention relates to network security technology, and in particular to a system and method for detecting malicious domain query behavior.

With the rise of the Internet, the Domain Name Service (DNS) has become indispensable for Internet access. However, most organizations and users pay little attention to the traffic and content of DNS queries. To keep the communication channel with victim hosts open, cybercriminals use a host as a control center, known as a command and control server (C&C server). This host acts as a relay station that dispatches commands and collects the private information stolen from victim hosts. Because the C&C server plays an important role throughout the attack, cybercriminals use various methods to prolong its lifetime and increase its stealth in order to evade detection.

Domain generation algorithm (DGA) technology remains one of the main techniques hackers use when planning cyberattacks. DGA malware is often a tool in advanced persistent threat (APT) attacks. Besides being hard to detect, DGA malware lets hackers assign different functions to the domain names generated by the DGA in order to carry out malicious activities.

In view of this, the present invention provides a system and method for detecting malicious domain query behavior that can analyze DNS queries to detect malicious domains generated by a DGA.

An embodiment of the present invention provides a system for detecting malicious domain query behavior, including: a transceiver that receives network traffic; a storage medium that stores a plurality of modules; and a processor coupled to the transceiver and the storage medium and configured to execute the plurality of modules. The plurality of modules include: a domain name filtering module that filters domain name service query records in the network traffic according to a preset list to obtain non-existent domain query records; a feature calculation module that calculates a similarity of an IP address to be tested according to the non-existent domain query records; and a detection module that, in response to the similarity being greater than a threshold value, detects through a machine learning model whether the IP address to be tested is a host infected by a malicious program to generate a detection result, and outputs the detection result through the transceiver.

An embodiment of the present invention provides a method for detecting malicious domain query behavior, including: receiving network traffic; filtering domain name service query records in the network traffic according to a preset list to obtain non-existent domain query records; calculating a similarity of an IP address to be tested according to the non-existent domain query records; in response to the similarity being greater than a threshold value, detecting through a machine learning model whether the IP address to be tested is a host infected by a malicious program to generate a detection result; and outputting the detection result.

Based on the above, the system and method for detecting malicious domain query behavior provided by the present invention combine abnormal behavior analysis with artificial intelligence techniques. Behavior analysis is performed on the abnormal queries for non-existent domains that appear among DNS queries, and a machine learning model is used to detect DGA domains within that abnormal behavior. In this way, the malicious relay stations contacted by DGA malware can be found effectively, the harm caused by DGA malware can be reduced, APT intrusions can be blocked, and sensitive data can be protected from theft.

Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. Where the same reference numeral appears in different drawings, it denotes the same or a similar element. These embodiments are only a part of the present invention and do not disclose all of its possible implementations. Rather, they are merely examples of the systems and methods within the scope of the claims of the present invention.

FIG. 1 is a block diagram of a system for detecting malicious domain query behavior according to an embodiment of the present invention. Referring to FIG. 1, the system 10 may include a processor 100, a transceiver 200, and a storage medium 300. The processor 100 is coupled to the transceiver 200 and the storage medium 300. The processor 100 can be configured to execute a plurality of modules stored in the storage medium 300.

The processor 100 is, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose micro control unit (MCU), microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), graphics processing unit (GPU), image signal processor (ISP), image processing unit (IPU), arithmetic logic unit (ALU), complex programmable logic device (CPLD), field programmable gate array (FPGA), another similar component, or a combination of the above. The processor 100 can access and execute the modules and various applications stored in the storage medium 300 to perform the functions of the system 10.

The transceiver 200 can receive network traffic. The transceiver 200 transmits and receives signals in a wireless or wired manner. The transceiver 200 may also perform operations such as low-noise amplification, impedance matching, frequency mixing, up or down frequency conversion, filtering, and amplification.

The storage medium 300 can store a plurality of modules. The storage medium 300 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid state drive (SSD), a similar component, or a combination of the above, and is used to store the modules and applications executable by the processor 100. The plurality of modules may include a domain name filtering module 310, a feature calculation module 320, and a detection module 330.

The domain name filtering module 310 can filter the Domain Name Service (DNS) query records in the network traffic according to a preset list to obtain non-existent domain (NXDomain) query records.

The feature calculation module 320 can calculate the similarity of an Internet Protocol (IP) address to be tested according to the NXDomain query records. The NXDomain query records may contain a plurality of user IP addresses, and the feature calculation module 320 can select the IP address to be tested from among them.

The detection module 330 can, in response to the similarity being greater than a threshold value, detect through a machine learning model whether the IP address to be tested is a host infected by a malicious program to generate a detection result, and output the detection result through the transceiver 200.

FIG. 2 is a flow chart of a method for detecting malicious domain query behavior according to an embodiment of the present invention. Referring to FIG. 2, the method of this embodiment is applicable to the system 10 shown in FIG. 1, and its steps are described below. In step S210, network traffic is received. In step S220, the domain name service query records in the network traffic are filtered according to a preset list to obtain non-existent domain query records. In step S230, the similarity of the IP address to be tested is calculated according to the non-existent domain query records. In step S240, in response to the similarity being greater than a threshold value, a machine learning model is used to detect whether the IP address to be tested is a host infected by a malicious program to generate a detection result. In step S250, the detection result is output.

In an embodiment of the present invention, the preset list includes a whitelist. In response to a domain query to be confirmed in the domain name service query records not being in the whitelist, the domain name filtering module 310 can include that domain query in the non-existent domain query records. In an embodiment of the present invention, the preset list includes a blacklist. In response to a domain query to be confirmed in the domain name service query records not being in the blacklist, the domain name filtering module 310 can include that domain query in the non-existent domain query records.

For example, the whitelist stores DNS queries generated by normal activities. The purpose of the whitelist is to reduce the computational load on the system and to lower the false positive rate. Domains recorded in the whitelist may also produce NXDomain queries, and including that traffic in the computation would generate a large number of false positives. The domain name filtering module 310 filters the collected DNS query records, removing the DNS queries generated by the normal activities recorded in the whitelist and keeping only the NXDomain query records of interest. Specifically, NXDomain queries in network traffic usually come from mistyped domain names and from a few special-purpose services, so NXDomain traffic accounts for less than one-tenth of the overall traffic.

In addition, NXDomain traffic may contain a large number of queries generated by special services. The most common is the blacklist lookup performed by antivirus software; for example, an antivirus blacklist lookup combines the antivirus vendor's domain with the domain to be checked. Another common service is the e-mail server: when hosts use a third-party blacklist provided by the same company, the network traffic contains domain queries that resemble the abnormal queries generated by a DGA, which causes the system to misjudge. Therefore, the domain name filtering module 310 can filter out all domain queries directed at third-party blacklists. After filtering with the preset list, less than one-twentieth of the original network traffic remains to be processed. The system 10 therefore does not need to process the full volume of network traffic, which saves computing resources, improves runtime performance, and also raises overall detection precision.
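The filtering step described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation of module 310: the record layout (dictionaries with `client_ip`, `qname`, and `rcode` keys), the suffix-matching rule, and the function names `matches_any` and `filter_nxdomain` are hypothetical, as is the choice to select NXDomain responses by checking an `rcode` field.

```python
def matches_any(qname, suffixes):
    """Return True if qname equals, or is a subdomain of, any listed suffix.

    Assumption: suffixes are lowercase and have no trailing dot.
    """
    name = qname.lower().rstrip(".")
    return any(name == s or name.endswith("." + s) for s in suffixes)


def filter_nxdomain(records, whitelist, third_party_blacklist):
    """Keep only NXDOMAIN records that neither preset list covers."""
    kept = []
    for rec in records:
        if rec["rcode"] != "NXDOMAIN":
            continue  # resolved queries are not of interest here
        if matches_any(rec["qname"], whitelist):
            continue  # normal activity recorded in the whitelist
        if matches_any(rec["qname"], third_party_blacklist):
            continue  # lookups against third-party blacklist services
        kept.append(rec)
    return kept
```

Matching on domain suffixes rather than exact names is a design choice of this sketch; it lets one list entry cover every lookup sent to a blacklist service's zone.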

In an embodiment of the present invention, a domain name service query record may include a user IP address, a query time, a queried domain name, or a domain resolution result. For example, the user IP address can be an IPv4 address written in decimal digits or an IPv6 address written in hexadecimal digits. The queried domain name can be a string made up of several parts, usually concatenated and separated by dots, and the English letters in it are not case-sensitive. The domain resolution result can be the result returned by the DNS server for the queried domain name.
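A record with these four fields could be represented as in the following sketch. The tab-separated input format, the ISO 8601 timestamp, and the names `DnsQueryRecord` and `parse_record` are assumptions made for illustration; the source does not specify a log format. The standard-library `ipaddress` module accepts both IPv4 and IPv6 text, and the domain name is lowercased because it is case-insensitive.

```python
from dataclasses import dataclass
from datetime import datetime
import ipaddress


@dataclass(frozen=True)
class DnsQueryRecord:
    client_ip: str        # user IP address (IPv4 or IPv6), normalized
    query_time: datetime  # time of the query
    qname: str            # queried domain name, lowercase, no trailing dot
    result: str           # resolution result returned by the DNS server


def parse_record(line):
    """Parse one tab-separated log line (hypothetical format)."""
    ip_text, time_text, qname, result = line.rstrip("\n").split("\t")
    # ip_address() raises ValueError for anything that is not IPv4 or IPv6.
    client_ip = str(ipaddress.ip_address(ip_text))
    return DnsQueryRecord(
        client_ip=client_ip,
        query_time=datetime.fromisoformat(time_text),
        qname=qname.lower().rstrip("."),
        result=result,
    )
```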

In an embodiment of the present invention, the feature calculation module 320 can be configured to perform the following operations to calculate the similarity. The feature calculation module 320 can obtain, from the domain name service query records, a first queried-domain list corresponding to the IP address to be tested and a second queried-domain list corresponding to a reference IP address. It can then obtain the size of the intersection of the first and second queried-domain lists and the size of their union, and divide the intersection size by the union size to obtain the similarity.

For example, the feature calculation module 320 takes all queried domains collected in the DNS query records as input data. It tallies the domains queried by a first user (the IP address to be tested) to produce the first queried-domain list, and tallies the domains queried by a second user (the reference IP address) to produce the second queried-domain list. The feature calculation module 320 can then calculate the proportion of domains queried in common by the IP address to be tested and the reference IP address, that is, the proportion of domains shared by the first and second queried-domain lists.

A given DGA malware sample queries the same list of domains and stops only when it finds a live domain. A high similarity between two users therefore suggests that both are running the same DGA malware, because their queried-domain lists will be nearly identical. Since the abnormal query behavior follows one shared domain list, and normal users rarely query that entire list, the queried-domain lists of infected users are highly similar to one another.

Specifically, the similarity can be calculated by the following formula:

$$S(u_1, u_2) = \frac{|D_{u_1} \cap D_{u_2}|}{|D_{u_1} \cup D_{u_2}|}$$

where $S(u_1, u_2)$ is the similarity, $u$ denotes a user, $D_{u_1}$ is the queried-domain list of the first IP address (the IP address to be tested), $D_{u_2}$ is the queried-domain list of the second IP address (the reference IP address), $|D_{u_1} \cap D_{u_2}|$ is the number of domains in the intersection of $D_{u_1}$ and $D_{u_2}$, and $|D_{u_1} \cup D_{u_2}|$ is the number of domains in the union of $D_{u_1}$ and $D_{u_2}$. The formula computes the similarity between each pair of users rather than analyzing the overall network environment, so it can accurately compute the similarity between users even when applied in different network environments.
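The intersection-over-union ratio described above is the Jaccard index of the two queried-domain sets. A minimal Python sketch follows; the function name `jaccard_similarity` and the convention of returning 0.0 when both lists are empty are assumptions of this example, since the source does not address the empty case.

```python
def jaccard_similarity(domains_a, domains_b):
    """Size of the intersection divided by size of the union of two domain lists."""
    a, b = set(domains_a), set(domains_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty lists are treated as dissimilar
    return len(a & b) / len(union)
```

For two hosts that queried {a, b, c} and {b, c, d}, the intersection has 2 domains and the union has 4, giving a similarity of 0.5.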

For example, when the similarity calculated by the feature calculation module 320 is higher than the threshold value (also referred to as the design water level), the detection module 330 can examine the IP address to be tested and its queried domains. In some embodiments, when the design water level is set between 0.8 and 0.9, the detection module 330 can effectively detect whether the domains queried by the IP address to be tested were generated by a DGA.
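The thresholding step can be sketched as follows. The sketch groups NXDomain queries by client IP and flags every IP whose queried-domain set exceeds the threshold against at least one other IP. Comparing all pairs is an assumption of this example (the source only speaks of an IP address to be tested and a reference IP address), as are the function name `suspicious_ips` and the `(ip, qname)` tuple input; the default of 0.8 is taken from the lower end of the range mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def suspicious_ips(records, threshold=0.8):
    """Return the IPs whose queried-domain set is similar to another IP's.

    records: iterable of (client_ip, qname) pairs from NXDomain query records.
    """
    domains_by_ip = defaultdict(set)
    for ip, qname in records:
        domains_by_ip[ip].add(qname)

    flagged = set()
    # Every IP in the mapping has at least one domain, so the union is never empty.
    for ip_a, ip_b in combinations(sorted(domains_by_ip), 2):
        a, b = domains_by_ip[ip_a], domains_by_ip[ip_b]
        similarity = len(a & b) / len(a | b)
        if similarity > threshold:
            flagged.update((ip_a, ip_b))
    return flagged
```

The pairwise loop is quadratic in the number of distinct IPs, which is one reason the earlier list-based filtering matters: it shrinks the set of hosts and domains that reach this step.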

In an embodiment of the present invention, the machine learning model is a long short-term memory (LSTM) model. In deep learning, the LSTM model is one of the most common variants of the recurrent neural network (RNN). It is well suited to context-dependent data such as that found in speech recognition, language modeling, sentiment analysis, and text prediction, and it offers good accuracy and the ability to handle complex features. In one embodiment, the detection module 330 can use domain lists generated by the various DGA algorithms that can be collected on the Internet as a training data set. In one embodiment, the detection module 330 trains the LSTM model to judge whether a string has a DGA form. For example, the LSTM model can judge whether a domain name contains a string that appears to have been produced by a DGA or by a random number generator; because this judgment is a semantic analysis of the string, the LSTM model is a good fit. When the LSTM model judges that the domain list has a DGA form, the detection module 330 determines that the IP address to be tested is a host infected by a malicious program, generates the detection result, and outputs it through the transceiver 200.
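A character-level sequence model such as an LSTM needs each string converted into a fixed-length sequence of integers before training or inference. The sketch below shows one common way to do that using only the standard library. The alphabet, the padding scheme, and the names `ALPHABET`, `CHAR_TO_ID`, `UNKNOWN_ID`, and `encode_label` are assumptions of this example; the source does not describe the model's input encoding. The default length of 63 is the maximum length of a single DNS label.

```python
import string

# Characters expected in a domain label; id 0 is reserved for padding.
ALPHABET = string.ascii_lowercase + string.digits + "-_"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN_ID = len(ALPHABET) + 1  # any character outside ALPHABET


def encode_label(label, max_len=63):
    """Map a domain label to a list of max_len integer ids, right-padded with 0."""
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in label.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))
```

The resulting integer sequences would typically feed an embedding layer followed by the LSTM in whichever deep learning framework is used.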

It is worth noting that many DGA-generated domains follow a certain pattern or contain a string produced from a specific seed, and that this string appears in the second-level domain. Therefore, in some embodiments, the detection module 330 can train the machine learning model with random strings and second-level domains. When the detection module 330 makes a judgment, only the second-level domain needs to be extracted as the input to the machine learning model to generate the detection result.
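Extracting the second-level domain can be sketched as below. This is a deliberately naive version that takes the label immediately to the left of the last label; the function name `second_level_label` is hypothetical. It is an assumption of this sketch that the top-level domain is a single label, so a name such as `example.co.uk` would yield `co`, and a production system would need a public suffix list to handle such cases.

```python
def second_level_label(qname):
    """Return the label just left of the top-level label (naive extraction)."""
    labels = qname.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return labels[0]  # single-label name: nothing else to return
    return labels[-2]
```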

In an embodiment of the present invention, the detection module 330 can generate multiple sets of random strings. The detection module 330 can extract multiple second-level domains from historical non-existent domain query records. The detection module 330 can then use the random strings and the second-level domains as training data to train the machine learning model.
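Assembling such a training set might look like the following sketch. The string lengths, the alphabet, the fixed seed, and the names `random_strings` and `build_training_samples` are assumptions. The source does not state how the two kinds of samples are labeled for training, so the sketch only tags each sample with its origin (`"random"` or `"historical_sld"`) and leaves the class labels to the reader.

```python
import random
import string


def random_strings(count, min_len=8, max_len=20, seed=0):
    """Generate `count` pseudo-random lowercase alphanumeric strings."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    alphabet = string.ascii_lowercase + string.digits
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(min_len, max_len)))
        for _ in range(count)
    ]


def build_training_samples(historical_slds, random_count, seed=0):
    """Combine random strings and de-duplicated historical second-level domains."""
    samples = [(s, "random") for s in random_strings(random_count, seed=seed)]
    samples += [(s, "historical_sld") for s in sorted(set(historical_slds))]
    random.Random(seed).shuffle(samples)  # mix the two origins
    return samples
```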

In an embodiment of the present invention, the detection module 330 can, in response to the similarity being greater than the threshold value, obtain from the non-existent domain query records a list of queried domains corresponding to the IP address to be tested. The detection module 330 can extract from that list the second-level domains corresponding to the individual queried domains, and input those second-level domains into the machine learning model to generate the detection result.
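The final decision for one host can be sketched as follows, with the trained model abstracted as a scoring function passed in by the caller. Everything specific here is an assumption: the name `detect_host`, the per-domain score threshold, and the rule that a host is flagged when at least half of its second-level domains look DGA-like. The source only says the second-level domains are fed to the model to produce the detection result. `toy_score` is a placeholder so the sketch runs without a trained LSTM; it is not a real detector.

```python
def detect_host(slds, score_fn, score_threshold=0.5, min_fraction=0.5):
    """Flag a host when enough of its second-level domains score as DGA-like.

    slds: second-level domains extracted from the host's NXDomain queries.
    score_fn: callable returning a DGA-likeness score in [0, 1] for a string.
    """
    if not slds:
        return False
    hits = sum(1 for s in slds if score_fn(s) >= score_threshold)
    return hits / len(slds) >= min_fraction


def toy_score(label):
    """Placeholder for the trained model: 1.0 if the label has no vowels."""
    return 0.0 if any(ch in "aeiou" for ch in label.lower()) else 1.0
```

Calling `detect_host(["xkqzvbn", "qwrtpsd", "mail"], toy_score)` flags the host because two of the three labels score as DGA-like under the placeholder.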

Because the domains that cybercriminals generate with a DGA change every day, and the malicious domain lists collected by antivirus vendors are not necessarily updated in real time, it is difficult to catch malicious domain query behavior accurately by looking at domain names alone. The embodiments of the present invention therefore combine analysis of user behavior with a deep learning algorithm to cover domains generated by malicious DGA programs that antivirus software may miss. A DGA does generate a large number of NXDomain queries, but such queries more often turn out to be lookups of a machine's built-in domain list, which leads to false positives. The embodiments of the present invention can further apply artificial intelligence techniques to analyze the composition of domain names, making the resulting alerts more accurate. The embodiments first screen non-human query behavior out of the NXDomain traffic, and then use a machine learning model trained on DGA domains to detect the domains generated by malicious DGA programs. In this way, more accurate malicious domain detection results can be achieved.

In summary, the system and method for detecting malicious domain query behavior provided by the present invention can achieve the following technical effects: (1) Similarity is calculated from user behavior patterns, so abnormal user queries can be distinguished effectively without large amounts of prior data modeling and training. (2) Combining user behavior analysis with analysis of queried domain names effectively avoids misjudgments that arise from analyzing domain names alone, and the abnormal user behavior also serves as supporting evidence for the detection result. (3) Detection targets DNS query behavior and relies on the fact that DNS traffic is generally not encrypted, so packet encryption does not lower the detection rate or prevent detection. (4) Detection uses the connection behavior of groups of users, so different types of malicious DGA domains can be detected without knowing how or when the domain list was generated. (5) Only the traffic that needs analysis is extracted from the large volume of network traffic, which improves system performance and increases detection precision.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.

10: System
100: Processor
200: Transceiver
300: Storage medium
310: Domain name filtering module
320: Feature calculation module
330: Detection module
S210, S220, S230, S240, S250: Steps

FIG. 1 is a block diagram of a system for detecting malicious domain query behavior according to an embodiment of the present invention. FIG. 2 is a flow chart of a method for detecting malicious domain query behavior according to an embodiment of the present invention.


Claims

A system for detecting malicious domain query behavior, comprising: a transceiver that receives network traffic; a storage medium that stores a plurality of modules; and a processor, coupled to the transceiver and the storage medium and configured to execute the plurality of modules, wherein the plurality of modules include: a domain name filtering module that filters domain name service query records in the network traffic according to a preset list to obtain non-existent domain query records; a feature calculation module that calculates a similarity of an IP address to be tested according to the non-existent domain query records; and a detection module that, in response to the similarity being greater than a threshold value, detects through a machine learning model whether the IP address to be tested is a host infected by a malicious program to generate a detection result, and outputs the detection result through the transceiver.

The system according to claim 1, wherein the preset list includes a whitelist, and wherein the domain name filtering module, in response to a domain query to be confirmed in the domain name service query records not being in the whitelist, includes the domain query to be confirmed in the non-existent domain query records.

The system according to claim 1, wherein the preset list includes a blacklist, and wherein the domain name filtering module, in response to a domain query to be confirmed in the domain name service query records not being in the blacklist, includes the domain query to be confirmed in the non-existent domain query records.

The system according to claim 1, wherein the domain name service query records include a user IP address, a query time, a queried domain name, and a domain resolution result.

The system according to claim 1, wherein the feature calculation module is configured to perform: obtaining, from the domain name service query records, a first queried-domain list corresponding to the IP address to be tested and a second queried-domain list corresponding to a reference IP address; obtaining the size of the intersection of the first queried-domain list and the second queried-domain list; obtaining the size of the union of the first queried-domain list and the second queried-domain list; and dividing the size of the intersection by the size of the union to obtain the similarity.

The system according to claim 1, wherein the machine learning model is a long short-term memory model.

The system according to claim 1, wherein the detection module is configured to perform: generating multiple sets of random strings; extracting multiple second-level domains from historical non-existent domain query records; and using the multiple sets of random strings and the multiple second-level domains as training data to train the machine learning model.

The system according to claim 1, wherein the detection module is configured to perform: in response to the similarity being greater than the threshold value, obtaining, from the non-existent domain query records, a list of queried domains corresponding to the IP address to be tested; extracting, from the list of queried domains, a plurality of second-level domains respectively corresponding to a plurality of queried domains; and inputting the plurality of second-level domains into the machine learning model to generate the detection result.

A method for detecting malicious domain query behavior, comprising: receiving network traffic; filtering domain name service query records in the network traffic according to a preset list to obtain non-existent domain query records; calculating a similarity of an IP address to be tested according to the non-existent domain query records; in response to the similarity being greater than a threshold value, detecting through a machine learning model whether the IP address to be tested is a host infected by a malicious program to generate a detection result; and outputting the detection result.