KR20230166933A

KR20230166933A - System for filtering false positive data based on log-based user's abnormal behavior and feature attribution

Info

Publication number: KR20230166933A
Application number: KR1020230068533A
Authority: KR
Inventors: 최승진; 백형구; 손준혁
Original assignee: 주식회사 인텔리코드
Priority date: 2022-05-31
Filing date: 2023-05-26
Publication date: 2023-12-07

Abstract

The present invention relates to a system for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution. The system comprises: an anomaly detection unit that uses a pre-trained anomaly detection model to determine whether inspection data is positive or negative; and a filtering unit that uses a pre-trained filtering model to classify inspection data judged positive by the anomaly detection unit as false positive or true positive and removes the false positive inspection data.

Description

Log-based user abnormal behavior detection and filtering system for false positive data based on feature contribution {SYSTEM FOR FILTERING FALSE POSITIVE DATA BASED ON LOG-BASED USER'S ABNORMAL BEHAVIOR AND FEATURE ATTRIBUTION}

The present invention relates to a system for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution. More specifically, it relates to a system that detects abnormal user behavior in log data through machine learning or deep learning and, based on feature contribution, removes false positive data efficiently and accurately.

Anomaly detection is a technique for identifying rare items, events, or observations that deviate significantly from the bulk of typical data. The most representative approaches are based on unsupervised learning, in which the training data carries no labels.

Unsupervised anomaly detection assumes that most of the data in an unlabeled data set is normal and detects data points that differ from the normal data. However, because anomalous data points are very rare and heterogeneous, it is difficult to find every anomaly, and on real data the false positive (FP) rate, that is, the rate at which negative (normal) data is judged to be positive (anomalous), tends to be high.

With rule-based anomaly detection, the results can be understood to some extent from the rules themselves. When a complex model is used to improve detection performance, however, its decisions are hard to interpret, and it is difficult to understand why the model judged negative (normal) data to be positive (anomalous).

The present invention is intended to solve the problems described above. Its object is to provide a system for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution, which detects abnormal user behavior from log data and removes false positive data from the detection results efficiently and accurately based on feature contribution.

To solve the above problem, the present invention provides a system for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution, comprising: an anomaly detection unit that uses a pre-trained anomaly detection model to determine whether inspection data is positive or negative; and a filtering unit that uses a pre-trained filtering model to classify inspection data judged positive by the anomaly detection unit as false positive or true positive and removes the false positive inspection data.

Here, the anomaly detection unit may include: a preprocessing unit that preprocesses input data; and an anomaly detection model that is trained for anomaly detection on the training data among the input data and outputs a positive or negative result for the inspection data among the input data.

The input data may include inspection data, which is the data subject to anomaly detection; training data, which is used to train the anomaly detection model; and evaluation data, which is used to train the filtering model.

The data preprocessed by the preprocessing unit may be classified into normal data and anomalous data. The normal data may be split and included in both the training data and the evaluation data, while the anomalous data may be included only in the evaluation data.

The filtering unit may include: a labeling generation unit that inputs the evaluation data to the anomaly detection unit, labels the evaluation data judged positive by the anomaly detection unit as true positive or false positive, and passes the labeled evaluation data to a feature contribution generation unit; the feature contribution generation unit, which generates, for its input data, contribution information describing how much each feature contributes to the positive prediction and generates secondary features by replacing each feature with its contribution information; and a filtering model that is trained on the secondary features generated by the feature contribution generation unit for the evaluation data and, once trained, receives the secondary features generated by the feature contribution generation unit for inspection data judged positive by the anomaly detection unit, classifies that data as false positive or true positive, and removes the false positive inspection data.

The input data supplied to the feature contribution generation unit may be the evaluation data labeled by the labeling generation unit when the filtering model is being trained, and the inspection data judged positive by the anomaly detection unit when anomalies are being detected.

When the filtering model is being trained, the feature contribution generation unit may generate contribution information describing how much each feature of the evaluation data, judged positive by the anomaly detection unit and labeled by the labeling generation unit, contributes to the positive prediction, and may generate secondary features by replacing each feature with its contribution information. When anomalies are being detected, it may generate the same contribution information for each feature of the inspection data judged positive by the anomaly detection unit and generate secondary features in the same way.

For each feature used as the input data passes through the anomaly detection model, the feature contribution generation unit may compute the difference between the model's prediction with the feature included and with it excluded, over all possible subsets of the features, to generate the contribution information for each feature, and may generate secondary features by replacing the existing features with the generated contribution information.

According to the present invention, it is possible to provide a system for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution, which detects abnormal user behavior from log data and removes false positive data from the detection results efficiently and accurately based on feature contribution.

Figure 1 shows the overall configuration of the system (100) for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution according to the present invention.
Figure 2 shows the configuration of the anomaly detection unit (10).
Figure 3 shows the configuration of the filtering unit (20).
Figure 4 shows an example of secondary features generated from contribution information.
Figure 5 shows measurement results evaluating the performance of the system (100) according to the present invention.
Figure 6 compares the ROC curves before and after applying the filtering unit (20).

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Figure 1 shows the overall configuration of the system (100) for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution according to the present invention.

Referring to Figure 1, the system for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution (100, hereinafter simply "system (100)") includes an anomaly detection unit (10) and a filtering unit (20).

The anomaly detection unit (10) uses a pre-trained anomaly detection model (12) to perform anomaly detection on inspection data, that is, to determine whether the data is positive or negative. The filtering unit (20) uses a pre-trained filtering model (23) to classify inspection data judged positive by the anomaly detection unit (10) as false positive or true positive and removes the false positive inspection data.
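The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `detect_and_filter` and the two callables are hypothetical stand-ins for the anomaly detection model (12) and the filtering model (23), and the step that converts features into contribution information is assumed to happen inside the second callable.

```python
from typing import Callable, Iterable, List, Sequence

Record = Sequence[float]


def detect_and_filter(
    records: Iterable[Record],
    is_positive: Callable[[Record], bool],
    is_false_positive: Callable[[Record], bool],
) -> List[Record]:
    """Return the records that remain flagged after both stages."""
    # Stage 1: the anomaly detector keeps only records it judges positive.
    flagged = [r for r in records if is_positive(r)]
    # Stage 2: the filter drops flagged records it classifies as false positive.
    return [r for r in flagged if not is_false_positive(r)]
```

With toy threshold rules standing in for both models, `detect_and_filter([(0.1,), (0.9,), (0.7,)], lambda r: r[0] > 0.5, lambda r: r[0] < 0.8)` keeps only `(0.9,)`.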

Here, positive means that abnormal behavior is likely to have occurred, and negative means that the behavior is judged to be normal.

Figure 2 shows the configuration of the anomaly detection unit (10).

Referring to Figure 2, the anomaly detection unit (10) may include a preprocessing unit (11) and an anomaly detection model (12).

The preprocessing unit (11) is a means for preprocessing input data. Here, the input data is log data.

Preprocessing means, for example, normalization to bring the value ranges of the individual features of the input data into agreement, handling of missing values present in the data, and one-hot encoding of categorical variables.
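The three preprocessing steps named above can be sketched in plain Python. The source does not say which normalization or imputation method is used, so min-max scaling and mean imputation are assumptions chosen for illustration, and all function names are hypothetical.

```python
from typing import Dict, List, Optional


def impute_mean(column: List[Optional[float]]) -> List[float]:
    """Replace missing values (None) with the mean of the observed values."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present) if present else 0.0
    return [mean if v is None else v for v in column]


def min_max_scale(column: List[float]) -> List[float]:
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(column: List[str]) -> Dict[str, List[int]]:
    """Expand a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(column))
    return {c: [1 if v == c else 0 for v in column] for c in categories}
```

Each helper works on a single column, so a log table would be processed column by column before being passed to the anomaly detection model.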

The input data here is log data, for example data about an organization's entities (such as users, hosts, equipment, IP addresses, and applications), and anomaly detection means identifying users who are likely to have performed abnormal actions, such as leaking internal information.

This input data can be divided into inspection data, training data, and evaluation data. Inspection data is the data on which the system (100) performs anomaly detection, training data is the data used to train the anomaly detection model (12), and evaluation data is the data used to train the filtering model (23).

The anomaly detection model (12) is trained for anomaly detection on the training data among the input data and outputs, for the inspection data among the input data, whether the data is anomalous, that is, a positive or negative result.

The anomaly detection model (12) may be a machine learning-based neural network model, and it is a model based on unsupervised learning that is trained on unlabeled training data.

To train the anomaly detection model (12) by unsupervised learning, the following method can be used.

The preprocessed data is first classified into normal data and anomalous data, and the normal data is split, for example at a ratio of about 7:3, between the training data and the evaluation data.

The anomalous data is included only in the evaluation data. The anomaly detection model (12) is then trained on the training data, which contains only normal data.
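The split described above can be sketched as follows. The function name, the shuffling step, and the fixed seed are assumptions added for illustration; the source only states that normal data is divided at roughly 7:3 and that anomalous data goes to the evaluation set alone. Ground-truth flags (0 for normal, 1 for anomalous) are kept on the evaluation records because they are needed later for TP/FP labeling.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_for_training(
    normal: Sequence[T],
    anomalous: Sequence[T],
    train_parts: int = 7,
    total_parts: int = 10,
    seed: int = 0,
) -> Tuple[List[T], List[Tuple[T, int]]]:
    """Split data into (training, evaluation) sets.

    Training data contains normal records only. Evaluation data contains
    the remaining normal records (flag 0) plus every anomalous record (flag 1).
    """
    shuffled = list(normal)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_parts // total_parts
    training = shuffled[:cut]
    evaluation = [(r, 0) for r in shuffled[cut:]] + [(r, 1) for r in anomalous]
    return training, evaluation
```

For 10 normal records and 2 anomalous ones, this yields 7 training records and 5 evaluation records, with both anomalies in the evaluation set.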

The concepts of machine learning and neural network models, and the methods of performing unsupervised learning, are themselves known from the prior art and are not the direct object of the present invention, so a detailed description is omitted here.

Once training is complete, the anomaly detection model (12) can output an anomaly detection result, that is, positive or negative, for the inspection data among the input data.

Next, the filtering unit (20) is described.

Figure 3 shows the configuration of the filtering unit (20).

Referring to Figure 3, the filtering unit (20) includes a labeling generation unit (21), a feature contribution generation unit (22), and a filtering model (23).

The labeling generation unit (21) is a means that inputs the evaluation data to the anomaly detection unit (10) and labels the evaluation data judged positive (anomalous) by the anomaly detection unit (10) as true positive (TP) or false positive (FP).

As described above, the evaluation data input to the labeling generation unit (21) includes only positive (anomalous) data.

Here, true positive (TP) means that the anomaly detection model (12) of the anomaly detection unit (10) predicted as positive (anomalous) data that is actually positive (anomalous), and false positive (FP) means that the anomaly detection model (12) predicted as positive (anomalous) data that is actually negative (normal).
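The labeling step can be sketched as below. The function name and the `(record, truth)` tuple layout are hypothetical; `is_positive` stands in for the trained anomaly detection model (12), and `truth` is assumed to be 1 for actually anomalous records and 0 for normal ones.

```python
from typing import Callable, List, Sequence, Tuple

Record = Sequence[float]


def label_positives(
    evaluation: List[Tuple[Record, int]],
    is_positive: Callable[[Record], bool],
) -> List[Tuple[Record, str]]:
    """Label each evaluation record the detector flags as 'TP' or 'FP'.

    Records the detector judges negative are not labeled, because the
    filter is trained only on positive predictions.
    """
    labeled = []
    for record, truth in evaluation:
        if is_positive(record):
            labeled.append((record, "TP" if truth == 1 else "FP"))
    return labeled
```

The output pairs are what would be handed to the feature contribution generation unit (22) during filter training.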

The labeled evaluation data is passed to the feature contribution generation unit (22).

The feature contribution generation unit (22) generates, for its input data, contribution information describing how much each feature contributes to the positive prediction, and generates secondary features by replacing each feature with its contribution information.

The input data supplied to the feature contribution generation unit (22) is, when the filtering model (23) is being trained, the evaluation data that was judged positive by the anomaly detection unit (10) and labeled by the labeling generation unit (21), and, when anomalies are being detected, the inspection data judged positive by the anomaly detection unit (10).

First, when the filtering model (23) is being trained, the feature contribution generation unit (22) generates contribution information describing how much each feature of the evaluation data, judged positive by the anomaly detection unit (10) and labeled by the labeling generation unit (21), contributes to the positive prediction, and generates secondary features by replacing each feature with its contribution information.

When anomalies are being detected, the feature contribution generation unit (22) generates contribution information describing how much each feature of the inspection data judged positive by the anomaly detection unit (10) contributes to the positive prediction, and generates secondary features by replacing each feature with its contribution information.

Here, contribution information means information indicating the influence of each feature on the positive prediction.

In general, the better a machine learning-based classification model performs, the greater its complexity (flexibility) and the lower its interpretability, that is, the degree to which a person can understand the reasons behind the model's decisions; such a model is effectively a black box. Highly complex models perform well, but their predictions are hard for people to understand, and understanding a prediction is what leads to trust in the model. For this reason, the present invention uses contribution information, one of the techniques of XAI (eXplainable Artificial Intelligence), to interpret the model's predictions.

The contribution information can be generated, for example, as follows.

For each feature used as the input data passes through the anomaly detection model (12), the difference between the model's prediction with the feature included and with it excluded is computed over all possible subsets of the features, which yields the contribution information for each feature.

Once the contribution information for each feature has been generated, secondary features are generated by replacing the original features with the contribution information.

This method is only an example, and any other method capable of evaluating the influence of each feature may of course be used.
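Averaging the with/without prediction difference over all feature subsets is the idea behind the Shapley value, so the example method can be sketched with an exact Shapley computation. The source names neither the weighting scheme nor how an "excluded" feature is represented; the Shapley weights and the replacement of excluded features by a `baseline` value are assumptions made here, and `score` is a hypothetical stand-in for the anomaly detection model's output. The exact computation is exponential in the number of features, so it is practical only for small feature sets.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_contributions(
    score: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Return one contribution per feature of x (the secondary features)."""
    n = len(x)

    def value(included: frozenset) -> float:
        # Features outside the subset are "excluded" by using the baseline.
        masked = [x[i] if i in included else baseline[i] for i in range(n)]
        return score(masked)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (value(s | {i}) - value(s))
        contributions.append(phi)
    return contributions
```

For a linear score such as `2*v[0] + 3*v[1]` with `x = [1.0, 1.0]` and a zero baseline, the contributions are `[2.0, 3.0]`, and they sum to the difference between the score at `x` and the score at the baseline. The returned list replaces the original feature vector as the secondary features.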

Figure 4 shows an example of secondary features generated from contribution information.

Figure 4(A) shows the features the input data originally had, and Figure 4(B) shows the secondary features in which the feature contribution generation unit (22) has replaced each feature with its contribution information.

The generated secondary features are passed to the filtering model (23).

As described above, the filtering model (23) is trained on the secondary features generated by the feature contribution generation unit (22) for the evaluation data. Once trained, it receives the secondary features generated by the feature contribution generation unit (22) for inspection data judged positive by the anomaly detection unit (10), classifies that data as false positive or true positive, and removes the false positive inspection data.

The filtering model (23) may also be a machine learning-based neural network model. It differs from the anomaly detection model (12), however, in that its training is supervised: it is trained on the secondary features, that is, the contribution information that replaces the features of the evaluation data judged positive, for which TP/FP labels exist as described above.

Once training on these secondary features is complete, the filtering model (23) classifies the secondary features of inspection data judged positive by the anomaly detection unit (10) as false positive or true positive and removes the inspection data judged to be false positive, which increases the accuracy of anomaly detection.
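The supervised training and filtering steps can be sketched as follows. The source says the filtering model may be a neural network; a nearest-centroid classifier is swapped in here purely to keep the example short and dependency-free, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def train_centroids(
    secondary: List[Sequence[float]], labels: List[str]
) -> Dict[str, List[float]]:
    """Fit one mean vector per label ('TP' or 'FP') on secondary features."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, lab in zip(secondary, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def classify(centroids: Dict[str, List[float]], vec: Sequence[float]) -> str:
    """Return the label whose centroid is closest to vec."""
    def dist2(c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def remove_false_positives(
    centroids: Dict[str, List[float]],
    flagged: List[str],
    secondary: List[Sequence[float]],
) -> List[str]:
    """Keep only flagged items whose secondary features classify as 'TP'."""
    return [
        item for item, vec in zip(flagged, secondary)
        if classify(centroids, vec) == "TP"
    ]
```

Training uses the labeled evaluation data; at detection time the same `classify` step is applied to the secondary features of each flagged inspection record, and records classified as `"FP"` are dropped.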

Figure 5 shows measurement results evaluating the performance of the system (100) according to the present invention.

In Figure 5, SWaT and KDD are the data sets used to train the system (100). "Before filtering" gives the performance measured without the filtering unit (20), and "after filtering" gives the performance measured with the filtering unit (20) applied.

Entries in bold indicate performance metrics that improved. As shown, the overall performance metrics improved for each of the two data sets.

Because the filtering unit (20) of the system (100) filters false positives (FP) out of the predictions of the anomaly detection unit (10), Recall, the proportion of actual positives (anomalies) that are true positives (TP), barely changes, while Precision, the proportion of predicted positives (anomalies) that are true positives (TP), changes considerably more when the filtering unit (20) is applied.
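The effect on the two metrics can be illustrated with a short calculation. The counts below are invented for illustration and are not taken from Figure 5; the function name is hypothetical.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts before filtering: 80 TP, 40 FP, 20 FN.
before = precision_recall(tp=80, fp=40, fn=20)

# Hypothetical counts after filtering: the filter removes 30 of the 40 FPs
# and wrongly removes 2 TPs, which then count as missed anomalies (FN).
after = precision_recall(tp=78, fp=10, fn=22)
```

With these made-up counts, precision rises from about 0.67 to about 0.89 while recall moves only from 0.80 to 0.78, which matches the pattern the text describes: removing false positives mainly raises precision.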

Figure 6 compares the ROC curves before and after applying the filtering unit (20).

In Figure 6, ML denotes the anomaly detection unit (10) alone, and ML + Filter denotes the anomaly detection unit (10) with the filtering unit (20) applied.

In both plots, the curve moves closer to the upper left corner when the filtering unit (20) is applied. For the same true positive rate (TPR) on the y-axis, the false positive rate (FPR) on the x-axis is smaller with the filtering model applied, which confirms that applying the filtering unit (20) reduces the false positive (FP) rate more effectively.

The present invention has been described above with reference to preferred embodiments. These are illustrative, and it should be noted that all modifications within the scope of equivalents understood from the appended claims and drawings fall within the scope of the present invention.

100...System for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution
10...Anomaly detection unit
11...Preprocessing unit
12...Anomaly detection model
20...Filtering unit
21...Labeling generation unit
22...Feature contribution generation unit
23...Filtering model

Claims

A system for log-based detection of abnormal user behavior and filtering of false positive data based on feature contribution, the system comprising:
an anomaly detection unit that uses a pre-trained anomaly detection model to determine whether inspection data is positive or negative; and
a filtering unit that uses a pre-trained filtering model to classify inspection data judged positive by the anomaly detection unit as false positive or true positive and removes the false positive inspection data.

The system of claim 1,
wherein the anomaly detection unit comprises:
a preprocessing unit that preprocesses input data; and
an anomaly detection model that is trained for anomaly detection on the training data among the input data and outputs a positive or negative result for the inspection data among the input data.

The system of claim 2,
wherein the input data includes inspection data, which is the data subject to anomaly detection; training data, which is used to train the anomaly detection model; and evaluation data, which is used to train the filtering model.

The system of claim 3,
wherein the data preprocessed by the preprocessing unit is classified into normal data and anomalous data, the normal data is split and included in the training data and the evaluation data, and the anomalous data is included only in the evaluation data.

The system of claim 1,
wherein the filtering unit comprises:
a labeling generation unit that inputs evaluation data to the anomaly detection unit, labels the evaluation data judged positive by the anomaly detection unit as true positive or false positive, and passes the labeled evaluation data to a feature contribution generation unit;
the feature contribution generation unit, which generates, for its input data, contribution information describing how much each feature contributes to the positive prediction and generates secondary features by replacing each feature with its contribution information; and
a filtering model that is trained on the secondary features generated by the feature contribution generation unit for the evaluation data and, once trained, receives the secondary features generated by the feature contribution generation unit for inspection data judged positive by the anomaly detection unit, classifies that data as false positive or true positive, and removes the false positive inspection data.

The system of claim 5,
wherein the input data supplied to the feature contribution generation unit is the evaluation data labeled by the labeling generation unit when the filtering model is being trained, and is the inspection data judged positive by the anomaly detection unit when anomalies are being detected.

The system of claim 6,
wherein the feature contribution generation unit,
when the filtering model is being trained, generates contribution information describing how much each feature of the evaluation data, judged positive by the anomaly detection unit and labeled by the labeling generation unit, contributes to the positive prediction, and generates secondary features by replacing each feature with its contribution information; and, when anomalies are being detected, generates contribution information describing how much each feature of the inspection data judged positive by the anomaly detection unit contributes to the positive prediction, and generates secondary features by replacing each feature with its contribution information.

The system of claim 7,
wherein the feature contribution generation unit, for each feature used as the input data passes through the anomaly detection model, computes the difference between the model's prediction with the feature included and with it excluded over all possible subsets of the features, thereby generating contribution information for each feature, and generates secondary features by replacing the existing features with the generated contribution information.