5.1 Experiment Setup
Dataset. We adopt the CERT Insider Threat Dataset [9], the only publicly available dataset for this task, to evaluate insider threat detection. In Section 5.4, we apply our model to detect vandalism on Wikipedia and report evaluation results on the UMD Wikipedia dataset [15]. The CERT dataset consists of five log files that record the computer-based activities of all employees:
logon.csv that records logon and logoff operations of all employees,
email.csv that records all the email operations (send or receive),
http.csv that records all the web browsing operations (visit, download, or upload),
device.csv that records the usage of a thumb drive (connect or disconnect) and
file.csv that records activities (open, write, copy, or delete) on a removable media device. To characterize employees' behavior, instead of the five original coarse-grained activity categories, we apply 18 fine-grained categories to describe employees' operations. For example, based on the occurrence time,
logon can be detailed as
Weekday_Logon_Normal,
Weekday_Logon_After,
Weekend_Logon and
Logoff. We show the 18 fine-grained categories in Table
2.
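As an illustration of this relabeling, a minimal sketch for the logon events follows; the category names come from Table 2, while the concrete working-hours rule (9:00 to 17:00) is our assumption for illustration only.

```python
from datetime import datetime

# Sketch of the fine-grained relabeling for logon events.
# Category names follow Table 2; the 9:00-17:00 working-hours
# cutoff is an assumed rule, not taken from the paper.
def fine_grained_logon(ts: datetime, operation: str) -> str:
    if operation == "Logoff":
        return "Logoff"
    if ts.weekday() >= 5:              # Saturday or Sunday
        return "Weekend_Logon"
    if 9 <= ts.hour < 17:              # assumed working hours
        return "Weekday_Logon_Normal"
    return "Weekday_Logon_After"

print(fine_grained_logon(datetime(2010, 3, 1, 22, 15), "Logon"))
# -> Weekday_Logon_After
```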
The CERT dataset contains 3,995 benign employees and 5 insiders. For each user, it records activities from January 2010 to June 2011. On average, each employee has around 40,000 activities. We preprocess the original dataset by joining all the log files, grouping them by employee, and then sorting each employee's activities by the recorded timestamps. We include all five insiders and randomly select five benign employees in our experiment. We can see from Table 3 that malicious activities take up a very small percentage of all activities: ACM2278 only has 0.0701% (22/31,370) malicious activities; CDE1846 only has 0.3549% (134/37,754); CMP2946 only has 0.3920% (243/61,989); MBG3183 only has 0.0094% (4/42,438); and PLJ1771 only has 0.0858% (18/20,964). It is challenging to design an effective insider threat detection algorithm due to this extremely low percentage of malicious activities.
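A sketch of this preprocessing step with pandas is shown below; the shared column names ("user", "date") are assumptions about the CERT schema.

```python
import pandas as pd

# Sketch: concatenate the five logs, group by employee, and sort each
# employee's activities by timestamp. Column names are assumed.
log_files = ["logon.csv", "email.csv", "http.csv", "device.csv", "file.csv"]

frames = []
for path in log_files:
    df = pd.read_csv(path)
    df["source"] = path.split(".")[0]   # remember the originating log
    frames.append(df[["user", "date", "source"]])

events = pd.concat(frames, ignore_index=True)
events["date"] = pd.to_datetime(events["date"])

# One time-ordered activity sequence per employee.
sequences = {
    user: group.sort_values("date").reset_index(drop=True)
    for user, group in events.groupby("user")
}
```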
In our experiment, we treat each employee's activities as one sequence sorted by occurrence time, which yields ten streams: the five insiders above and five randomly selected normal users. We feed each activity sequence into our proposed DMHP algorithm and dynamically obtain the combined likelihood score of each activity of the corresponding employee. Given an empirical threshold \(\epsilon\), each activity in the stream is judged by its real-time combined likelihood score: it is labeled as an insider activity if the score is less than \(\epsilon\); otherwise, it is labeled as a normal activity.
Hyperparameters. For the
time model, we set the base intensity of Hawkes process
\(\lambda _0=0.1\), the number of RBF Kernels
\(L=7\), the relevant reference time points as 3, 7, 11, 24, 48, 96, 192 hours to capture both short-term and long-term dependence of user activities, and the corresponding bandwidths are 1, 5, 7, 12, 24, 24, 24.
\(Dir(\alpha ^{\prime }_1,\alpha ^{\prime }_2,\dots , \alpha ^{\prime }_L)\) is a symmetric Dirichlet distribution with the concentration parameter value 0.1. For the
type model,
\(Dir(\theta ^{\prime }_1, \theta ^{\prime }_2, \dots , \theta ^{\prime }_D)\) is one symmetric Dirichlet distribution with the concentration parameter value 0.01. We do not specify the exact value of likelihood threshold
\(\epsilon\); instead, we rank each activity based on its likelihood value and report the number of real malicious activities within a certain percentage of the ranked activities. For reproducibility, the source code is available online.
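To make the time-model hyperparameters concrete, the following sketch evaluates a Hawkes intensity with the RBF triggering kernels above; the kernel weights are placeholders, since in DMHP they are inferred rather than fixed.

```python
import numpy as np

# Sketch of the time model's intensity with RBF triggering kernels:
#   lambda(t) = lambda_0 + sum_i sum_l alpha_l * exp(-(t - t_i - r_l)^2 / (2 s_l^2))
# using the reference points and bandwidths stated above.
LAMBDA_0 = 0.1
REFS = np.array([3, 7, 11, 24, 48, 96, 192], dtype=float)       # hours
BANDWIDTHS = np.array([1, 5, 7, 12, 24, 24, 24], dtype=float)   # hours

def intensity(t: float, past_times: np.ndarray, alpha: np.ndarray) -> float:
    if past_times.size == 0:
        return LAMBDA_0
    gaps = t - past_times[:, None]                     # (n_events, 1)
    rbf = np.exp(-((gaps - REFS) ** 2) / (2 * BANDWIDTHS ** 2))
    return LAMBDA_0 + float(np.sum(rbf * alpha))

alpha = np.full(7, 0.1)                                # placeholder weights
print(intensity(30.0, np.array([2.0, 25.0]), alpha))
```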
Baselines. Our DMHP
time&
type uses both type and time information in its modeling. In our evaluation, we compare it with five baselines: DMHP
time, DMHP
type, Isolation Forest [
19], Local Outlier Factor [
14], and ADeMS [
12]. DMHP
time only uses time information as input, and DMHP
type only uses activity type information. Hence, the comparison with these two baselines can be considered an ablation study. Isolation Forest and Local Outlier Factor are two typical unsupervised anomaly detection methods: Isolation Forest detects anomalies based on the concept of isolation without employing distance or density metrics, whereas Local Outlier Factor identifies anomalies by measuring the local deviation of a given data point with respect to its neighbors. ADeMS is a state-of-the-art streaming anomaly detection algorithm: it maintains a small set of orthogonal vectors to capture a prior from historical non-anomalous data points and uses a reconstruction error test to evaluate the abnormality of upcoming data points. In our implementation, we use scikit-learn for Isolation Forest and Local Outlier Factor, and the public GitHub implementation for ADeMS. To build the inputs of Isolation Forest and Local Outlier Factor, we encode each activity as a 43-dimensional vector: one-hot encodings represent the 24 hours of a day as time information and the 18 activity types as type information, and one dimension indicates the time gap since the previous activity. For the baseline ADeMS, the input vector does not include the time gap because, in our experiments, including it degrades performance on the CERT dataset.
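As an illustration of this encoding and of how the scikit-learn baselines consume it, consider the following sketch; the activity data are toy stand-ins, and the ordering of the dimensions is our assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

N_HOURS, N_TYPES = 24, 18

def encode(hour: int, type_id: int, gap_hours: float) -> np.ndarray:
    """43-dim vector: hour one-hot (24) + type one-hot (18) + time gap (1)."""
    v = np.zeros(N_HOURS + N_TYPES + 1)
    v[hour] = 1.0
    v[N_HOURS + type_id] = 1.0
    v[-1] = gap_hours
    return v

# Toy stand-ins for one employee's encoded activities.
X = np.stack([encode(i % N_HOURS, i % N_TYPES, 1.5) for i in range(200)])

iso = IsolationForest(random_state=0).fit(X)
iso_anomaly = -iso.score_samples(X)     # larger value = more anomalous

lof = LocalOutlierFactor()
lof_labels = lof.fit_predict(X)         # -1 marks detected outliers
```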
Training & Testing. The activities in the first 1.5 months are used to train the proposed model and the baselines; we expect the dynamic models to stabilize after training on this period. All remaining activities (16.5 months) are used for evaluation. In all models, we treat all WWW_visit activities as normal because they contribute trivially to insider threat detection.
5.2 Malicious Insider Activity Detection
Figure
1 shows the
receiver operating characteristic (ROC) curves and the corresponding area under the ROC curve (AUC) values of our DMHP models (
time&type,
type,
time), Local Outlier Factor, and Isolation Forest. We can see that our DMHP
time&type achieves the best accuracy with an AUC value of 0.9509, which is significantly higher than
time (AUC 0.5031) and
type (AUC 0.6530). The reasons can be summarized as follows. First, our time model is a self-exciting Hawkes process in which the occurrence likelihood of a new event increases due to recently occurred events (Equation (6)); therefore, if a series of malicious activities occur consecutively within a very short period, our time model tends to assign high likelihoods to these malicious activities. Second, our type model is in general a frequency-based model that tends to assign a high (low) likelihood to an activity type with a high (low) frequency. Notably, in the CERT dataset, some malicious activities are of high-frequency types, e.g., email-in, email-out, or website visit, which makes it difficult for the type model to achieve high accuracy. However, as shown in Figure
1, our
time&type model, which combines the
time and
type models, assigns smaller likelihoods to malicious activities. This demonstrates the advantage of combining activity type and time information in sequence modeling for insider threat detection. In addition, the
time&type model also significantly outperforms Local Outlier Factor (AUC 0.8857), Isolation Forest (AUC 0.8808), and ADeMS (AUC 0.9353).
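For reference, the ROC/AUC numbers here can be computed from per-activity likelihoods as in the following sketch; since a lower likelihood indicates a malicious activity, the negated likelihood serves as the anomaly score (the arrays are toy data).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 0, 1])               # 1 = malicious activity
likelihood = np.array([0.9, 0.8, 0.2, 0.7, 0.4])

# Negate so that higher score = more anomalous.
fpr, tpr, thresholds = roc_curve(y_true, -likelihood)
print("AUC:", roc_auc_score(y_true, -likelihood))   # 1.0 on this toy data
```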
For a comprehensive evaluation, we report in Table
4 the corresponding AUC values from all compared models for each of the five
log categories, i.e.,
logon,
email,
http,
device, and
file. We can observe that
time&type obtains the highest AUC values in all five categories:
logon 0.9592,
email 0.9314,
http 0.9844,
device 0.9535, and
file 0.9585. For baselines, ADeMS performs well on all activity types, while Isolation Forest achieves good AUC values in
email and
device but poor AUC values in
logon,
http and
file. We also observe that DMHP with time information alone does not achieve good performance across all five categories.
We also show in Table
5 a detailed comparison of model performance for each insider. We calculate the number of detected malicious activities in the top
\(15\%\) of activities with the lowest likelihood scores from each model. As shown in Rows 2 and 3, malicious activities take up a very small percentage of each insider's activities. Moreover, those malicious activities are often hidden among normal activities and occur over a long period. Our
time&type achieves the highest recall for all insiders except
CMP2946. Concretely, it captures 22 real malicious activities for
ACM2278 with recall
\(22/22 = 100\%\), all 134 real malicious activities for
CDE1846 with recall
\(134/134 = 100\%\), 91 real malicious activities for
CMP2946 with recall
\(91/94 = 96.80\%\), 4 real malicious activities for
MBG3183 with recall
\(4/4 = 100\%\), 15 real malicious activities for
PLJ1771 with recall
\(15/15 = 100\%\). In general, ADeMS also performs well in malicious activity detection. Isolation Forest achieves its highest recall for
CDE1846. However, it obtains fairly low recall values for other insiders, e.g.,
ACM2278 (
\(36.36\%\)) and
MBG3183 (
\(0\%\)). This is because malicious activities in
ACM2278 and
MBG3183 are heavily associated with
http, and Isolation Forest does not seem to capture the malicious http behavior pattern well, as demonstrated by its comparably lower AUC in Table
4.
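The recall-at-15% values above follow directly from ranking activities by likelihood; a minimal sketch of that computation:

```python
import numpy as np

# Rank activities by likelihood (ascending) and count the true malicious
# activities among the lowest-scored 15%.
def recall_at(likelihood: np.ndarray, y_true: np.ndarray, frac: float = 0.15) -> float:
    k = int(np.ceil(frac * len(likelihood)))
    flagged = np.argsort(likelihood)[:k]    # k lowest-likelihood activities
    return float(y_true[flagged].sum() / max(y_true.sum(), 1))

scores = np.array([0.9, 0.1, 0.8, 0.7, 0.05, 0.6])
labels = np.array([0, 1, 0, 0, 1, 0])
print(recall_at(scores, labels, frac=0.5))  # -> 1.0
```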
5.3 Insider Threat Detection: Case Study
In accordance with Table
5, the high recall values provided by the DMHP time&type model in the top \(15\%\) activities lead us to the hypothesis that DMHP time&type tends to assign low (high) combined likelihood scores to malicious (normal) activities. To further test this hypothesis and provide a clear picture of malicious activity detection, we focus on one insider,
ACM2278, as a case study.
From the scenario design of the CERT dataset, we learn that ACM2278 refers to an insider who did not previously use removable drives or work after hours, but who begins logging in after hours, using a removable drive, and uploading data to wikileaks.org, and who then leaves the organization shortly thereafter. There are 22 malicious activities carried out by ACM2278, which form two concrete instances of an insider behavior pattern: one instance consists of activities 1 to 12 and the other of activities 13 to 22. For reproducibility, we list them in Table 6. In both attack instances, the insider logs on to the company's computer after work hours, connects a removable drive, uploads data, and finishes in the early morning of the next day before working hours.
Table
7 shows the performance of insider threat detection provided by all models for
ACM2278. We report the recall value calculated in the top
\(3\%\) of activities with the lowest combined likelihood scores from each model. In accordance with Table 7, we can see that the DMHP time&type model and ADeMS both perform well, detecting all 22 malicious insider activities in the top \(3\%\) of alerted activities; the type model performs better than the other two baselines with recall \(19/22 = 86.36\%\), while time and Isolation Forest perform poorly.
Empirical Distribution of Occurrence Likelihoods. In Figure
2, the green histograms indicate occurrence likelihoods for normal activities of
ACM2278, while the red histograms indicate occurrence likelihoods for malicious activities.
Figure
2(a) shows that the likelihood scores of malicious activities derived from the
time model mainly lie in two intervals
\([0.5, 0.6]\) and
\([0.7, 0.9]\). Similarly, we can observe that the empirical distribution of normal activities derived from the
time model roughly has two peaks centered at 0.3 and 0.7. In this case, the likelihood scores of malicious and normal activities are interleaved, and it is hard to find a good threshold to separate them clearly, which results in a low AUC for the time model.
In contrast, Figure 2(b) shows that the likelihood scores of malicious activities under the type model range between 0.7 and 0.9, while those of normal activities mainly spread from 0.9 to 1.0. Hence, the type model can generally separate malicious from normal activities, which results in a good recall for
ACM2278 shown in Table
5. From Figure
2(c), we can clearly observe that the combined likelihood scores of malicious activities from the
time&type model concentrate on the far left of the empirical distribution of normal activities. As a result, the
time&type model produces a high AUC value as shown in Figure
1.
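Such empirical distributions can be reproduced in spirit with a simple overlaid-histogram plot; the sketch below uses synthetic Beta-distributed scores purely as stand-ins for the actual model outputs.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins: normal scores skew high, malicious scores skew low.
rng = np.random.default_rng(0)
normal_scores = rng.beta(8, 2, size=1000)
malicious_scores = rng.beta(2, 8, size=22)

plt.hist(normal_scores, bins=30, alpha=0.6, color="green", label="normal")
plt.hist(malicious_scores, bins=30, alpha=0.6, color="red", label="malicious")
plt.xlabel("combined likelihood score")
plt.ylabel("count")
plt.legend()
plt.show()
```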
5.4 Wikipedia Vandal Detection
In this subsection, to further show the advantage of our DMHP model, we conduct a different task, i.e., detecting vandalism on Wikipedia by using the UMDWikipedia dataset [
15]. In Wikipedia, each article page is edited by users, including certified editors and volunteers. Each edit contains various attributes such as page ID, title, time, categories, page type, revert status, and content. Because Wikipedia is based on a crowdsourcing mechanism and applies the freedom-to-edit model (i.e., any user can edit any article), it suffers from a rapid growth of vandalism (deliberate attempts to damage or compromise integrity). Vandals who commit such acts can be considered insiders in the community of Wikipedia contributors. It is worth noting that user edits in Wikipedia are in principle similar to the log-file-recorded user activities in insider threat detection. Hence, we can leverage Wikipedia edit logs to detect the malicious activities conducted by vandals.
In this experiment, we focus on identifying malicious activities on Wikipedia pages. Because DMHP needs a sufficient number of historical activities to capture patterns, we select three Wikipedia pages, “List of metro systems”, “Star Trek Into Darkness”, and “St. Louis Cardinals”, which have a large number of edits in the UMDWikipedia dataset. For each page, we collect all edits from its contributors (including vandals). Each page has more than 300 edits and at least three malicious edits. Specifically, if an edit is reverted by bots or certified editors, we consider it malicious. We define two types of activities based on whether the edit is on a content page or a meta page. We treat the edits of each page (rather than of each contributor) as a sequence so that we have ground truth for the user modes: in this setting, the number of user modes equals the number of contributors, and the edits from the same contributor can be considered activities under one user mode in DMHP. To build the inputs of the baselines, we represent each activity by a vector with 26 dimensions, in which 24 dimensions represent the 24 hours in a day as time information and 2 dimensions represent the activity types.
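A minimal sketch of this edit encoding and the revert-based labeling follows; the function names and the dimension ordering are ours, for illustration.

```python
import numpy as np

# 26-dim edit encoding: 24-dim hour one-hot plus a 2-dim one-hot for
# the edit type (content page vs. meta page).
def encode_edit(hour: int, is_meta_page: bool) -> np.ndarray:
    v = np.zeros(26)
    v[hour] = 1.0
    v[24 + int(is_meta_page)] = 1.0
    return v

# An edit reverted by bots or certified editors is labeled malicious.
def label_edit(reverted_by_bot_or_editor: bool) -> int:
    return 1 if reverted_by_bot_or_editor else 0

print(encode_edit(13, True), label_edit(True))
```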
Experimental Results. Figure
3 shows the ROC curve of malicious edit detection. We can observe that the
time&type model significantly outperforms the baselines in terms of AUC on malicious edit detection. Similar to the results on the CERT dataset, by combining activity type and time information, the time&type model achieves a higher AUC value than the models using activity type or time alone. We also observe that, in contrast with the CERT dataset, the baselines (especially Isolation Forest) perform poorly on the Wikipedia dataset. This is because Isolation Forest is based on attribute-wise analysis and tends to fail when the distribution of anomalous points becomes less discriminative, e.g., when anomalous and non-anomalous points share a similar attribute range or distribution [13]. Compared with the CERT dataset, the activity-type information in the Wikipedia dataset is subtle, indicating only whether or not an edit is on a meta page, which is not sufficient to discriminate whether an activity is malicious. We further select the page “List of metro systems” for our case study. Its edit sequence consists of 396 edits from 6 contributors, among which there are 20 malicious edits. Table
8 shows the performance of malicious edit detection with various
\(\lambda _0\) values. Our
time&type model achieves good performance on detecting malicious edits on the page with AUC = 0.8815 when
\(\lambda _0=0.1\). A further investigation shows that our
time&type model can detect 12 out of 20 malicious edits within the \(10\%\) of total edits that have the lowest combined likelihood scores.
Sensitivity of Hyperparameters. We further study whether the detection performance of our DMHP is sensitive to hyperparameters. From Equation (
16), we can see that the base intensity
\(\lambda _0\) mainly determines whether a new
mode is created for an upcoming activity. In other words,
\(\lambda _0\) affects the number of
modes, which may further affect the detection performance. To evaluate the performance of DMHP with various
\(\lambda _0\) values, we use the Wikipedia page “List of metro systems” as an illustrative example. We can observe from columns 1 and 2 in Table
8 that the number of
modes increases as \(\lambda _0\) increases. This is because, as shown in Equation (16), the larger \(\lambda _0\) is, the higher the probability of creating a new mode. Concretely, the empirical relation between the number of
modes and
\(\lambda _0\) follows the rule: as
\(\lambda _0\) approaches 0, the number of
modes goes to 1, while it goes to the total number of activities in the stream as
\(\lambda _0\) approaches
\(+ \infty\) [
33].
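The following sketch illustrates this behavior, assuming the usual Dirichlet-process-style assignment rule that Equation (16) suggests: an upcoming activity joins an existing mode with probability proportional to that mode's intensity and opens a new mode with probability proportional to \(\lambda _0\). The intensities below are placeholders.

```python
import numpy as np

# Assumed Dirichlet-process-style rule: P(new mode) is proportional to
# lambda_0, against the summed intensities of existing modes.
def new_mode_probability(mode_intensities: np.ndarray, lambda_0: float) -> float:
    return lambda_0 / (lambda_0 + mode_intensities.sum())

intensities = np.array([2.0, 0.5, 1.2])       # placeholder mode intensities
for lam in (1e-5, 0.1, 10.0):
    print(f"lambda_0={lam}: P(new mode)={new_mode_probability(intensities, lam):.4f}")
# As lambda_0 -> 0 the new-mode probability vanishes (one mode);
# as lambda_0 -> +inf it approaches 1 (one mode per activity).
```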
To investigate the relationship between the number of
modes and the detection accuracy, we report in column 3 of Table 8 the changes in AUC values as \(\lambda _0\) varies. We can see that when \(\lambda _0\) ranges from 0.00001 to 0.1, the AUC values do not change much even though the number of predicted hidden users (modes) increases accordingly. This is because, even when \(\lambda _0\) is 0.00001, the number of predicted users is 10, which is already larger than the ground-truth number of contributors (6). In other words, the model distinguishes different editing styles in a fine-grained manner even when the edits are actually from the same user. As a result, the AUC values increase only slightly, from 0.8440 to 0.8815. However, if \(\lambda _0\) keeps increasing, we observe a significant reduction in AUC values. This is because the number of predicted users becomes far larger than the ground truth, beyond a reasonable range; in this case, the model has overfitted. It is therefore important to choose an appropriate value for the sensitive parameter \(\lambda _0\).
To study why the number of predicted hidden users (modes) does not greatly affect the detection performance when
\(\lambda _0\) ranges from 0.00001 to 0.1, we further adopt four clustering metrics,
Adjusted Mutual Information (AMI), V-measure, Homogeneity, and
Normalized Mutual Information (NMI), to evaluate the performance of our
time&type model on the clustering result. Note that for the UMDWikipedia data we have the ground truth about which user (mode) each edit comes from. From Table 8, we can observe that as the predicted number of modes increases, the values of all clustering metrics except Homogeneity decrease. However, the Homogeneity scores are always 1 for different \(\lambda _0\) values, which indicates that each mode contains edits from a single user. This explains why the detection performance in terms of AUC is insensitive to the value of the base intensity
\(\lambda _0\) in the range \([0.00001, 0.1]\).
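All four metrics are standard and available in scikit-learn; the small sketch below, with toy labels, shows the computation and why a clustering finer-grained than the ground truth can still have Homogeneity 1.

```python
from sklearn.metrics import (adjusted_mutual_info_score, homogeneity_score,
                             normalized_mutual_info_score, v_measure_score)

# Toy labels: each predicted mode contains edits from a single
# contributor, so Homogeneity is 1 even though the clustering is
# finer-grained than the ground truth.
truth = [0, 0, 1, 1, 2, 2]   # ground-truth contributor of each edit
modes = [0, 0, 1, 1, 2, 3]   # modes assigned by the model

print("AMI:        ", adjusted_mutual_info_score(truth, modes))
print("V-measure:  ", v_measure_score(truth, modes))
print("Homogeneity:", homogeneity_score(truth, modes))   # -> 1.0
print("NMI:        ", normalized_mutual_info_score(truth, modes))
```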