An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques
Figure 1. An overview of the system.
Figure 2. Sample result of applying the confusion matrix (CM) to evaluate insider data leakage detection with the NB classifier.
Figure 3. The metadata of the extracted features.
Figure 4. The average AUC-ROC curve values generated by applying label encoding and standard scaling: (a) DT algorithm; (b) RF algorithm; (c) KNN algorithm; (d) KSVM algorithm.
Figure 5. The average AUC-ROC curve values generated by applying the one-hot encoding method: (a) LR algorithm; (b) DT algorithm; (c) RF algorithm; (d) KNN algorithm; (e) KSVM algorithm.
Figure 6. The average AUC-ROC values of applying the SMOTE technique in the detection process: (a) LR algorithm; (b) DT algorithm; (c) RF algorithm; (d) NB algorithm; (e) KNN algorithm.
Figure 7. The AUC-ROC curve results of the applied methods (label encoding, one-hot encoding and SMOTE) on different ML algorithms.
Abstract
1. Introduction
1.1. Insider Threats and Their Consequences
1.2. Strategies for Insider Threat Detection
- It proposes a unified model to detect data leakage that a malicious insider could carry out during the sensitive period before leaving an organization.
- It addresses the bias that an inappropriate encoding process can introduce into insider threat detection by using the one-hot encoding method.
- It handles the class imbalance problem of the dataset by applying the SMOTE technique.
- It implements well-known ML algorithms (LR, DT, RF, NB, KNN and KSVM) for detecting data leakage acts, and compares their performance to identify the best-performing detection model.
- It evaluates the ML algorithms with metrics (precision, recall, F-measure and AUC-ROC value) that account for the bias of the encoding process and the imbalanced classes of the CERT dataset.
2. Related Work
3. Methodology
3.1. Data Collection
3.2. Pre-Processing
3.3. Feature Extraction
3.4. Encoding
3.5. Classification
3.6. Performance Metrics
4. Experimental Evaluation
4.1. Label Encoding and Feature Scaling
4.2. One-Hot Encoding
4.3. Synthetic Minority Oversampling Technique (SMOTE)
- Select a data point from the minority class as an input.
- Find its k nearest neighbors, where k is passed as an argument to the SMOTE() function.
- Select one of these neighbors and place a synthetic data point anywhere on the line segment joining the data point and the selected neighbor.
- Repeat the steps until the dataset is balanced.
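
The four steps above can be sketched in plain Python. This is a minimal illustration and not the implementation used in the paper: the function names `smote` and `balance`, the default `k=5`, the Euclidean distance, and the uniform random choice of neighbor and interpolation position are all assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # a point has at most len - 1 neighbors
    synthetic = []
    for _ in range(n_new):
        # Step 1: select a data point from the minority class.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Step 2: find its k nearest minority neighbors (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbors = others[:k]
        # Step 3: place a new point on the segment between x and one neighbor.
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def balance(majority, minority, k=5, seed=0):
    """Step 4: oversample the minority class until both classes have equal size."""
    deficit = max(0, len(majority) - len(minority))
    return minority + smote(minority, deficit, k=k, seed=seed)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing rows.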
5. Discussion and Comparison
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Lee, C.; Iesiev, A.; Usher, M.; Harz, D.; McMillen, D. IBM X-Force Threat Intelligence Index. 2020. Available online: https://www.ibm.com/security/data-breach/threat-intelligence (accessed on 7 February 2021).
- Claycomb, W.R.; Nicoll, A. Insider Threats to Cloud Computing: Directions for New Research Challenges. In Proceedings of the 2012 IEEE 36th Annual Computer Software and Applications Conference, Institute of Electrical and Electronics Engineers, Izmir, Turkey, 16–20 July 2012; pp. 387–394.
- Hunker, J.; Probst, C.W. Insiders and insider threats: an overview of definitions and mitigation techniques. J. Wirel. Mob. Netw. Ubiquitous Comput. Dependable Appl. 2011, 2, 4–27.
- Silowash, G.; Cappelli, D.; Moore, A.; Trzeciak, R.; Shimeall, T.J.; Flynn, L. Common Sense Guide to Mitigating Insider Threats, 4th ed.; Software Engineering Institute: Pittsburgh, PA, USA, 2012; Available online: https://apps.dtic.mil/sti/pdfs/ADA585500.pdf (accessed on 20 September 2021).
- Sarkar, K.R. Assessing insider threats to information security using technical, behavioural and organisational measures. Inf. Secur. Tech. Rep. 2010, 15, 112–133.
- Erdin, E.; Aksu, H.; Uluagac, S.; Vai, M.; Akkaya, K. OS Independent and Hardware-Assisted Insider Threat Detection and Prevention Framework. In Proceedings of the 2018 IEEE Military Communications Conference (MILCOM2018), Los Angeles, CA, USA, 29–31 October 2018; pp. 926–932.
- Almehmadi, A. Micromovement Behavior as an Intention Detection Measurement for Preventing Insider Threats. IEEE Access 2018, 6, 40626–40637.
- Kim, J.; Park, M.; Cho, S.; Kang, P. Insider Threat Detection Based on User Behavior Modeling and Anomaly Detection Algorithms. Appl. Sci. 2019, 9, 4018.
- Theoharidou, M.; Kokolakis, S.; Karyda, M.; Kiountouzis, E. The insider threat to information systems and the effectiveness of ISO17799. Comput. Secur. 2005, 24, 472–484.
- Wong, W.K.; Moore, A.; Cooper, G.; Wagner, M. Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks. 2002. Available online: https://www.aaai.org/Papers/AAAI/2002/AAAI02-034.pdf (accessed on 20 September 2021).
- Cappelli, D.M.; Moore, A.P.; Trzeciak, R.F. The CERT Guide to Insider Threats: How to Prevent, Detect, and Respond to Information Technology Crimes (Theft, Sabotage, Fraud); Addison-Wesley: Boston, MA, USA, 2012.
- Eldardiry, H.; Sricharan, K.; Liu, J.; Hanley, J.; Price, B.; Brdiczka, O.; Bart, E. Multi-source fusion for anomaly detection: Using across-domain and across-time peer-group consistency checks. J. Wirel. Mob. Netw. Ubiquitous Comput. Dependable Appl. 2014, 5, 39–58.
- Eberle, W.; Graves, J.; Holder, L. Insider threat detection using a graph-based approach. J. Appl. Secur. Res. 2010, 6, 32–81.
- Mayhew, M.; Atighetchi, M.; Adler, A.; Greenstadt, R. Use of machine learning in big data analytics for insider threat detection. In Proceedings of the MILCOM 2015 - 2015 IEEE Military Communications Conference, Tampa, FL, USA, 26–28 October 2015; pp. 915–922.
- Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2020.
- Silowash, L.F.G.; Cappelli, D.; Moore, A.P.; Trzeciak, R.F.; Shimeall, T.J. Common Sense Guide to Mitigating Insider Threats, Technical Report CMU/SEI-2012-TR-012, 4th ed.; Software Engineering Institute, Carnegie Mellon University: Pittsburgh, PA, USA, 2012; Available online: http://resources.sei.cmu.edu/library/asset-view.cfm?AssetID=34017 (accessed on 20 September 2021).
- Liu, L.; De Vel, O.; Han, Q.-L.; Zhang, J.; Xiang, Y. Detecting and Preventing Cyber Insider Threats: A Survey. IEEE Commun. Surv. Tutorials 2018, 20, 1397–1417.
- Homoliak, I.; Toffalini, F.; Guarnizo, J.; Elovici, Y.; Ochoa, M. Insight into Insiders and IT: A Survey of Insider Threat Taxonomies, Analysis, Modeling, and Countermeasures. ACM Comput. Surv. 2018, 52, 30.
- Alsowail, R.A.; Al-Shehari, T. Empirical Detection Techniques of Insider Threat Incidents. IEEE Access 2020, 8, 78385–78402.
- Yuan, S.; Wu, X. Deep learning for insider threat detection: Review, challenges and opportunities. Comput. Secur. 2021, 104, 102221.
- Kim, A.; Oh, J.; Ryu, J.; Lee, K. A Review of Insider Threat Detection Approaches with IoT Perspective. IEEE Access 2020, 8, 78847–78867.
- Al-Mhiqani, M.; Ahmad, R.; Abidin, Z.; Yassin, W.; Hassan, A.; Abdulkareem, K.; Ali, N.; Yunos, Z. A Review of Insider Threat Detection: Classification, Machine Learning Techniques, Datasets, Open Challenges, and Recommendations. Appl. Sci. 2020, 10, 5208.
- Buczak, A.L.; Guven, E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176.
- Bhuyan, M.H.; Bhattacharyya, D.K.; Kalita, J.K. Network Anomaly Detection: Methods, Systems and Tools. IEEE Commun. Surv. Tutor. 2013, 16, 303–336.
- Al-Shehari, T.; Shahzad, F. Improving Operating System Fingerprinting using Machine Learning Techniques. Int. J. Comput. Theory Eng. 2014, 6, 57–62.
- Al-Shehari, T.; Zhioua, S. An empirical study of web browsers’ resistance to traffic analysis and website fingerprinting attacks. Clust. Comput. 2018, 21, 1917–1931.
- Eberle, W.; Holder, L.; Cook, D. Identifying Threats Using Graph-based Anomaly Detection. In Machine Learning in Cyber Trust; Springer: Berlin/Heidelberg, Germany, 2009; pp. 73–108.
- Caputo, D.; Maloof, M.; Stephens, G. Detecting Insider Theft of Trade Secrets. IEEE Secur. Priv. Mag. 2009, 7, 14–21.
- Parveen, P.; Thuraisingham, B. Unsupervised incremental sequence learning for insider threat detection. In Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, Washington, DC, USA, 11–14 June 2012; pp. 141–143.
- Senator, T.E.; Goldberg, H.G.; Memory, A.; Young, W.T.; Rees, B.; Pierce, R.; Huang, D.; Reardon, M.; Bader, D.; Chow, E.; et al. Detecting insider threats in a real corporate database of computer usage activity. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; ACM Press: New York, NY, USA, 2013; Volume Part F1288, pp. 1393–1401.
- Rashid, T.; Agrafiotis, I.; Nurse, J.R. A New Take on Detecting Insider Threats. In Proceedings of the 8th ACM CCS International Workshop on Managing Insider Security Threats, Vienna, Austria, 28 October 2016; ACM Press: New York, NY, USA, 2016; pp. 47–56.
- Thompson, H.; Stolfo, S.J.; Keromytis, A.D.; Hershkop, S. Anomaly Detection at Multiple Scales (ADAMS); Defense Technical Information Center (DTIC): Fort Belvoir, VA, USA, 2011.
- Eldardiry, H.; Bart, E.; Liu, J.; Hanley, J.; Price, B.; Brdiczka, O. Multi-domain information fusion for insider threat detection. In Proceedings of the 2013 IEEE Security and Privacy Workshops, San Francisco, CA, USA, 23–24 May 2013; pp. 45–51.
- Gavai, G.; Sricharan, K.; Gunning, D.; Hanley, J.; Singhal, M.; Rolleston, R. Supervised and unsupervised methods to detect insider threat from enterprise social and online activity data. In Proceedings of the 7th ACM CCS International Workshop on Managing Insider Security Threats (MIST ’15), Dallas, TX, USA, 30 October 2017; ACM Press: New York, NY, USA, 2015; pp. 13–20.
- Goldberg, H.; Young, W.; Reardon, M.; Phillips, B.; Senator, T. Insider Threat Detection in PRODIGAL. Available online: https://aisel.aisnet.org/hicss-50/eg/insider_threat/3/ (accessed on 20 September 2021).
- Ben Salem, M.; Stolfo, S.J. Modeling User Search Behavior for Masquerade Detection. In Programming Languages and Systems; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6961, pp. 181–200.
- Toffalini, F.; Homoliak, I.; Harilal, A.; Binder, A.; Ochoa, M. Detection of Masqueraders Based on Graph Partitioning of File System Access Events. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 217–227.
- Alsowail, R.; Al-Shehari, T. A Multi-Tiered Framework for Insider Threat Prevention. Electronics 2021, 10, 1005.
- Georgiadou, A.; Mouzakitis, S.; Askounis, D. Detecting Insider Threat via a Cyber-Security Culture Framework. J. Comput. Inf. Syst. 2021, 1–11.
- Alhajjar, E.; Bradley, T. Survival analysis for insider threat. Comput. Math. Organ. Theory 2021, 1–17.
- Denney, K.; Babun, L.; Uluagac, A.S. USB-Watch: A Generalized Hardware-Assisted Insider Threat Detection Framework. J. Hardw. Syst. Secur. 2020, 4, 136–149.
- Tuor, A.; Kaplan, S.; Hutchinson, B.; Nichols, N.; Robinson, S. Deep learning for unsupervised insider threat detection in structured cybersecurity data streams. In Proceedings of the Artificial Intelligence for Cyber Security Workshop (AAAI-2017), San Francisco, CA, USA, 4–5 February 2017; Volume WS-17-01.
- Bose, B.; Avasarala, B.; Tirthapura, S.; Chung, Y.-Y.; Steiner, D. Detecting Insider Threats Using RADISH: A System for Real-Time Anomaly Detection in Heterogeneous Data Streams. IEEE Syst. J. 2017, 11, 471–482.
- Le, D.C.; Khanchi, S.; Zincir-Heywood, A.N.; Heywood, M.I.; Le, D.C. Benchmarking evolutionary computation approaches to insider threat detection. In Proceedings of the Genetic and Evolutionary Computation Conference, Kyoto, Japan, 15–19 July 2018; pp. 1286–1293.
- Le, D.C.; Zincir-Heywood, N.; Heywood, M.I. Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning. IEEE Trans. Netw. Serv. Manag. 2020, 17, 30–44.
- Tian, Z.; Shi, W.; Tan, Z.; Qiu, J.; Sun, Y.; Jiang, F.; Liu, Y. Deep Learning and Dempster-Shafer Theory Based Insider Threat Detection. Mob. Netw. Appl. 2020, 1–10.
- Sav, U.; Magar, G. Insider Threat Detection Based on Anomalous Behavior of User for Cybersecurity. In Inventive Computation and Information Technologies; Springer: Berlin/Heidelberg, Germany, 2020; pp. 17–28.
- Wasko, S.; Rhodes, R.E.; Goforth, M.; Bos, N.; Cowley, H.P.; Matthews, G.; Leung, A.; Iyengar, S.; Kopecky, J. Using alternate reality games to find a needle in a haystack: An approach for testing insider threat detection methods. Comput. Secur. 2021, 107, 102314.
- CERT. Insider Threat Test Dataset; Software Engineering Institute, Carnegie Mellon University: Pittsburgh, PA, USA, 2020; Available online: https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099 (accessed on 14 September 2021).
- Glasser, J.; Lindauer, B. Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data. In Proceedings of the 2013 IEEE Security and Privacy Workshops, San Francisco, CA, USA, 23–24 May 2013; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2013; pp. 98–104.
- El Affendi, M.A.; Al Rajhi, K.H.S. Text encoding for deep learning neural networks: A reversible base 64 (Tetrasexagesimal) Integer Transformation (RIT64) alternative to one hot encoding with applications to Arabic morphology. In Proceedings of the 2018 Sixth International Conference on Digital Information, Networking and Wireless Communications (DINWC), Beirut, Lebanon, 25–27 April 2018; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2018; pp. 70–74.
- Su, L.J.; Wu, S.X.; Cao, D. Windows-Based Analysis for HFS+ File System. Adv. Mater. Res. 2011, 179–180, 538–543.
- Lorena, A.C.; Jacintho, L.F.; de Siqueira, M.F.; De Giovanni, R.; Lohmann, L.; de Carvalho, A.; Yamamoto, M. Comparing machine learning classifiers in potential distribution modelling. Expert Syst. Appl. 2011, 38, 5268–5275.
- Apostolakis, J. An Introduction to Data Mining. In Data Mining in Crystallography; Structure and Bonding book series; Springer: Berlin/Heidelberg, Germany, 2009; Volume 134, pp. 1–35.
- Cutler, A.; Cutler, D.R.; Stevens, J.R. Random forests. In Ensemble Machine Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 157–175.
- Korb, K.B.; Nicholson, A.E. Bayesian Network Classifiers. In Bayesian Artificial Intelligence; CRC Press: Boca Raton, FL, USA, 2010; pp. 233–258.
- Domingos, P.; Pazzani, M. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Mach. Learn. 1997, 29, 103–130.
- Ruppert, D. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. J. Am. Stat. Assoc. 2004, 99, 567.
- Hussain, M.; Wajid, S.K.; Elzaart, A.; Berbar, M. A Comparison of SVM Kernel Functions for Breast Cancer Detection. In Proceedings of the 2011 Eighth International Conference Computer Graphics, Imaging and Visualization, Singapore, 17–19 August 2011; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2011; pp. 145–150.
- Patle, A.; Chouhan, D.S. SVM kernel functions for classification. In Proceedings of the 2013 International Conference on Advances in Technology and Engineering (ICATE), Mumbai, India, 23–25 January 2013; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2013; pp. 1–9.
- Moreno, P.J.; Ho, P.P.; Vasconcelos, N. A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications. 2004. Available online: https://www.hpl.hp.com/techreports/2004/HPL-2004-4.pdf (accessed on 20 September 2021).
- Salzberg, S. Book Review-C4.5: Programs for machine learning. Mach. Learn. 1993, 240, 302.
- Le, D.C.; Zincir-Heywood, A.N. Machine learning based insider threat modelling and detection. In Proceedings of the 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM 2019), Arlington, VA, USA, 8–12 April 2019; pp. 1–6.
- Kubat, M. An Introduction to Machine Learning; Springer: Berlin/Heidelberg, Germany, 2017.
- García, V.; Sánchez, J.; Mollineda, R. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl.-Based Syst. 2012, 25, 13–21.
- Chawla, N.V. Data Mining for Imbalanced Datasets: An Overview. Data Min. Knowl. Discov. Handb. 2009, 30, 875–886.
- Géron, A. Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019.
- Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
- Abadi, M. TensorFlow: Learning functions at scale. In Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming, Nara, Japan, 18–24 September 2016; ACM Press: New York, NY, USA, 2016; p. 1.
- Farahnakian, F.; Heikkonen, J. A deep auto-encoder based approach for intrusion detection system. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon-si, Korea, 11–14 February 2018; IEEE: New York, NY, USA, 2018; p. 1.
- Rodríguez, P.; Bautista, M.A.; Gonzàlez, J.; Escalera, S. Beyond one-hot encoding: Lower dimensional target embedding. Image Vis. Comput. 2018, 75, 21–31.
- Barua, S.; Islam, M.; Murase, K. A Novel Synthetic Minority Oversampling Technique for Imbalanced Data Set Learning. Program. Lang. Syst. 2011, 7063, 735–744.
- Al-Mhiqani, M.N.; Ahmed, R.; Zainal, Z.; Isnin, S. An Integrated Imbalanced Learning and Deep Neural Network Model for Insider Threat Detection. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 573–577.
- Gamachchi, A.; Boztas, S. Insider Threat Detection Through Attributed Graph Clustering. In Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, NSW, Australia, 1–4 August 2017; pp. 112–119.
- Hall, A.J.; Pitropakis, N.; Buchanan, W.J.; Moradpoor, N. Predicting malicious insider threat scenarios using organizational data and a heterogeneous stack-classifier. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 5034–5039.
- Le, D.C.; Zincir-Heywood, N. Anomaly Detection for Insider Threats Using Unsupervised Ensembles. IEEE Trans. Netw. Serv. Manag. 2021, 18, 1152–1164.
- Sharma, B.; Pokharel, P.; Joshi, B. User Behavior Analytics for Anomaly Detection Using LSTM Autoencoder - Insider Threat Detection. In Proceedings of the 11th International Conference on Advances in Information Technology, Bangkok, Thailand, 1–3 July 2020; Association for Computing Machinery (ACM): New York, NY, USA, 2020; pp. 1–9.
- Singh, M.; Mehtre, B.M.; Sangeetha, S. Insider Threat Detection Based on User Behaviour Analysis. In Machine Learning, Image Processing, Network Security and Data Sciences; Communications in Computer and Information Science; Springer: Berlin/Heidelberg, Germany, 2020; pp. 559–574.
- Wang, J.; Cai, L.; Yu, A.; Meng, D. Embedding Learning with Heterogeneous Event Sequence for Insider Threat Detection. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; pp. 947–954.
- Ye, X.; Han, M.-M. An Improved Feature Extraction Algorithm for Insider Threat Using Hidden Markov Model on User Behavior Detection. Available online: https://www.emerald.com/insight/content/doi/10.1108/ICS-12-2019-0142/full/html (accessed on 20 September 2021).
- Yuan, F.; Shang, Y.; Liu, Y.; Cao, Y.; Tan, J. Attention-Based LSTM for Insider Threat Detection. In Applications and Techniques in Information Security; Communications in Computer and Information Science; Springer: Berlin/Heidelberg, Germany, 2019; pp. 192–201.
- Yuan, F.; Shang, Y.; Liu, Y.; Cao, Y.; Tan, J. Data Augmentation for Insider Threat Detection with GAN. In Proceedings of the 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2020; pp. 632–638.
File | Description |
---|---|
Logon | This file records the logon/logoff activity of insiders on an organization’s systems. It contains insider IDs, logon/logoff events and PC IDs along with the associated timestamps. |
File | This file records insiders’ file transfers to removable devices. The details of file operations are recorded in “file.csv” (e.g., insider IDs, PC IDs, file type, content, timestamp, etc.). |
HTTP | This file contains the web browsing logs of insiders: visited URLs, timestamps, insider IDs, PC IDs, and some keywords from the content of visited webpages. |
Email | The email activity of insiders is stored in the “e-mail.csv” file, including insider IDs, email addresses, timestamps, email size, attachments, etc. Several pieces of information can be deduced from this file (e.g., the number of sent emails, whether the recipient is inside or outside the organization, and so on). |
Device | Insiders’ use of removable devices is logged in the “device.csv” file. It records interactions with removable devices, such as device IDs, connect and disconnect events, timestamps, and so on. |
Data Field | Variables Sample |
---|---|
Vectors | Logon, logoff, device, http. |
Timestamps | mm/dd/yyyy hh:mm:ss AM/PM. |
Insider IDs | MAR0955, MCF0600, … |
Actions | Connect/Disconnect, Logon/Logoff, http://wikileaks.org |
Target | Malicious or non-malicious. |
Feature | Description |
---|---|
Vector | This feature represents various types of insider activities (logon/logoff of insider sessions, visited webpage, used removable devices). |
Timestamp | The timestamp of each insider action, in the format mm/dd/yyyy hh:mm:ss AM/PM. |
Insider ID | The identifier of the insider, e.g., MAR0955, MCF0600. |
Action | The action of the insider, i.e., the value taken by the Vector feature (e.g., Connect/Disconnect, Logon/Logoff, visit http://wikileaks.org, etc.). |
Target | Whether an action of an insider is malicious or not, according to the ground truth of the CERT dataset. |
Feature | Data Type |
---|---|
Vectors | categorical |
Timestamps | ordinal |
Insider IDs | categorical |
Actions | categorical |
Target | categorical |
Feature | Encoded |
---|---|
Vectors | Int64 |
Timestamps | Int64 |
Insider IDs | Int64 |
Actions | Int64 |
Target | Int64 |
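
The table above shows every feature stored as an integer after label encoding. A minimal sketch of that mapping in plain Python follows; the function name `label_encode` and the sorted ordering of categories are assumptions for illustration, and the sample values are taken from the Vectors field listed earlier.

```python
def label_encode(values):
    """Map each distinct category to an integer code, using sorted category order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


vectors = ["logon", "device", "http", "logoff", "http"]
codes, classes = label_encode(vectors)
print(classes)  # ['device', 'http', 'logoff', 'logon']
print(codes)    # [3, 0, 1, 2, 1]
```

The codes imply an ordering (device < http < logoff < logon) that has no meaning for the data, which is the kind of encoding bias the one-hot method is meant to avoid.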
 | Detected: Malicious | Detected: Non-malicious |
---|---|---|
Actual: Malicious | True positives (TP) | False negatives (FN) |
Actual: Non-malicious | False positives (FP) | True negatives (TN) |
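
The precision, recall and F-measure reported in the tables that follow are derived from the cells of this confusion matrix. A short sketch using the standard definitions; the function name and the example counts are hypothetical and are not values from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    # Precision: of everything flagged as malicious, how much really was.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually malicious, how much was flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.67 0.73
```

None of the three metrics uses the true-negative count, so a large non-malicious class cannot inflate them the way it inflates plain accuracy.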
Feature Matrix | Training Set | Test Set |
---|---|---|
2308813 | 1847050 | 461763 |
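
The counts in this table are consistent with an 80/20 train/test split. That ratio is inferred from the numbers and is not stated in this excerpt, and rounding the test share up is likewise an assumption. A quick arithmetic check:

```python
import math


def split_sizes(n_samples, test_fraction=0.2):
    """Return (n_train, n_test), rounding the test share up to a whole sample."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


print(split_sizes(2308813))  # (1847050, 461763)
```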
ML Algorithms | Precision (Actual) | Precision (Scaled) | Recall (Actual) | Recall (Scaled) | F-Measure (Actual) | F-Measure (Scaled) |
---|---|---|---|---|---|---|
DT | 0.67 | 0.67 | 0.74 | 0.74 | 0.71 | 0.71 |
RF | 0.60 | 0.58 | 0.19 | 0.19 | 0.29 | 0.29 |
KNN | 0.13 | 0.84 | 0.01 | 0.21 | 0.02 | 0.32 |
KSVM | 0.00 | 1.00 | 0.00 | 0.15 | 0.00 | 0.27 |
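
The "Scaled" columns refer to features after standard scaling. A minimal sketch of that transformation, assuming the usual zero-mean, unit-variance definition with the population standard deviation; the function name `standard_scale` is illustrative.

```python
import statistics


def standard_scale(column):
    """Rescale a numeric column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


print(standard_scale([1.0, 2.0, 3.0, 4.0]))
```

Scaling puts label-encoded features with very different ranges on a common scale. In the table, the DT scores are identical with and without scaling, while the KNN and KSVM scores change, which is the pattern one would expect from distance- and kernel-based methods.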
Insider Acts | Login | USB | Website | Logoff |
---|---|---|---|---|
Login | 1 | 0 | 0 | 0 |
USB | 0 | 1 | 0 | 0 |
Website | 0 | 0 | 1 | 0 |
Logoff | 0 | 0 | 0 | 1 |
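
The table above is the pattern one-hot encoding produces for four categories: each act gets its own binary column. A minimal sketch in plain Python, assuming the category order shown in the table; the function name `one_hot` is illustrative and not taken from the paper.

```python
def one_hot(values, categories):
    """Encode each value as a binary vector with a single 1 at its category index."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows


categories = ["Login", "USB", "Website", "Logoff"]
for name, row in zip(categories, one_hot(categories, categories)):
    print(name, row)
```

Unlike integer label codes, these vectors are all equally distant from one another, so no category is treated as larger or closer to another.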
ML Algorithms | Precision | Recall | F-Measure |
---|---|---|---|
LR | 1.00 | 0.15 | 0.26 |
DT | 0.68 | 0.77 | 0.72 |
RF | 0.39 | 0.32 | 0.35 |
KNN | 0.33 | 0.06 | 0.11 |
KSVM | 1.00 | 0.15 | 0.27 |
ML Algorithms | Precision | Recall | F-Measure |
---|---|---|---|
LR | 0.50 | 1.00 | 0.67 |
DT | 0.99 | 0.99 | 0.99 |
RF | 0.99 | 0.99 | 0.99 |
NB | 0.77 | 0.95 | 0.85 |
KNN | 0.98 | 0.99 | 0.98 |
Approach | Model | AUC-ROC |
---|---|---|
Rashid et al. [31] | Hidden Markov models (HMM) | 0.83 |
Al-Mhiqani et al. [73] | Deep neural network (DNN) | 0.95 |
Gamachchi et al. [74] | Attributed graph clustering (AGC) | 0.76 |
Hall et al. [75] | Neural network (NN) | 0.95 |
Hall et al. [75] | Naive Bayesian network (NBN) | 0.98 |
Hall et al. [75] | Support vector machine (SVM) | 0.98 |
Hall et al. [75] | Random forest (RF) | 0.88 |
Hall et al. [75] | Decision tree (DT) | 0.93 |
Hall et al. [75] | Logistic regression (LR) | 0.80 |
Le et al. [76] | Unsupervised ensembles (UE) | 0.91 |
Sharma et al. [77] | Long short-term memory (LSTM) | 0.95 |
Singh et al. [78] | Multi fuzzy classifier (MFC) | 0.89 |
Wang et al. [79] | Principled and Probabilistic Model (PPM) | 0.99 |
Ye et al. [80] | Hidden Markov model (HMM) | 1.00 |
Yuan et al. [81] | Recurrent neural network (RNN) | 0.93 |
Yuan et al. [81] | Gated recurrent unit (GRU) | 0.91 |
Yuan et al. [81] | Long short-term memory (LSTM) | 0.93 |
Yuan et al. [82] | DT + random | 0.90 |
Yuan et al. [82] | Xgboost + random | 0.92 |
Yuan et al. [82] | DT + SMOTE | 0.98 |
Yuan et al. [82] | Xgboost + SMOTE | 0.98 |
Yuan et al. [82] | DT + GAN | 0.99 |
Yuan et al. [82] | Xgboost + GAN | 0.99 |
Proposed method | LR + SMOTE | 0.79 |
Proposed method | DT + SMOTE | 1.00 |
Proposed method | RF + SMOTE | 1.00 |
Proposed method | NB + SMOTE | 0.84 |
Proposed method | KNN + SMOTE | 0.99 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Al-Shehari, T.; Alsowail, R.A. An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques. Entropy 2021, 23, 1258. https://doi.org/10.3390/e23101258