Abstract
The operational complexity and dynamic nature of clouds highlight the importance of automated solutions for explaining the root causes of security incidents. Most existing works rely on human analysts to interpret provenance graphs in order to identify those root causes. However, navigating and understanding a large, complex, cloud-scale provenance graph can be very challenging for human analysts. Without such an understanding, cloud providers cannot effectively address the underlying security issues that cause the incidents, such as vulnerabilities or misconfigurations. In this paper, we propose VinciDecoder, an automated approach for generating natural language forensic reports based on provenance graphs. Our main observation is that the way nodes and edges compose a path in a provenance graph is similar to how words compose a sentence in a natural language. Therefore, VinciDecoder leverages a novel combination of provenance analysis, natural language translation, and machine-learning techniques to generate forensic reports. We implement VinciDecoder on an OpenStack cloud testbed and evaluate its performance on real-world attacks. Our user study and experimental results demonstrate the effectiveness of our approach in generating high-quality reports (e.g., a BLEU score of up to 0.68 for precision).
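To make the path-as-sentence observation concrete, the following is a minimal sketch of how a provenance path might be flattened into a token sequence that a translation model could take as a source sentence. It is not VinciDecoder's actual encoding: the `type:name` token format, the function name `path_to_tokens`, and the edge labels `initiated` and `generated` are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical node representation: (node_type, name), e.g. ("user", "alice").
Node = Tuple[str, str]


def path_to_tokens(nodes: List[Node], edges: List[str]) -> List[str]:
    """Flatten a provenance path into a token sequence (a "source sentence").

    Nodes and edge labels are interleaved in path order, so a path with
    n nodes must come with exactly n - 1 edge labels.
    """
    if len(edges) != len(nodes) - 1:
        raise ValueError("a path with n nodes needs exactly n - 1 edges")
    tokens: List[str] = []
    for i, (node_type, name) in enumerate(nodes):
        tokens.append(f"{node_type}:{name}")
        if i < len(edges):
            tokens.append(edges[i])
    return tokens


# Illustrative path: a user runs an operation that produces a VM.
nodes = [("user", "alice"), ("operation", "CreateVM"), ("vm", "vm1")]
edges = ["initiated", "generated"]  # assumed edge labels, not from the paper
source_sentence = " ".join(path_to_tokens(nodes, edges))
print(source_sentence)
# prints: user:alice initiated operation:CreateVM generated vm:vm1
```

Under this view, each path plays the role of a sentence in a source language, and the corresponding passage of a forensic report is its translation.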
Notes
- 1.
In Sect. 4.2, we discuss how we obtain more pairs of reports and paths for training.
- 2.
Despite removing the numbers, the range of the elapsed time (e.g., milliseconds vs. hours) retains useful information about the incidents.
- 3.
- 4.
Note that while both sets of our experiments in Sect. 4.1 and 4.2 show high-quality reports, directly comparing their results is not meaningful, as their reports are of incomparable lengths (e.g., cloud management-level provenance graph-based reports are typically longer, which has a negative effect on the performance).
- 5.
This study has been identified as quality assurance by Research Ethics/Office of Research of our university, which means it requires no ethics approval.
Acknowledgment
We thank the anonymous reviewers for their valuable comments. This work was supported by the Natural Sciences and Engineering Research Council of Canada and Ericsson Canada under the Industrial Research Chair in SDN/NFV Security and the Canada Foundation for Innovation under JELF Project 38599.
Appendix
Algorithm 2 shows our rule-based mechanism for generating reports from cloud management-level provenance graphs (e.g., the provenance graph in Fig. 1). To generate fluent sentences, we specify rules for indicating different subjects (lines 2–5). We add resources extracted from the names of operations (e.g., a VM in CreateVM) through the template a $resource_type named $main_resource_name (lines 7–9). We specify various rules (lines 11–20) for describing other affected resources connected to an operation node. We also specify rules to record other information, such as the elapsed time between operations (lines 21–26). Through such rules, specifically designed for each type of operation, resource, and user, VinciDecoder can generate reports even when there is too little training data for generating high-quality reports.
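The kinds of rules described above can be sketched as follows. This is an illustrative approximation, not a reproduction of Algorithm 2: only the template string a $resource_type named $main_resource_name comes from the text, while the verb table, the subject rule, the elapsed-time buckets, the event format, and all function names are assumptions.

```python
from string import Template

# Template quoted in the appendix text; everything else below is illustrative.
RESOURCE_TEMPLATE = Template("a $resource_type named $main_resource_name")

# Hypothetical mapping from operation-name prefixes to past-tense verbs.
VERBS = {"Create": "created", "Delete": "deleted", "Update": "updated"}


def split_operation(op_name):
    """Split e.g. 'CreateVM' into a verb ('created') and a resource type ('VM')."""
    for prefix, verb in VERBS.items():
        if op_name.startswith(prefix):
            return verb, op_name[len(prefix):]
    raise ValueError(f"no rule for operation {op_name!r}")


def describe_elapsed(seconds):
    """Report only the order of magnitude of the elapsed time (assumed buckets)."""
    if seconds < 1:
        return "within milliseconds"
    if seconds < 60:
        return "seconds later"
    if seconds < 3600:
        return "minutes later"
    return "hours later"


def generate_report(events):
    """Turn a time-ordered list of operation events into report sentences."""
    sentences = []
    prev_user = None
    prev_time = None
    for ev in events:
        verb, resource_type = split_operation(ev["operation"])
        # Subject rule (assumed): avoid repeating the user name in a row.
        subject = "The same user" if ev["user"] == prev_user else f"User {ev['user']}"
        resource = RESOURCE_TEMPLATE.substitute(
            resource_type=resource_type,
            main_resource_name=ev["resource"],
        )
        sentence = f"{subject} {verb} {resource}"
        if prev_time is not None:
            sentence += " " + describe_elapsed(ev["time"] - prev_time)
        sentences.append(sentence + ".")
        prev_user, prev_time = ev["user"], ev["time"]
    return " ".join(sentences)


# Hypothetical events; times are in seconds.
events = [
    {"user": "alice", "operation": "CreateVM", "resource": "vm1", "time": 0.0},
    {"user": "alice", "operation": "DeleteVM", "resource": "vm1", "time": 7200.0},
]
report = generate_report(events)
print(report)
# prints: User alice created a VM named vm1. The same user deleted a VM named vm1 hours later.
```

Reporting only the order of magnitude of the elapsed time mirrors note 2, which observes that the range of the elapsed time (e.g., milliseconds vs. hours) retains useful information even after the numbers are removed.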
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tabiban, A., Zhao, H., Jarraya, Y., Pourzandi, M., Wang, L. (2022). VinciDecoder: Automatically Interpreting Provenance Graphs into Textual Forensic Reports with Application to OpenStack. In: Reiser, H.P., Kyas, M. (eds) Secure IT Systems. NordSec 2022. Lecture Notes in Computer Science, vol 13700. Springer, Cham. https://doi.org/10.1007/978-3-031-22295-5_19
DOI: https://doi.org/10.1007/978-3-031-22295-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22294-8
Online ISBN: 978-3-031-22295-5
eBook Packages: Computer Science, Computer Science (R0)