Abstract
For efficiently handling thousands of malware specimens, we aim to quickly and automatically categorize those into malware families. A solution for this could be the neighbor-joining method using NCD (Normalized Compression Distance) as similarity of malware. It creates a phylogenetic tree of malware based on the NCDs between malware binaries for clustering. However, it is frustratingly slow because it requires \((N^2+N)/2\) compression attempts for the NCDs, where N is the number of given specimens. For fast clustering, this paper presents an algorithm for efficiently constructing a phylogenetic tree by greatly reducing compression attempts. The key idea to do so is not to construct a tree of N specimens all at once. Instead, it divides N specimens into temporal clusters in advance, constructs a small tree for each temporal cluster, and joins the trees as a united tree. Intuitively, separately constructing small trees requires a much smaller number of compression attempts than \((N^2+N)/2\). With experiments using 4,109 in-the-wild malware specimens, we confirm that our algorithm achieved clustering 22 times faster than the neighbor-joining method with a good accuracy of 97%.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
A preliminary version of this algorithm was proposed by Takumi Yone in the master’s thesis [19] of Kyushu University, who was supervised by some authors of this paper.
References
Malwr. https://malwr.com/
Virustotal. https://www.virustotal.com/
Antonakakis, M., et al.: Understanding the Mirai botnet. In: Proceedings of the 26th USENIX Conference on Security Symposium, SEC 2017, pp. 1093–1110. USENIX Association, Berkeley (2017)
Bailey, M., Oberheide, J., Andersen, J., Mao, Z.M., Jahanian, F., Nazario, J.: Automated classification and analysis of internet malware. In: Kruegel, C., Lippmann, R., Clark, A. (eds.) RAID 2007. LNCS, vol. 4637, pp. 178–197. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74320-0_10
Bayer, U., Comparetti, P.M., Hlauschek, C., Krügel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: Proceedings of the Network and Distributed System Security Symposium, NDSS 2009, San Diego, pp. 8–11 (2009)
Black Lotus Labs: Attack of things! https://www.netformation.com/our-pov/attack-of-things-2/
Cebrian, M., Alfonseca, M., Ortega, A.: The normalized compression distance is resistant to noise. IEEE Trans. Inf. Theory 53(5), 1895–1900 (2007)
Cilibrasi, R., Vitanyi, P.M.B.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)
Doctor Web: Dr.Web. https://www.drweb.com
Elias, I., Lagergren, J.: Fast neighbor joining. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds.) ICALP 2005. LNCS, vol. 3580, pp. 1263–1274. Springer, Heidelberg (2005). https://doi.org/10.1007/11523468_102
Karim, M.E., Walenstein, A., Lakhotia, A., Parida, L.: Malware phylogeny generation using permutations of code. J. Comput. Virol. 1(1–2), 13–23 (2005)
Langfelder, P., Zhang, B., Horvath, S.: Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics 24(5), 719–720 (2007)
Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)
Pa, Y.M.P., Suzuki, S., Yoshioka, K., Matsumoto, T., Kasama, T., Rossow, C.: IoTPOT: analysing the rise of IoT compromises. In: 9th USENIX Workshop on Offensive Technologies, WOOT 2015. USENIX Association, Washington, D.C. (2015)
Price, M.N., Dehal, P.S., Arkin, A.P.: Fasttree 2 - approximately maximum-likelihood trees for large alignments. PLOS ONE 5(3), 1–10 (2010)
Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)
Salomon, D.: Data Compression - The Complete Reference, 4th edn. Springer, London (2007). https://doi.org/10.1007/978-1-84628-603-2
Simonsen, M., Mailund, T., Pedersen, C.N.S.: Rapid neighbour-joining. In: Crandall, K.A., Lagergren, J. (eds.) WABI 2008. LNCS, vol. 5251, pp. 113–122. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87361-7_10
Yone, T.: Phylogenetic tree estimation for large-scale malware datasets. Master’s thesis. Kyushu University, Japan (2016). (in Japanese)
Acknowledgment
The authors wish to thank the IoTPOT team from Yokohama National University for providing the dataset. This research was partially supported by JSPS KAKENHI Grant Number 18H03291.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
He, T. et al. (2019). A Fast Algorithm for Constructing Phylogenetic Trees with Application to IoT Malware Clustering. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Lecture Notes in Computer Science(), vol 11953. Springer, Cham. https://doi.org/10.1007/978-3-030-36708-4_63
Download citation
DOI: https://doi.org/10.1007/978-3-030-36708-4_63
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36707-7
Online ISBN: 978-3-030-36708-4
eBook Packages: Computer ScienceComputer Science (R0)