Given the difficulty of analyzing all possible groupings within a large amount of data, many techniques are available to assist cluster formation.
3.1 Distances and Similarities
In the literature, there are concepts of similarity and distance. While the similarity between two identical objects has a value of 1, their distance has a value of 0, and vice versa: similarity and distance are inverses of each other [38]. Thus, similarity measures are numerical standards that demonstrate how related two objects are. In Table 4, we display measures prevalent in the literature.
For categorical sequences, we can employ techniques such as sequence alignment, the longest common subsequence, and n-gram analysis [38].
As mentioned earlier, n-grams are substrings of length n taken from a longer string [72]. Sequence alignment is a method of arranging one sequence above the other to identify the equivalences among similar substrings. The Longest Common Subsequence (LCS) algorithm finds the longest sequence of items common to the input strings by evaluating all possible prefix combinations of those strings [73].
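To make the idea concrete, the following minimal Python sketch implements the classic LCS dynamic program over prefix combinations; the normalization into a [0, 1] similarity at the end is one common convention, not a prescription from the surveyed works.

    def lcs_length(a: str, b: str) -> int:
        """Length of the longest common subsequence of a and b."""
        # dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j]
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def lcs_similarity(a: str, b: str) -> float:
        # One common normalization into [0, 1]; identical strings score 1
        return lcs_length(a, b) / max(len(a), len(b)) if (a or b) else 1.0

    print(lcs_similarity("abcdef", "abdf"))  # 4/6, about 0.67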
According to Borbely (2016), there are still challenges in choosing appropriate data understanding algorithms, the optimal combination of techniques, and their parameters for clustering tasks [74].
3.2 Analysis of Clustering Types
Clustering algorithms include various categories and diverse methods, such as the ones listed below.
Partitioning Methods: Given that there are n objects in the original set, partitioning methods split the group into k partitions. The best-known algorithms in this category are k-means and k-medoid, which rely on the idea of a center of gravity or a central object (a usage sketch follows the strengths and weaknesses below). All methods in this group share the same grouping quality and the same problems [75]: the number of clusters must be known in advance; clusters with extensive variations in size are difficult to identify; and the methods are only suitable for convex, spherical clusters.
• Strengths: Linear scalability. Some partition-based algorithms, such as CLARANS, handle outliers well [76].
• Weaknesses: They deal only with numeric data [77] and are sensitive to the order of the input records [78].
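As a usage illustration, the following Python sketch clusters synthetic data with scikit-learn's KMeans; the toy data, parameter values, and library choice are our own assumptions, not taken from the cited works.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Toy data: three spherical blobs, the shape partitioning methods favor
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

    # k must be fixed in advance, the main practical limitation noted above
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # the "centers of gravity"
    print(km.labels_[:10])      # cluster assignment for each object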
Hierarchical Methods: A hierarchical clustering produces a nested tree, often called a dendrogram. In this tree, the objects are the leaves, and the inner nodes reveal the similarity structure of the points. If the clustering hierarchy is formed from the bottom up, each data object is initially a cluster by itself, and small clusters are merged into larger groups at each level of the hierarchy until all data objects end up in a single group. This type of hierarchical method is called agglomerative.
The reverse process is called divisive hierarchical clustering. Several new hierarchical algorithms have appeared in recent years; the main difference between them is how they measure the similarity between each pair of clusters [75].
• Strengths: No initial number of clusters is required. Versatility and applicability to different types of attributes [79].
• Weaknesses: They are not scalable, and finding an optimal number of clusters is difficult and depends on the dataset size [36, 80, 81, 82].
Density-based methods: Density-based algorithms attempt to address the need for a method capable of finding clusters of arbitrary shapes. In these strategies, clusters correspond to dense regions of data separated by regions of low density. Some examples of these algorithms are DBSCAN, OPTICS, and DENCLUE (see the sketch after the strengths and weaknesses below).
• Strengths: They are noise resistant and can handle clusters of varying sizes and shapes [83].
• Weaknesses: These methods do not cope well with clusters of varying density or with very large datasets [84, 36].
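A brief illustration with scikit-learn's DBSCAN follows; the synthetic data and the parameter values (eps, min_samples) are illustrative assumptions, and points the algorithm rejects as noise receive the label -1.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(1)
    # Two dense blobs plus sparse uniform background noise
    X = np.vstack([rng.normal([0, 0], 0.3, (100, 2)),
                   rng.normal([4, 4], 0.3, (100, 2)),
                   rng.uniform(-2, 6, (30, 2))])

    db = DBSCAN(eps=0.5, min_samples=5).fit(X)
    print(set(db.labels_))  # points labeled -1 were rejected as noise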
Grid methods: Grid-based algorithms first quantize the clustering space into a finite number of cells, which forms a grid; all subsequent operations are performed on this grid (a minimal sketch follows the strengths and weaknesses below). Examples of algorithms are STING, WaveCluster, and CLIQUE.
• Strengths: These algorithms are suitable for parallel processing and incremental updating [85].
• Weaknesses: The clustering result is sensitive to the granularity of the grid [86].
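As a toy illustration of the grid idea (not an implementation of STING, WaveCluster, or CLIQUE), the following Python sketch quantizes 2-D points into cells, keeps the dense cells, and joins adjacent dense cells into clusters; the cell size and density threshold are assumed parameters.

    import numpy as np
    from collections import deque

    def grid_cluster(X, cell_size=1.0, density_threshold=5):
        # Quantize each 2-D point into a grid cell
        cells = {}
        for idx, p in enumerate(X):
            key = tuple((p // cell_size).astype(int))
            cells.setdefault(key, []).append(idx)
        # Keep only the cells holding enough points
        dense = {k for k, pts in cells.items() if len(pts) >= density_threshold}
        labels = np.full(len(X), -1)  # -1 marks points in sparse cells
        cluster_id, visited = 0, set()
        for start in dense:
            if start in visited:
                continue
            queue = deque([start])
            visited.add(start)
            while queue:  # flood-fill over adjacent dense cells
                cell = queue.popleft()
                labels[cells[cell]] = cluster_id
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nb = (cell[0] + dx, cell[1] + dy)
                        if nb in dense and nb not in visited:
                            visited.add(nb)
                            queue.append(nb)
            cluster_id += 1
        return labels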
Fractal-based methods: This clustering method uses the self-similarity properties of the data. First, the algorithm divides the data into sufficiently large subclusters with high compression. In the second step, the algorithm merges subclusters that are close to each other. This technique uses a more natural and fully deterministic method to generate the final clusters [87]. An example of an algorithm used in this type of clustering is FD3 [88].
• Strengths: High efficiency and great scalability; deals with outliers effectively [89].
• Weaknesses: The results are strongly sensitive to the parameters [86].
Methods based on Models: In this scheme, the algorithm assumes a reference model for each cluster and optimizes the fit between the analyzed elements and that mathematical model. Thus, these clustering algorithms are ideal for unknown distributions, such as mixtures of elementary distributions [90]. In this type of clustering, each cluster corresponds to a different distribution; usually, the distributions are considered Gaussian. One of the best-known examples is the Expectation-Maximization (EM) algorithm (a usage sketch follows the strengths and weaknesses below).
• Strengths: This clustering category automatically identifies the optimal number of clusters. Also, this type of clustering provides the probability of each sample belonging to each group [91].
• Weaknesses: High computational cost and time [92].
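The following sketch illustrates EM-style model-based clustering with scikit-learn's GaussianMixture; the toy data and the use of BIC to select the number of components are our own illustrative choices.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
                   rng.normal([6, 3], 0.5, (100, 2))])

    gm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gm.predict_proba(X)[:3])  # probability of belonging to each cluster
    # Information criteria such as BIC can guide the number of components:
    for k in (1, 2, 3):
        print(k, GaussianMixture(n_components=k, random_state=0).fit(X).bic(X))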
Graph-based methods (Graph Theory): A clustering problem can be represented as a graph in which each node represents an element of the set; the distance between two elements is modeled as the weight of the edge connecting them (see the sketch after the strengths and weaknesses below). Examples of algorithms are HCS, DTG, CLICK, and CAST [93].
• Strengths: These algorithms suit datasets with arbitrary shapes and high dimensionality [86].
• Weaknesses: The number of clusters needs to be preset [94].
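As a minimal illustration of the graph-theoretic flavor (not a faithful implementation of HCS, DTG, CLICK, or CAST), the sketch below connects elements whose pairwise distance falls under a threshold and reads clusters off the connected components; the threshold max_edge is an assumed parameter.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components
    from scipy.spatial.distance import pdist, squareform

    def graph_cluster(X, max_edge=1.0):
        D = squareform(pdist(X))               # pairwise distances = edge weights
        adjacency = csr_matrix(D <= max_edge)  # keep only the short edges
        # Clusters are the connected components of the thresholded graph
        n_clusters, labels = connected_components(adjacency, directed=False)
        return n_clusters, labels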
Methods based on Distribution: This scheme assumes that the data is composed of distributions, such as Gaussian distributions. As the distance from a distribution's center increases, the probability of a point belonging to that distribution decreases. DBCLASD [95] is an example of such a clustering algorithm.
• Strengths: Moderately high scalability, supported by well-developed statistical theory [86].
• Weaknesses: Relatively high time complexity and a strong influence of many parameters [86].
Methods based on Neural Networks: Competitive artificial neural networks (ANNs) can serve as models for clustering. They are commonly called self-organizing networks and are typically unsupervised methods. Two famous algorithms are SOM and Adaptive Resonance Theory (ART) [96]. The Kohonen map (SOM) uses the concept of neighborhood: this type of network learns to recognize neighboring sections of the input space and the topology of the input vectors on which it is trained [97] (a minimal training loop is sketched after the strengths and weaknesses below). Typical applications of the ART method include radar signal recognition and image processing.
• Strengths: Easy interpretation and the ability to handle large amounts of complex data [98].
• Weaknesses: A good dataset is critical, and determining which factors are relevant to the problem can be a complicated task [99].
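The following is a minimal sketch of the SOM competitive training loop under our own simplifying assumptions (Gaussian neighborhood, linearly decaying learning rate and radius, no input normalization); production use typically relies on dedicated libraries.

    import numpy as np

    def train_som(X, rows=5, cols=5, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(rows, cols, X.shape[1]))  # one weight vector per neuron
        gy, gx = np.mgrid[0:rows, 0:cols]
        coords = np.stack([gy, gx], axis=-1).astype(float)
        for epoch in range(epochs):
            lr = lr0 * (1.0 - epoch / epochs)          # decaying learning rate
            sigma = sigma0 * (1.0 - epoch / epochs) + 0.1
            for x in rng.permutation(X):
                # Best-matching unit: the neuron closest to the sample
                bmu = np.unravel_index(
                    np.argmin(np.linalg.norm(W - x, axis=-1)), (rows, cols))
                # Gaussian neighborhood on the grid around the BMU
                g = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=-1)
                           / (2.0 * sigma ** 2))
                W += lr * g[..., None] * (x - W)       # pull neighborhood toward x
        return W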
Methods based on Evolutionary Computing: Evolutionary algorithms, such as genetic algorithms, can be applied to cluster analysis. In this clustering method, the algorithm evaluates which characteristics best represent the elements and applies selection, mutation, and crossover operators [100] (see the sketch after the strengths and weaknesses below). Examples of these algorithms are the genetic KM [101] and the genetically guided algorithm [102].
• Strengths: Experiments have shown compact and well-separated clusters [103].
• Weaknesses: The computational effort is still an issue [104].
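To illustrate the evolutionary approach, the sketch below evolves sets of k centroids with truncation selection, uniform crossover, and mutation, using the within-cluster sum of squares as (inverse) fitness; it is loosely in the spirit of the genetic k-means family, not a reimplementation of the cited algorithms, and all parameter values are assumptions.

    import numpy as np

    def ga_cluster(X, k=3, pop_size=20, generations=50, mut_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        lo, hi = X.min(axis=0), X.max(axis=0)
        # Each individual encodes k candidate centroids
        pop = rng.uniform(lo, hi, size=(pop_size, k, X.shape[1]))

        def sse(cents):  # within-cluster sum of squares (lower = fitter)
            d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=-1)
            return (d.min(axis=1) ** 2).sum()

        for _ in range(generations):
            order = np.argsort([sse(ind) for ind in pop])
            parents = pop[order[: pop_size // 2]]      # truncation selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                mask = rng.random(k) < 0.5             # uniform crossover
                child = np.where(mask[:, None], a, b)
                if rng.random() < mut_rate:            # mutation: jitter one centroid
                    child[rng.integers(k)] += rng.normal(0.0, 0.1, X.shape[1])
                children.append(child)
            pop = np.concatenate([parents, np.array(children)], axis=0)
        return pop[np.argmin([sse(ind) for ind in pop])]  # best centroid set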
Fuzzy-based methods: This type of clustering allows an individual to partially belong to more than one group, with varying degrees of membership. Cluster boundary elements are not required to belong entirely to one cluster; instead, membership degrees between 0 and 1 are assigned, indicating their partial association [105] (a minimal sketch follows the strengths and weaknesses below). Generally, these models are standard in pattern recognition. The fuzzy c-means and fuzzy c-shells algorithms are examples of fuzzy clustering.
• Strengths: This method is robust to ambiguity and retains more information than any rigid (hard) clustering method [106].
• Weaknesses: High sensitivity to noise [107].
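A minimal sketch of the standard fuzzy c-means alternation follows: memberships in [0, 1] that sum to one per sample are updated from inverse-distance ratios, and centers are recomputed from the fuzzified memberships; the fuzzifier m and iteration budget are assumed values.

    import numpy as np

    def fuzzy_c_means(X, c=3, m=2.0, iters=100, seed=0, eps=1e-9):
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)  # memberships sum to 1 per sample
        for _ in range(iters):
            Um = U ** m                    # fuzzified memberships
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + eps
            # Standard FCM update: inverse-distance ratios to every center
            U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        return centers, U

    # Hard labels, if needed, are the maximal memberships: U.argmax(axis=1)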
Kernel-based methods: Kernel algorithms map the input data into a feature space that allows nonlinear separation. Kernel-based clustering groups the data implicitly in this space: the method performs a nonlinear mapping of the input data into a high-dimensional feature space by substituting the inner product between variables with an appropriately determined kernel [108] (see the sketch after the strengths and weaknesses below). Kernel k-means and Support Vector Clustering (SVC) are examples of this type of clustering.
• Strengths: These methods are able to identify nonlinear structures [109].
• Weaknesses: Large time complexity, and the algorithms are complex in nature [109].
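The sketch below illustrates kernel k-means operating purely on a Gram matrix K, expanding the feature-space distance to each cluster mean using kernel entries only; the RBF kernel suggested in the comment is one common choice, not mandated by the cited work.

    import numpy as np

    def kernel_kmeans(K, k=2, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        labels = rng.integers(k, size=n)
        for _ in range(iters):
            dist = np.full((n, k), np.inf)
            for c in range(k):
                idx = np.flatnonzero(labels == c)
                if len(idx) == 0:
                    continue  # empty cluster: leave its distances infinite
                # ||phi(x_i) - mu_c||^2 written purely with kernel entries
                dist[:, c] = (np.diag(K)
                              - 2.0 * K[:, idx].mean(axis=1)
                              + K[np.ix_(idx, idx)].mean())
            new_labels = dist.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels

    # e.g., with an RBF kernel (one common choice):
    # from sklearn.metrics.pairwise import rbf_kernel
    # labels = kernel_kmeans(rbf_kernel(X, gamma=1.0), k=2)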
We emphasize that the clustering task is highly dependent on the parameters, similarity measures, and methods used by the algorithm [65].
3.3 Linkage Measure
Hierarchical algorithms are clustering techniques that create a hierarchy of relationships between objects. They work in two modes: the agglomerative mode, which constructs clusters from isolated elements, and the divisive mode, in which the process starts with a broad set that is broken into parts until it reaches isolated elements [110]. The principal aspect of hierarchical algorithms is the selection of pairs based on the linkage function.
The linkage measure uses similarity or distance indices between the elements of the set. In this process, similar elements are grouped into a single cluster, reducing the total number of remaining clusters. One of the essential uses of the agglomerative method is to identify the point at which two groups of elements are too distinct to form a homogeneous cluster. As the distance values between samples increase, it is best to stop the clustering process to prevent the resulting groups from becoming too dissimilar [111].
Some of the linkage methods are the following (a usage sketch covering all of them follows the list):
• Maximum linkage or Complete Linkage: In this method, the distance between two clusters is the maximum distance between their elements, and the two clusters with the smallest such distance are merged. This process repeats until only a single cluster is left. Complete linkage is also called farthest neighbor linkage [112].
• Minimum linkage or Single Linkage: This measure estimates the distance between clusters using the elements with the smallest distance between them. For this reason, single linkage is sometimes called nearest neighbor linkage [111].
• Average linkage: Also known as UPGMA (Unweighted Pair Group Method with Arithmetic Mean). This measure merges clusters using the average pairwise proximity between the member individuals of the different groups [113]. The UPGMA method is related to its weighted variant, the WPGMA method [111].
• Centroid Method: Also known as UPGMC (Unweighted Pair Group Method with Centroid Averaging). In this type, the distance between the centroids of the groups defines the proximity between them [114]. UPGMC employs only Euclidean distances [115].
• Ward Method: Ward's linkage merges the two clusters whose union yields the smallest increase in the total sum of squared deviations from the cluster means [112].
• Median Method: In this form, the distance between two clusters is the median distance between an element in one cluster and an observation in the other cluster [112].
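As a usage sketch, scipy.cluster.hierarchy exposes the linkage measures above through its method argument; the toy data and the cut into two flat clusters are illustrative assumptions.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in ([0, 0], [3, 3])])

    # The `method` argument selects the linkage measure discussed above
    for method in ("single", "complete", "average", "centroid", "ward", "median"):
        Z = linkage(X, method=method)                    # full merge hierarchy
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
        print(method, np.bincount(labels)[1:])
    # scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding tree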
After deciding on the most appropriate method and applying it, we generate the clusters. The interpretation of the generated clusters remains a complex step, which aims to discover meanings related to the domain of the analyzed datasets.
3.5 Phylogeny
Phylogeny is the genealogical history of a group of organisms and a hypothetical representation of ancestral/descendant relationships [135].
Binary trees (rooted or unrooted), phylogenetic networks, and graphs are used to represent the relationships between species [136]. For rooted trees, the root describes the common ancestor of all species represented in the tree. When an ancestor is not specified or assumed, an unrooted tree is applicable for representing relationships between groups, regardless of the common ancestor. The nodes in a phylogeny denote taxonomic units or species.
Phylogeny differs from taxonomy: while the latter groups organisms by shared characteristics, phylogeny assumes a common ancestor and seeks to derive the relationships among its descendants [137].
Computational phylogenetics is the application of computational algorithms, methods, and programs to phylogenetic analysis. Biological phylogenetics inspires malware phylogeny research: we can adapt methods commonly used in biology to construct, evaluate, and compare phylogenetic trees in malicious program research [56].
Even though there are apparent differences between malicious software and species, by considering software products as species and source code as genes, software evolution can be investigated in the same way as biological evolution [138].
The most common tree-building methods apply distance, parsimony, maximum likelihood, and Bayesian approaches. For example, one of the most common methods of building phylogenetic trees is the Maximum Parsimony (MP) method [139]. This method assumes that the most accurate tree is the one that requires the fewest changes to produce the data contained in the alignment [140]. Another popular method is Maximum Likelihood Estimation (MLE), which is used to estimate the parameters of a statistical model.
The MP and MLE methods are character-based, i.e., they depend on the individual values of each aligned sequence. Other methods, known as distance-based methods, estimate the phylogeny by considering each taxon's sequence as a whole, often through a distance matrix. The simplest distance-based method is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). In this method, the sequence pair (or group of sequences) to be grouped first is the one with the shortest distance among all pairs (or groups).
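Since UPGMA coincides with average linkage over a distance matrix, the following sketch builds a small UPGMA tree with SciPy; the four taxa and their distances are hypothetical values chosen only for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree
    from scipy.spatial.distance import squareform

    taxa = ["A", "B", "C", "D"]                 # hypothetical taxa
    D = np.array([[0, 2, 6, 6],                 # illustrative distances
                  [2, 0, 6, 6],
                  [6, 6, 0, 4],
                  [6, 6, 4, 0]], dtype=float)

    # UPGMA is exactly average linkage applied to a condensed distance matrix
    Z = linkage(squareform(D), method="average")
    print(Z)            # each row: merged nodes, merge height, cluster size
    root = to_tree(Z)   # rooted binary tree of the merges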
Both character-based and distance-based methods have advantages and disadvantages. The UPGMA method, for example, is easy to implement but is not widely used for phylogenetic tree creation because it does not take into account different rates of evolutionary change between the sequences represented in the tree [141]. Character-based methods are more faithful in the evolutionary representation of phylogeny but are more computationally expensive.
One of the recurring problems in phylogenetic analysis is the size of the generated tree, especially in sets with thousands of specimens. This issue is not easily solved just by using zoom and panning tools [142]. Besides, no available technique provides a clear separation of hierarchical structure lines and labels.
Although tree-based models are the main ones in the literature, they are not suitable for representing cross-linking events, which can occur in malware generation [143]. Thus, phylogenetic networks emerged as an alternative way to model evolution. These networks are graphs in which each labeled leaf represents an instance, and nodes can have more than one parent.