Given the difficulty of analyzing all possible groupings within a large amount of data, many techniques are available to assist cluster formation.
3.1 Distances and Similarities
In the literature, there are concepts of similarity and distance. While the similarity between two identical objects has a value of 1, their distance has a value of 0, and vice versa: similarity and distance are inverses of each other [38]. Thus, similarity measures are numerical standards that demonstrate how related two objects are. In Table 4, we display measures prevalent in the literature.
For categorical sequences, we can employ techniques such as sequence alignment, the longest common subsequence, and n-gram analysis [38].
As mentioned earlier, n-grams are substrings of length n taken from a longer string [72]. Sequence alignment is a method of arranging one sequence above the other to identify the equivalences among similar substrings. The Longest Common Subsequence (LCS) algorithm finds the longest sequence of items common to the input strings by evaluating all possible prefix combinations of those strings [73].
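To make the idea concrete, the following minimal Python sketch implements the classic LCS dynamic program over prefix combinations; the normalization into a [0, 1] similarity at the end is one common convention, not a prescription from the surveyed works.

    def lcs_length(a: str, b: str) -> int:
        """Length of the longest common subsequence of a and b."""
        # dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j]
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def lcs_similarity(a: str, b: str) -> float:
        # One common normalization into [0, 1]; identical strings score 1
        return lcs_length(a, b) / max(len(a), len(b)) if (a or b) else 1.0

    print(lcs_similarity("abcdef", "abdf"))  # 4/6, about 0.67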
According to Borbely (2016), there are still challenges in choosing appropriate data understanding algorithms, the optimal combination of techniques, and their parameters for clustering tasks [74].
3.2 Analysis of Clustering Types
Clustering algorithms include various categories and diverse methods, such as the ones listed below.
Partitioning Methods: Given that there are n objects in the original set, partitioning methods split the group into k partitions. The best-known algorithms in this category are k-means and k-medoid, which rely on the idea of a center of gravity or a central object (a usage sketch follows the strengths and weaknesses below). All methods in this group share the same grouping quality and the same problems [75]: the number of clusters must be known in advance; clusters with extensive variations in size are difficult to identify; and the methods are only suitable for convex, spherical clusters.
• Strengths: Linear scalability. Some partition-based algorithms, such as CLARANS, handle outliers well [76].
• Weaknesses: They deal only with numeric data [77] and are sensitive to the order of the input records [78].
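As a usage illustration, the following Python sketch clusters synthetic data with scikit-learn's KMeans; the toy data, parameter values, and library choice are our own assumptions, not taken from the cited works.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Toy data: three spherical blobs, the shape partitioning methods favor
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

    # k must be fixed in advance, the main practical limitation noted above
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # the "centers of gravity"
    print(km.labels_[:10])      # cluster assignment for each object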
Hierarchical Methods: A hierarchical clustering produces a nested tree, often called a dendrogram. In this tree, the objects are the leaves, and the inner nodes reveal the similarity structure of the points. If the clustering hierarchy is formed from the bottom up, each data object is initially a cluster by itself, and small clusters are merged into larger groups at each level of the hierarchy until all data objects end up in a single group. This type of hierarchical method is called agglomerative.
The reverse process is called divisive hierarchical clustering. Several new hierarchical algorithms have appeared in recent years; the main difference between them is how they measure the similarity between each pair of clusters [75].
• Strengths: No initial number of clusters is required. Versatility and applicability to different types of attributes [79].
• Weaknesses: They are not scalable, and finding an optimal number of clusters is difficult and depends on the dataset size [36, 80, 81, 82].
Density-based methods: Density-based algorithms attempt to address the need for a method capable of finding clusters of arbitrary shapes. In these strategies, clusters correspond to dense regions of data separated by regions of low density. Some examples of these algorithms are DBSCAN, OPTICS, and DENCLUE (see the sketch after the strengths and weaknesses below).
• Strengths: They are noise resistant and can handle clusters of varying sizes and shapes [83].
• Weaknesses: These methods do not cope well with clusters of varying density or with very large datasets [84, 36].
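A brief illustration with scikit-learn's DBSCAN follows; the synthetic data and the parameter values (eps, min_samples) are illustrative assumptions, and points the algorithm rejects as noise receive the label -1.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(1)
    # Two dense blobs plus sparse uniform background noise
    X = np.vstack([rng.normal([0, 0], 0.3, (100, 2)),
                   rng.normal([4, 4], 0.3, (100, 2)),
                   rng.uniform(-2, 6, (30, 2))])

    db = DBSCAN(eps=0.5, min_samples=5).fit(X)
    print(set(db.labels_))  # points labeled -1 were rejected as noise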
Grid methods: Grid-based algorithms first quantize the clustering space into a finite number of cells, which forms a grid; all subsequent operations are performed on this grid (a minimal sketch follows the strengths and weaknesses below). Examples of algorithms are STING, WaveCluster, and CLIQUE.
• Strengths: These algorithms are suitable for parallel processing and incremental updating [85].
• Weaknesses: The clustering result is sensitive to the granularity of the grid [86].
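As a toy illustration of the grid idea (not an implementation of STING, WaveCluster, or CLIQUE), the following Python sketch quantizes 2-D points into cells, keeps the dense cells, and joins adjacent dense cells into clusters; the cell size and density threshold are assumed parameters.

    import numpy as np
    from collections import deque

    def grid_cluster(X, cell_size=1.0, density_threshold=5):
        # Quantize each 2-D point into a grid cell
        cells = {}
        for idx, p in enumerate(X):
            key = tuple((p // cell_size).astype(int))
            cells.setdefault(key, []).append(idx)
        # Keep only the cells holding enough points
        dense = {k for k, pts in cells.items() if len(pts) >= density_threshold}
        labels = np.full(len(X), -1)  # -1 marks points in sparse cells
        cluster_id, visited = 0, set()
        for start in dense:
            if start in visited:
                continue
            queue = deque([start])
            visited.add(start)
            while queue:  # flood-fill over adjacent dense cells
                cell = queue.popleft()
                labels[cells[cell]] = cluster_id
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nb = (cell[0] + dx, cell[1] + dy)
                        if nb in dense and nb not in visited:
                            visited.add(nb)
                            queue.append(nb)
            cluster_id += 1
        return labels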
Fractal-based methods: This clustering method uses the self-similarity properties of the data. First, the algorithm divides the data into sufficiently large subclusters with high compression. In the second step, the algorithm merges subclusters that are close to each other. This technique uses a more natural and fully deterministic method to generate the final clusters [87]. An example of an algorithm used in this type of clustering is FD3 [88].
• Strengths: High efficiency and great scalability; deals with outliers effectively [89].
• Weaknesses: The results are strongly sensitive to the parameters [86].
Methods based on Models: In this scheme, the algorithm assumes a reference model for each cluster and optimizes the fit between the analyzed elements and that mathematical model. Thus, these clustering algorithms are ideal for unknown distributions, such as mixtures of elementary distributions [90]. In this type of clustering, each cluster corresponds to a different distribution; usually, the distributions are considered Gaussian. One of the best-known examples is the Expectation-Maximization (EM) algorithm (a usage sketch follows the strengths and weaknesses below).
• Strengths: This clustering category automatically identifies the optimal number of clusters. Also, this type of clustering provides the probability of each sample belonging to each group [91].
• Weaknesses: High computational cost and time [92].
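The following sketch illustrates EM-style model-based clustering with scikit-learn's GaussianMixture; the toy data and the use of BIC to select the number of components are our own illustrative choices.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
                   rng.normal([6, 3], 0.5, (100, 2))])

    gm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gm.predict_proba(X)[:3])  # probability of belonging to each cluster
    # Information criteria such as BIC can guide the number of components:
    for k in (1, 2, 3):
        print(k, GaussianMixture(n_components=k, random_state=0).fit(X).bic(X))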
Graph-based methods (Graph Theory): A clustering problem can be represented as a graph in which each node represents an element of the set; the distance between two elements is modeled as the weight of the edge connecting them (see the sketch after the strengths and weaknesses below). Examples of algorithms are HCS, DTG, CLICK, and CAST [93].
• Strengths: These algorithms suit datasets with arbitrary shapes and high dimensionality [86].
• Weaknesses: The number of clusters needs to be preset [94].
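As a minimal illustration of the graph-theoretic flavor (not a faithful implementation of HCS, DTG, CLICK, or CAST), the sketch below connects elements whose pairwise distance falls under a threshold and reads clusters off the connected components; the threshold max_edge is an assumed parameter.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components
    from scipy.spatial.distance import pdist, squareform

    def graph_cluster(X, max_edge=1.0):
        D = squareform(pdist(X))               # pairwise distances = edge weights
        adjacency = csr_matrix(D <= max_edge)  # keep only the short edges
        # Clusters are the connected components of the thresholded graph
        n_clusters, labels = connected_components(adjacency, directed=False)
        return n_clusters, labels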
Methods based on Distribution: This scheme assumes that the data is composed of distributions, such as Gaussian distributions. As the distance from a distribution's center increases, the probability of a point belonging to that distribution decreases. DBCLASD [95] is an example of such a clustering algorithm.
• Strengths: Moderately high scalability, supported by well-developed statistical theory [86].
• Weaknesses: Relatively high time complexity and a strong influence of many parameters [86].
Methods based on Neural Networks: Competitive artificial neural networks (ANNs) can serve as models for clustering. They are commonly called self-organizing networks and are typically unsupervised methods. Two famous algorithms are SOM and Adaptive Resonance Theory (ART) [96]. The Kohonen map (SOM) uses the concept of neighborhood: this type of network learns to recognize neighboring sections of the input space and the topology of the input vectors on which it is trained [97] (a minimal training loop is sketched after the strengths and weaknesses below). Typical applications of the ART method include radar signal recognition and image processing.
• Strengths: Easy interpretation and the ability to handle large amounts of complex data [98].
• Weaknesses: A good dataset is critical, and determining which factors are relevant to the problem can be a complicated task [99].
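The following is a minimal sketch of the SOM competitive training loop under our own simplifying assumptions (Gaussian neighborhood, linearly decaying learning rate and radius, no input normalization); production use typically relies on dedicated libraries.

    import numpy as np

    def train_som(X, rows=5, cols=5, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(rows, cols, X.shape[1]))  # one weight vector per neuron
        gy, gx = np.mgrid[0:rows, 0:cols]
        coords = np.stack([gy, gx], axis=-1).astype(float)
        for epoch in range(epochs):
            lr = lr0 * (1.0 - epoch / epochs)          # decaying learning rate
            sigma = sigma0 * (1.0 - epoch / epochs) + 0.1
            for x in rng.permutation(X):
                # Best-matching unit: the neuron closest to the sample
                bmu = np.unravel_index(
                    np.argmin(np.linalg.norm(W - x, axis=-1)), (rows, cols))
                # Gaussian neighborhood on the grid around the BMU
                g = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=-1)
                           / (2.0 * sigma ** 2))
                W += lr * g[..., None] * (x - W)       # pull neighborhood toward x
        return W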
Methods based on Evolutionary Computing: Evolutionary algorithms, such as genetic algorithms, can be applied to cluster analysis. In this clustering method, the algorithm evaluates which characteristics best represent the elements and applies selection, mutation, and crossover operators [100] (see the sketch after the strengths and weaknesses below). Examples of these algorithms are the genetic KM [101] and the genetically guided algorithm [102].
• Strengths: Experiments have shown compact and well-separated clusters [103].
• Weaknesses: The computational effort is still an issue [104].
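To illustrate the evolutionary approach, the sketch below evolves sets of k centroids with truncation selection, uniform crossover, and mutation, using the within-cluster sum of squares as (inverse) fitness; it is loosely in the spirit of the genetic k-means family, not a reimplementation of the cited algorithms, and all parameter values are assumptions.

    import numpy as np

    def ga_cluster(X, k=3, pop_size=20, generations=50, mut_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        lo, hi = X.min(axis=0), X.max(axis=0)
        # Each individual encodes k candidate centroids
        pop = rng.uniform(lo, hi, size=(pop_size, k, X.shape[1]))

        def sse(cents):  # within-cluster sum of squares (lower = fitter)
            d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=-1)
            return (d.min(axis=1) ** 2).sum()

        for _ in range(generations):
            order = np.argsort([sse(ind) for ind in pop])
            parents = pop[order[: pop_size // 2]]      # truncation selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                mask = rng.random(k) < 0.5             # uniform crossover
                child = np.where(mask[:, None], a, b)
                if rng.random() < mut_rate:            # mutation: jitter one centroid
                    child[rng.integers(k)] += rng.normal(0.0, 0.1, X.shape[1])
                children.append(child)
            pop = np.concatenate([parents, np.array(children)], axis=0)
        return pop[np.argmin([sse(ind) for ind in pop])]  # best centroid set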
Fuzzy-based methods: This type of clustering allows an individual to partially belong to more than one group, with varying degrees of membership. Cluster boundary elements are not required to belong entirely to one cluster; instead, membership degrees between 0 and 1 are assigned, indicating their partial association [105] (a minimal sketch follows the strengths and weaknesses below). Generally, these models are standard in pattern recognition. The fuzzy c-means and fuzzy c-shells algorithms are examples of fuzzy clustering.
• Strengths: This method is robust to ambiguity and retains more information than any rigid (hard) clustering method [106].
• Weaknesses: High sensitivity to noise [107].
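A minimal sketch of the standard fuzzy c-means alternation follows: memberships in [0, 1] that sum to one per sample are updated from inverse-distance ratios, and centers are recomputed from the fuzzified memberships; the fuzzifier m and iteration budget are assumed values.

    import numpy as np

    def fuzzy_c_means(X, c=3, m=2.0, iters=100, seed=0, eps=1e-9):
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)  # memberships sum to 1 per sample
        for _ in range(iters):
            Um = U ** m                    # fuzzified memberships
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + eps
            # Standard FCM update: inverse-distance ratios to every center
            U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        return centers, U

    # Hard labels, if needed, are the maximal memberships: U.argmax(axis=1)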
Kernel-based methods: Kernel algorithms map the input data into a feature space that allows nonlinear separation. Kernel-based clustering groups the data implicitly in this space: the method performs a nonlinear mapping of the input data into a high-dimensional feature space by substituting the inner product between variables with an appropriately determined kernel [108] (see the sketch after the strengths and weaknesses below). Kernel k-means and Support Vector Clustering (SVC) are examples of this type of clustering.
• Strengths: These methods are able to identify nonlinear structures [109].
• Weaknesses: Large time complexity, and the algorithms are complex in nature [109].
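The sketch below illustrates kernel k-means operating purely on a Gram matrix K, expanding the feature-space distance to each cluster mean using kernel entries only; the RBF kernel suggested in the comment is one common choice, not mandated by the cited work.

    import numpy as np

    def kernel_kmeans(K, k=2, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        labels = rng.integers(k, size=n)
        for _ in range(iters):
            dist = np.full((n, k), np.inf)
            for c in range(k):
                idx = np.flatnonzero(labels == c)
                if len(idx) == 0:
                    continue  # empty cluster: leave its distances infinite
                # ||phi(x_i) - mu_c||^2 written purely with kernel entries
                dist[:, c] = (np.diag(K)
                              - 2.0 * K[:, idx].mean(axis=1)
                              + K[np.ix_(idx, idx)].mean())
            new_labels = dist.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels

    # e.g., with an RBF kernel (one common choice):
    # from sklearn.metrics.pairwise import rbf_kernel
    # labels = kernel_kmeans(rbf_kernel(X, gamma=1.0), k=2)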
We emphasize that the clustering task is highly dependent on the parameters, similarity measures, and methods used by the algorithm [65].
3.3 Linkage Measure
Hierarchical algorithms are clustering techniques that create a hierarchy of relationships between objects. They work in two modes: the agglomerative mode, which constructs clusters from isolated elements, and the divisive mode, in which the process starts with a broad set that is broken into parts until it reaches isolated elements [110]. The principal aspect of hierarchical algorithms is the selection of pairs based on the linkage function.
The linkage measure uses similarity or distance indices between the elements of the set. In this process, similar elements are grouped into a single cluster, reducing the total number of remaining clusters. One of the essential uses of the agglomerative method is to identify the point at which two groups of elements are too distinct to form a homogeneous cluster. As the distance values between samples increase, it is best to stop the clustering process to prevent the resulting groups from becoming too dissimilar [111].
Some of the linkage methods are the following (a usage sketch covering all of them follows the list):
• Maximum linkage or Complete Linkage: In this method, the distance between two clusters is the maximum distance between their elements, and the two clusters with the smallest such distance are merged. This process repeats until only a single cluster is left. Complete linkage is also called farthest neighbor linkage [112].
• Minimum linkage or Single Linkage: This measure estimates the distance between clusters using the elements with the smallest distance between them. For this reason, single linkage is sometimes called nearest neighbor linkage [111].
• Average linkage: Also known as UPGMA (Unweighted Pair Group Method with Arithmetic Mean). This measure merges clusters using the average pairwise proximity between the member individuals of the different groups [113]. The UPGMA method is related to its weighted variant, the WPGMA method [111].
• Centroid Method: Also known as UPGMC (Unweighted Pair Group Method with Centroid Averaging). In this type, the distance between the centroids of the groups defines the proximity between them [114]. UPGMC employs only Euclidean distances [115].
• Ward Method: Ward's linkage merges the two clusters whose union yields the smallest increase in the total sum of squared deviations from the cluster means [112].
• Median Method: In this form, the distance between two clusters is the median distance between an element in one cluster and an observation in the other cluster [112].
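As a usage sketch, scipy.cluster.hierarchy exposes the linkage measures above through its method argument; the toy data and the cut into two flat clusters are illustrative assumptions.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in ([0, 0], [3, 3])])

    # The `method` argument selects the linkage measure discussed above
    for method in ("single", "complete", "average", "centroid", "ward", "median"):
        Z = linkage(X, method=method)                    # full merge hierarchy
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
        print(method, np.bincount(labels)[1:])
    # scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding tree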
After deciding on the most appropriate method and applying it, we generate the clusters. The interpretation of the generated clusters remains a complex step, which aims to discover meanings related to the domain of the analyzed datasets.
3.5 Phylogeny
Phylogeny is the genealogical history of a group of organisms and a hypothetical representation of ancestral/descendant relationships [135].
Binary trees (rooted or unrooted), phylogenetic networks, and graphs are used to represent the relationships between species [136]. For rooted trees, the root describes the common ancestor of all species represented in the tree. When an ancestor is not specified or assumed, an unrooted tree is applicable for representing relationships between groups, regardless of the common ancestor. The nodes in a phylogeny denote taxonomic units or species.
Phylogeny differs from taxonomy: while the latter groups organisms by shared characteristics, phylogeny assumes a common ancestor and seeks to derive the relationships among its descendants [137].
Computational phylogenetics is the application of computational algorithms, methods, and programs to phylogenetic analysis. Biological phylogenetics inspires malware phylogeny research: we can adapt methods commonly used in biology to construct, evaluate, and compare phylogenetic trees in malicious program research [56].
Even though there are apparent differences between malicious software and species, by considering software products as species and source code as genes, software evolution can be investigated in the same way as biological evolution [138].
The most common tree-building methods apply distance, parsimony, maximum likelihood, and Bayesian approaches. For example, one of the most common methods of building phylogenetic trees is the Maximum Parsimony (MP) method [139]. This method assumes that the most accurate tree is the one that requires the fewest changes to produce the data contained in the alignment [140]. Another popular method is Maximum Likelihood Estimation (MLE), which is used to estimate the parameters of a statistical model.
The MP and MLE methods are character-based, i.e., they depend on the individual values of each aligned sequence. Other methods, known as distance-based methods, estimate the phylogeny by considering each taxon's sequence as a whole, often through a distance matrix. The simplest distance-based method is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). In this method, the sequence pair (or group of sequences) to be grouped first is the one with the shortest distance among all pairs (or groups).
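Since UPGMA coincides with average linkage over a distance matrix, the following sketch builds a small UPGMA tree with SciPy; the four taxa and their distances are hypothetical values chosen only for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree
    from scipy.spatial.distance import squareform

    taxa = ["A", "B", "C", "D"]                 # hypothetical taxa
    D = np.array([[0, 2, 6, 6],                 # illustrative distances
                  [2, 0, 6, 6],
                  [6, 6, 0, 4],
                  [6, 6, 4, 0]], dtype=float)

    # UPGMA is exactly average linkage applied to a condensed distance matrix
    Z = linkage(squareform(D), method="average")
    print(Z)            # each row: merged nodes, merge height, cluster size
    root = to_tree(Z)   # rooted binary tree of the merges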
Both character-based and distance-based methods have advantages and disadvantages. The UPGMA method, for example, is easy to implement but is not widely used for phylogenetic tree creation because it does not take into account different rates of evolutionary change between the sequences represented in the tree [141]. Character-based methods are more faithful in the evolutionary representation of phylogeny but are more computationally expensive.
One of the recurring problems in phylogenetic analysis is the size of the generated tree, especially in sets with thousands of specimens. This issue is not easily solved just by using zoom and panning tools [142]. Besides, no available technique provides a clear separation of hierarchical structure lines and labels.
Although tree-based models are the main ones in the literature, they are not suitable for representing cross-linking events, which can occur in malware generation [143]. Thus, phylogenetic networks emerged as an alternative way to model evolution. These networks are graphs in which each labeled leaf represents an instance, and nodes can have more than one parent.