US20160294852A1 - Determining string similarity using syntactic edit distance - Google Patents
Determining string similarity using syntactic edit distance Download PDFInfo
- Publication number
- US20160294852A1 US20160294852A1 US14/679,757 US201514679757A US2016294852A1 US 20160294852 A1 US20160294852 A1 US 20160294852A1 US 201514679757 A US201514679757 A US 201514679757A US 2016294852 A1 US2016294852 A1 US 2016294852A1
- Authority
- US
- United States
- Prior art keywords
- string
- domain name
- syntax
- query
- syntactic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 claims description 25
- 230000004044 response Effects 0.000 claims description 7
- 238000004422 calculation algorithm Methods 0.000 claims description 6
- 238000013500 data storage Methods 0.000 claims 1
- 238000004458 analytical method Methods 0.000 description 22
- 238000011524 similarity measure Methods 0.000 description 7
- 230000000694 effects Effects 0.000 description 5
- 230000002547 anomalous effect Effects 0.000 description 4
- 230000002159 abnormal effect Effects 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 108020001568 subdomains Proteins 0.000 description 3
- 230000009471 action Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000005856 abnormality Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000007635 classification algorithm Methods 0.000 description 1
- 238000007621 cluster analysis Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000009472 formulation Methods 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000001902 propagating effect Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000007480 spreading Effects 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012549 training Methods 0.000 description 1
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- H04L61/1511—
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/30—Managing network names, e.g. use of aliases or nicknames
- H04L61/301—Name conversion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/45—Network directories; Name-to-address mapping
- H04L61/4505—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
- H04L61/4511—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
Definitions
- Computer networks and the devices that operate on them often experience problems for a variety of reasons, e.g., due to misconfiguration, software bugs, and malicious network and computing device attacks. Detecting and preventing the use and spreading of malicious software, for example, is often a priority for computer network administrators. Malicious software is increasingly designed to avoid detection using increasingly sophisticated methods.
- FIG. 1 is a block diagram of an example computing device for determining string similarity using syntactic edit distance.
- FIG. 2A is an example data flow for determining string similarity using syntactic edit distance.
- FIG. 2B is a representation of an example data flow for using syntactic edit distance to cluster domain names.
- FIG. 3 is a flowchart of an example method for determining string similarity using syntactic edit distance.
- the ability to determine similarity between strings facilitates a variety of analytical processes, including string clustering and pattern recognition. These analytical processes may be used, for example, in the computer networking context, to detect abnormalities in domain name system (DNS) queries.
- DNS domain name system
- Abnormal DNS query activity may be indicative of malicious software activity. Accordingly, using domain name query similarities to cluster domain names and identify patterns may facilitate the identification of malicious software (“malware”) operating on devices issuing the DNS queries.
- DNS queries are a type of network traffic generally produced by a computing device operating on a computer network; the DNS queries include a string specifying a domain name and are addressed to a DNS server device for domain name resolution.
- the DNS server typically provides an IP address associated with the query domain name in response to the DNS query, e.g., a computing device that issues a DNS query for “www.example.com,” may be provided with a response, from a DNS server, indicating the IP address associated with the “www.example.com,” e.g., “123.456.789.012.”
- DNS queries may be produced by computing devices for many non-malicious purposes, some malware may use DNS queries for malicious purposes.
- malware may make use of a domain generation algorithm (DGA) to periodically generate domain names that can be used by command and control servers to provide infected computing devices with updates and/or commands.
- DGA domain generation algorithm
- Malware makes use of DGAs, as opposed to static domains, to prevent the malware command and control servers from being blacklisted.
- An infected computing device may periodically attempt to reach out to a large number of randomly generated domain names, only a portion of which are registered to malware command and control servers.
- a network administrator's ability to detect a computing device that is using a DGA to generate a large number of randomly generated domain names may facilitate the identification of infected computing devices on the administrator's network.
- a DNS query analyzing device may inspect DNS query packets sent from a client computing device. To identify and cluster similar domain names, the analyzing device may generate syntax strings for each domain name included in a DNS query by replacing each character of the domain name with a metacharacter that represents a category of characters, or syntactic group. After generating syntax strings for each domain name, the domain names can be clustered into groups of similar domain names using the syntax strings. By determining similarity and clustering based on syntax, rather than the actual characters, patterns, such as those used by DGAs may be detected,
- the syntax string for the domain name may be “LLLPCCCDDDPLLL,” where ‘L’ replaces lower-case letters, ‘P’ replaces punctuation, ‘C’ replaces capital letters, and ‘D’ replaces digits.
- the syntax string for a second domain name, “www.DYW846.com,” using the same metacharacters, would be the same as that of the first domain name, e.g., “LLLPCCCDDDPLLL.”When clustered according to similarity using the syntax strings, the two domain names would be deemed similar, even though many characters do not match.
- This type of clustering may help identify, for example, use of a DGA that creates pseudo-random domain names using three random capital letters followed by three random digits. Further details regarding the use of syntax strings to measure string similarity and cluster strings are discussed in the paragraphs that follow.
- FIG. 1 is a block diagram of an example computing device 100 for determining string similarity using syntactic edit distance.
- Computing device 100 may be, for example, a server computer, a personal computer, a mobile computing device, or any other electronic device suitable for processing network communications data.
- computing device 100 includes hardware processor 110 and machine-readable storage medium 120 .
- Hardware processor 110 may be one or more central processing units (CPUs), semiconductor-based microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 120 .
- Hardware processor 110 may fetch, decode, and execute instructions, such as 122 - 130 , to control the process for determining string similarity using syntactic edit distance.
- hardware processor 110 may include one or more electronic circuits that include electronic components for performing the functionality of one or more of instructions.
- a machine-readable storage medium, such as 120 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions.
- machine-readable storage medium 120 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, an optical disc, and the like.
- RAM Random Access Memory
- EEPROM Electrically Erasable Programmable Read-Only Memory
- storage medium 120 may be a non-transitory storage medium, where the term “non-transitory” does not encompass transitory propagating signals.
- machine-readable storage medium 120 may be encoded with a series of executable instructions: 122 - 130 , for determining string similarity using syntactic edit distance.
- the computing device 100 executes instructions to receive DNS query packets 142 that were sent by a client computing device 140 ( 122 ). While FIG. 1 depicts the computing device 100 receiving the DNS query packets 142 directly from the client computing device 140 , in some implementations, the DNS query packets 142 are received indirectly, e.g., through one or more intermediary devices, such as network routers and/or load balancers. The DNS query packets 142 may be received periodically, one at a time, and/or in batches, and each packet specifies a domain name.
- the computing device 100 executes instructions to generate, for each domain name included in the received DNS query packets 142 , a syntax string by replacing each character of the query domain name with a metacharacter.
- Each metacharacter represents a category of characters, and the category represented by a metacharacter is different from the categories of other metacharacters.
- the letter ‘L’ may be a metacharacter that represents all lower-case alphabetic letters
- the letter ‘C’ may be a metacharacter that represents all capital letters
- the letter ‘D’ may be a metacharacter that represents all numerical digits
- the letter ‘P’ may be a metacharacter that represents all punctuation marks.
- metacharacter used and the characters represented by the metacharacter, may vary, e.g., numbers, symbols, punctuation, and other characters may be used as metacharacters, and a variety of other character categories may be used, such as vowels, consonants, and Greek letters, to name a few.
- a DNS query packet may include the domain name, “www.1a2Bc3.com.”
- the syntax string for the foregoing domain name would be “LLLPDLDCLDPLLL.” While the metacharacters and the categories they represent may vary, as noted above, other portions of the syntax string may also vary.
- the sub-domain and/or top level domain may be removed or condensed, e.g., due to their ubiquity.
- metacharacters representing “www.” and “.com” may be removed, leaving a syntax string of “DLDCLD.”
- “www.” and “.com” may be represented by a single metacharacter, e.g., a ‘W’ for “www.” and a ‘c’ for “.com,” leaving a syntax string of “WDLDCLDc.”
- Other example categories of characters which may be used separately or in combination with those above, include: alphabetical letters, vowel letters, consonant letters, non-English letters or characters, dashes, underscores, unprintable characters, and specific punctuation marks. Other subsets of the characters included in a domain name may be used to generate a syntax string for that domain name.
- the computing device 100 executes instructions to determine, for each domain name included in the received DNS query packets 142 , a syntactic edit distance between the domain name and each other domain name included in the received DNS query packets 142 ( 126 ).
- the syntactic edit distance between query domain names is determined based on the syntax strings of the corresponding domain names.
- the syntactic edit distance between domain names may be determined based on an edit distance between the syntax strings of the corresponding domain names.
- Various edit distance methods may be used to determine the edit distance between two strings, e.g., Levenshtein, or LCS edit distance methods.
- a second domain name may be “home.example.net,” which has a syntax string—using the above example metacharacters—of “LLLLPLLLLLLLPLLL.”
- the edit distance, using Levenshtein edit distance method, between this syntax string and the first syntax string above (“LLLPDLDCLDPLLL”) is 6 , e.g., four substitutions and two insertions.
- edit distance of syntax strings may be converted to a similarity value, e.g., relative to the edit distance of the domain names. For example, the edit distance between “www.1a2Bc3.com” and “home.example.net” is 13.
- An example similarity value for the two domain names may be calculated using the following formula: 1 ⁇ (syntactic edit distance/edit distance). In this example, the resulting similarity value would be ⁇ 0.54(1 ⁇ (6/13)). Other methods may also be used for determining syntactic similarity between strings.
- a sorted syntax string is generated for each syntax string by sorting the metacharacters of each syntax string, and the sorted syntax string is used for determining syntactic edit distance and/or similarity.
- the syntax string of the first domain name may be sorted into “CDDDLLLLLLLLPP,” while the second syntax string may be sorted into “LLLLLLLLLLLLPP.” Though shown being sorted in alphabetical order, the order in which the metacharacters are sorted may vary.
- the edit distance between the two example sorted syntax strings is also 6.
- an example formula may be: 1 ⁇ (sorted syntactic edit distance/edit distance). In this example, the resulting similarity value would be ⁇ 0.54(1 ⁇ (6/13)).
- other methods may also be used for determining syntactic similarity between sorted syntax strings.
- the computing device 100 executes instructions to cluster each query domain name included in the received DNS query packets 142 into a cluster based on the syntactic edit distances ( 128 ).
- one or more clustering methods may be used to generate clusters of domain names based on syntactic edit distances. For example, similarity measures, such as one calculated in the examples above, may be used to cluster domain names into clusters of other syntactically similar domain names.
- a variety of clustering and/or classification methods, or combinations thereof, may be used to cluster domain names.
- Example clustering methods include hierarchical clustering, centroid based clustering, distribution based clustering, and density based clustering.
- the computing device 100 executes instructions to identify the client computing device 140 as a potential source of malicious software based on the clusters ( 130 ).
- the client computing device 140 may be identified as a potential source of malware in response to determining that one of the clusters includes a number of domain names that exceeds a threshold number of query domain names.
- the threshold number may, for example, be dynamic and depend upon the number of clustered domain names, e.g., a threshold of 30 domain names in a particular cluster may be used for a total of 100 clustered domain names, but a higher threshold may be appropriate when clustering 10,000 domain names.
- a high volume of DNS queries may be indicative of potentially malicious activity occurring on a client computing device.
- Other anomalies, or abnormal activity, in DNS query traffic may lead to identification of a device infected with malware or otherwise performing improperly.
- recognizing patterns in the randomly generated domain names may facilitate identification of a client computing device that is infected.
- Clustering domain names by some measure of syntactic similarity and/or sorted syntactic similarity, rather than similarity of the actual domain names, may help identify patterns in DNS queries and/or the use of a DGA.
- a variety of options may be used when performing clustering and cluster analysis to determine if DNS traffic originating from a client computing device is indicative of malware.
- identification of a pattern need not be limited to identification of a single anomalous cluster, e.g., multiple clusters of an abnormal size—abnormally large and/or also small—may provide an indication of a DGA.
- similarity between domain names is determined based on a subset of the domain name characters
- the particular subset of characters used may be more or less indicative of malware or a DGA than others.
- using syntax strings and similarity measures are based on only sub-domains, or a sub-domain and domain name without a top level domain (TLD), may facilitate identification of DGAs or other DNS anomalies.
- the status of a particular domain name query and/or DNS server response may be used to perform clustering and analysis.
- non-existent domains and/or sub-domains may be an additional feature upon which clustering and analysis is based.
- only domain names and/or sub-domains that are non-existent are clustered based on syntactic similarity. This and other implementations may facilitate identification of malware that uses a DGA to reach out to many domains, only a few of which are actually registered, e.g., to malware command and control servers.
- clusters may also be used to facilitate identification of malicious command and control servers. For example, in a situation where all domain names generated by a particular DGA fit into a specific number of clusters, domain names within those specific clusters that are actually registered may be in use by malicious command and control servers.
- DGAs while used by malicious software, may also be used by non-malicious entities.
- a content serving network may use a DGA to generate random or pseudo-random sub-domains used to serve content to clients.
- content serving networks may use a DGA having a regular or standard syntax.
- a single cluster may include a large number of domain names and be anomalous, but not malicious. Malicious implementation of a DGA may be detected and distinguished from this type of non-malicious DGA use in situations where a malicious DGA doesn't use a standard syntax, but instead uses a DGA that results in syntax with relatively high variance.
- FIG. 2A is an example data flow 200 for determining string similarity using syntactic edit distance
- FIG. 28 is a representation of an example data flow 260 for using syntactic edit distance to cluster domain names.
- the data flow 200 depicts a domain name analysis device 240 , which may be implemented by a computing device, such as the computing device 100 described above with respect to FIG. 1
- the client computing device(s) 210 and user device 250 may be any computing device suitable for network communications, such as a personal computer, mobile computer, virtual machine, server computer, or any combination thereof.
- the client computing device(s) 210 may be virtual machines operating within a private cloud computing network 215 .
- the client computing device(s) 210 may be configured to perform various services and/or run various applications.
- a client may rent computing resources, such as the client computing devices 210 , from the operator of a cloud computing network, such as the network 215 , for use in providing web services, such as an e-mail application, to end-users.
- DNS queries are one form of network communications that may originate from the client computing device(s) 210 , e.g., in the form of DNS query packets 212 .
- Each of the DNS query packets 212 is addressed to a DNS server which will perform domain name resolution on a particular domain name. For example, in a situation where the client computing device(s) 210 implement an e-mail application, a DNS query packet may be issued to identify the destination for an email addressed to “user@example.com.”
- Each of the DNS query packets 212 passes through an egress point 220 of the network 215 .
- Egress point(s) 220 may be, for example, routers operating within the private network 215 .
- the egress point(s) 220 also provide DNS query packets 212 to the domain name analysis device 240 .
- the DNS query packets 212 provided to the domain name analysis device 250 may include all or a subset of the DNS query packets 212 provided by the client computing device(s) 210 .
- the domain name analysis device 240 may be provided with all of the DNS query packets, or with a subset, e.g., only queries from a single client computing device at a time.
- the example data flow 260 depicts example domain names 265 included in some example DNS queries provided to the domain name analysis device 240 .
- the domain name analysis device 240 For each of the domain names 265 , the domain name analysis device 240 generates a syntax string by replacing a subset of the characters of the domain name with one of multiple metacharacters.
- the resulting syntax strings 270 are depicted in the example data flow 260 , where the letter ‘L’ represents all lower-case alphabetic letters, the letter “C” represents all capital letters, the letter ‘D’ represents all numerical digits, the letter ‘P’ represents all periods, and the letter ‘U’ represents the underscore symbol.
- the syntax strings 270 are generated without the top level domain and preceding period(s), e.g., without “.com,” or “.co.uk.”
- the domain name analysis device 240 generates sorted syntax strings 272 for each syntax string.
- Each of the sorted syntax strings 272 is generated by sorting the metacharacters of each syntax string.
- a variety of methods may be used to sort syntax strings, e.g., alphabetically, reverse alphabetically, or according to user/administrator preferences.
- the example sorted syntax strings 272 are sorted alphabetically, with the exception that the ‘U’ representing an underscore symbol is first.
- the domain name analysis device 240 determines a syntactic edit distance between the domain name and each other domain name included in the DNS query packets 218 .
- the syntactic edit distance between the domain name and each other domain name is determined based on the syntax string of the domain name and each syntax string of each other domain name.
- the syntactic edit distances 280 are shown for the domain, “www.example.com.”
- the edit distance between the syntactic string for “www.example.com” and the syntactic string for “2b13Ca.example.com,” for example, is 4.
- the example data flow 260 also shows the sorted syntactic edit distances 282 for the domain, “www.example.com.” Some of the sorted syntactic edit distances are the same as the syntactic edit distances, while others are different. Either or both distance measures may be determined by the domain name analysis device 240 for use in determining similarity between domain names.
- the domain name analysis device 240 clusters each of the domain names 265 into one of multiple clusters 290 .
- domain names are clustered based on their syntactic edit distance from www.example.com E.g., each domain name in ClusterA has a syntactic edit distance of 0 or 1, each domain name in ClusterB has a syntactic edit distance of 2 or 3, each domain name in ClusterC has a syntactic edit distance of 4, and each domain name in ClusterD has a syntactic edit distance of 5-6.
- the clusters 290 may be based on the sorted syntactic difference between domain names 265 .
- example data flow 265 depicts clusters 290 formed based on syntactic edit distance from one domain name
- many variations and other clustering and/or classification algorithms, or combinations thereof, may also be used to cluster the domain names.
- syntactic edit distances may be calculated for each domain name relative to all other domain names, and a clustering algorithm may use those distance measures to cluster the domain names.
- the domain name analysis device 240 determines, for each domain name, measures of similarity to each other domain name.
- domain names may he clustered using the similarity measures, in addition to or instead of using syntactic edit distances directly.
- the domain name analysis device 240 may determine an edit distances between each pair of domain names and calculate the measure of similarity between them using the edit distance and the syntactic edit distance.
- One example formula may be
- the sorted syntactic edit distance may be substituted for the syntactic edit distance in the above example formula.
- Use of a similarity formula may facilitate in the identification of domain names that appear dissimilar, but are syntactically similar, and thus potentially related by being generated from the same DGA.
- the measure of similarity between “www.example.com” and “2b13Ca.example.com,” using formula (1) above, is ⁇ 0.33 (1 ⁇ (4/6)).
- the measure of similarity between “www.example.com” and “90sw5T.example.com,” using formula (1) above, is 0.2 (1 ⁇ (4/5)).
- the foregoing similarity values would be the same, in this example, if the sorted syntactic edit distance were used instead of syntactic edit distance.
- the resulting measure of similarity is 0.5 (1 ⁇ (3/6)).
- the domain names may be clustered based on similarity measures, e.g., as opposed to using edit distances directly.
- similarity formulas and clustering algorithms, and combinations thereof may he used to determine similarity between and cluster domain names.
- the domain name analysis device 240 may determine the use of a DGA.
- other network anomalies may also detected, e.g., high DNS traffic volume.
- Use of a DGA may be determined, for example, based on the number of domain names in a particular cluster or clusters, relative to the number of domain names in the other clusters. For example, ClusterC of the clusters 290 has double the number of domain names in half of the edit distance range. In some implementations, this may trigger an anomaly.
- duster features may be used to detect a network anomaly, such as the use of a DGA, e.g., a relatively large number of clusters with a relatively large number of domain names in each, a single relatively large cluster, or a large number of clusters with relatively few domain names in each.
- various supervised and/or unsupervised machine learning techniques may also be used to determine cluster features that are indicative of network anomalies and/or DGAs.
- the domain name analysis device 240 provides anomaly data 242 to a user device 250 .
- the user device 250 while depicted internally to the network 215 , may in some implementations be external to the network 215 .
- the user device 250 may be a network administrator device, e.g., one that exercises control over client computing device(s) 210 , or a client device used by en entity to manage the client computing device(s) 210 from outside of the network 215 .
- Anomaly data 242 may he used for a variety of purposes, e.g., for logging, for training an anomaly prediction module, or to take an action, e.g., notify a client of anomalous network activity and/or shut down a client computing device that is potentially infected with malware.
- Various devices such as the egress point(s) 220 , domain name analysis device 240 , and user device 250 , are depicted separately in the example data flow 200 .
- the operations described above and additional operations may be performed, in whole or in part, by additional devices.
- one computing device may generate syntax strings, while a second computing device may perform the clustering.
- a temporary storage device may be used to store DNS query packets 212 until they may be retrieved by the domain name analysis device 240 .
- some functionality of various devices may be combined, e.g., the domain name analysis device 240 may also take action in response to identifying potential malware by directly communicating with a client computing device to shut it down.
- the domain name analysis device 240 may be integrated in an egress point of the network 215 . Other configurations of both the network 215 and the devices within or outside of the network 215 may also be used.
- FIG. 3 is a flowchart of an example method 300 determining string similarity using syntactic edit distance.
- the method 300 may be performed by a computing device, such as a computing device described in FIG. 1 .
- Other computing devices may also be used to execute method 300 .
- Method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as the storage medium 120 , and/or in the form of electronic circuitry.
- a first string of characters originating from a particular computing device is received ( 302 ).
- the first string of characters may represent a domain name being queried and included in a DNS query packet received from a client computing device.
- An example first string may be “www.example123.com.”
- a second string of characters originating from the particular computing device is received ( 304 ).
- the second string of characters may represent another domain name being queried in another DNS query packet from the client computing device.
- An example second string may be “www.uwqnci3p21.net.”
- a first syntax string is generated by replacing each character of the first string with one of a plurality of metacharacters ( 306 ).
- Each metacharacter represents a category of characters, and each category is different from each other category of characters represented by each other metacharacter.
- the example first string may be represented by the syntax string “LLLPLLLLLLLDDDc,” where ‘L’ represents lower-case letters, ‘P’ represents a period, ‘D’ represents a numerical digit, and ‘c’ represents “.com.”
- a single metacharacter is used to represent the TLD, though in some implementations each character of the TLD may be separately represented.
- a second syntax string is generated by replacing each character of the second string with one of the plurality of metacharacters ( 308 ).
- the second example string may be represented by the syntax string “LLLPLLLLLLDLDDn,” where the metacharacter ‘n’ represents “.net.”
- a measure of similarity is determined between the first string and the second string using a syntactic edit distance between the first string and the second string ( 310 ).
- the syntactic edit distance between the first and second string is determined based on the first and second syntax strings.
- a measure of similarity may be the edit distance between syntax strings.
- the measure of similarity between “www.example123.com” and “www.uwqnci3p21.net” is the edit distance between “LLLPLLLLLLLDDDc” and “LLLPLLLLLLDLDDn,” which is 3.
- the measure of similarity between the first and second string may be determined based on the edit distance between the first and second string. For example, using formula (1) described above with reference to FIGS. 2A and 2B , the measure of similarity between “www.example123.com” and “www.uwqnci3p21.net” is 0.75(1 ⁇ (3/12). This relative measure of similarity provides an indication of the syntactic similarity of the two strings.
- the particular computing device from which the first and second string are received is identified as a potential source of malicious software based on the measure of similarity between the first string and the second string. For example, if the similarity measure is above a predetermined threshold, the computing device may be identified as potentially infected with malware. In some implementations, receipt of additional strings may facilitate determining similarity between strings and clustering similar strings.
- additional strings of characters such as additional domain names in DNS queries
- syntax strings may be generated, e.g., using the same metacharacters used to generate the first and second syntax strings.
- measures of similarity may be determined between the first string, second, string, and other additional strings. In this situation, the result is a set of strings that each has a measure of similarity to each other string.
- Each string may be clustered into one of multiple clusters based on the measures of similarity.
- Various clustering algorithms may be used to cluster the strings, and the resulting clusters will generally each include strings that are more syntactically similar to other strings in the cluster than they are to strings included in other clusters.
- the clusters may be used to identify the particular computing device as a potential source of malicious software. E.g., as described above, large clusters and a large number of relatively large clusters, may be indicative of the use of a DGA, which may in turn be indicative of malware.
- examples provide a mechanism for detecting anomalies by analyzing DNS query packets and clustering syntactically similar domain names and potential applications of a system that is capable of determining string similarity using syntactic similarity.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
- Computer networks and the devices that operate on them often experience problems for a variety of reasons, e.g., due to misconfiguration, software bugs, and malicious network and computing device attacks. Detecting and preventing the use and spreading of malicious software, for example, is often a priority for computer network administrators. Malicious software is increasingly designed to avoid detection using increasingly sophisticated methods.
- The following detailed description references the drawings, wherein:
-
FIG. 1 is a block diagram of an example computing device for determining string similarity using syntactic edit distance. -
FIG. 2A is an example data flow for determining string similarity using syntactic edit distance. -
FIG. 2B is a representation of an example data flow for using syntactic edit distance to cluster domain names. -
FIG. 3 is a flowchart of an example method for determining string similarity using syntactic edit distance. - The ability to determine similarity between strings facilitates a variety of analytical processes, including string clustering and pattern recognition. These analytical processes may be used, for example, in the computer networking context, to detect abnormalities in domain name system (DNS) queries. Abnormal DNS query activity, as discussed in further detail below, may be indicative of malicious software activity. Accordingly, using domain name query similarities to cluster domain names and identify patterns may facilitate the identification of malicious software (“malware”) operating on devices issuing the DNS queries.
- DNS queries are a type of network traffic generally produced by a computing device operating on a computer network; the DNS queries include a string specifying a domain name and are addressed to a DNS server device for domain name resolution. The DNS server typically provides an IP address associated with the query domain name in response to the DNS query, e.g., a computing device that issues a DNS query for “www.example.com,” may be provided with a response, from a DNS server, indicating the IP address associated with the “www.example.com,” e.g., “123.456.789.012.” While DNS queries may be produced by computing devices for many non-malicious purposes, some malware may use DNS queries for malicious purposes.
- By way of example, malware may make use of a domain generation algorithm (DGA) to periodically generate domain names that can be used by command and control servers to provide infected computing devices with updates and/or commands. Malware makes use of DGAs, as opposed to static domains, to prevent the malware command and control servers from being blacklisted. An infected computing device may periodically attempt to reach out to a large number of randomly generated domain names, only a portion of which are registered to malware command and control servers. A network administrator's ability to detect a computing device that is using a DGA to generate a large number of randomly generated domain names may facilitate the identification of infected computing devices on the administrator's network.
- In particular, a DNS query analyzing device may inspect DNS query packets sent from a client computing device. To identify and cluster similar domain names, the analyzing device may generate syntax strings for each domain name included in a DNS query by replacing each character of the domain name with a metacharacter that represents a category of characters, or syntactic group. After generating syntax strings for each domain name, the domain names can be clustered into groups of similar domain names using the syntax strings. By determining similarity and clustering based on syntax, rather than the actual characters, patterns, such as those used by DGAs may be detected,
- For example, the syntax string for the domain name, “www.ABC123.com,” may be “LLLPCCCDDDPLLL,” where ‘L’ replaces lower-case letters, ‘P’ replaces punctuation, ‘C’ replaces capital letters, and ‘D’ replaces digits. The syntax string for a second domain name, “www.DYW846.com,” using the same metacharacters, would be the same as that of the first domain name, e.g., “LLLPCCCDDDPLLL.”When clustered according to similarity using the syntax strings, the two domain names would be deemed similar, even though many characters do not match. This type of clustering may help identify, for example, use of a DGA that creates pseudo-random domain names using three random capital letters followed by three random digits. Further details regarding the use of syntax strings to measure string similarity and cluster strings are discussed in the paragraphs that follow.
- Referring now to the drawings,
FIG. 1 is a block diagram of anexample computing device 100 for determining string similarity using syntactic edit distance.Computing device 100 may be, for example, a server computer, a personal computer, a mobile computing device, or any other electronic device suitable for processing network communications data. In the embodiment ofFIG. 1 ,computing device 100 includeshardware processor 110 and machine-readable storage medium 120. -
Hardware processor 110 may be one or more central processing units (CPUs), semiconductor-based microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 120.Hardware processor 110 may fetch, decode, and execute instructions, such as 122-130, to control the process for determining string similarity using syntactic edit distance. As an alternative or in addition to retrieving and executing instructions,hardware processor 110 may include one or more electronic circuits that include electronic components for performing the functionality of one or more of instructions. - A machine-readable storage medium, such as 120, may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. Thus, machine-
readable storage medium 120 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, an optical disc, and the like. In some implementations,storage medium 120 may be a non-transitory storage medium, where the term “non-transitory” does not encompass transitory propagating signals. As described in detail below, machine-readable storage medium 120 may be encoded with a series of executable instructions: 122-130, for determining string similarity using syntactic edit distance. - As shown in
FIG. 1 , thecomputing device 100 executes instructions to receiveDNS query packets 142 that were sent by a client computing device 140 (122). WhileFIG. 1 depicts thecomputing device 100 receiving theDNS query packets 142 directly from theclient computing device 140, in some implementations, theDNS query packets 142 are received indirectly, e.g., through one or more intermediary devices, such as network routers and/or load balancers. TheDNS query packets 142 may be received periodically, one at a time, and/or in batches, and each packet specifies a domain name. - The
computing device 100 executes instructions to generate, for each domain name included in the receivedDNS query packets 142, a syntax string by replacing each character of the query domain name with a metacharacter. Each metacharacter represents a category of characters, and the category represented by a metacharacter is different from the categories of other metacharacters. For example, the letter ‘L’ may be a metacharacter that represents all lower-case alphabetic letters, the letter ‘C’ may be a metacharacter that represents all capital letters, the letter ‘D’ may be a metacharacter that represents all numerical digits, and the letter ‘P’ may be a metacharacter that represents all punctuation marks. The metacharacter used, and the characters represented by the metacharacter, may vary, e.g., numbers, symbols, punctuation, and other characters may be used as metacharacters, and a variety of other character categories may be used, such as vowels, consonants, and Greek letters, to name a few. - By way of example, a DNS query packet may include the domain name, “www.1a2Bc3.com.” Using the example metacharacters above, the syntax string for the foregoing domain name would be “LLLPDLDCLDPLLL.” While the metacharacters and the categories they represent may vary, as noted above, other portions of the syntax string may also vary. In some implementations, the sub-domain and/or top level domain may be removed or condensed, e.g., due to their ubiquity. For example, metacharacters representing “www.” and “.com” may be removed, leaving a syntax string of “DLDCLD.” As another example, “www.” and “.com” may be represented by a single metacharacter, e.g., a ‘W’ for “www.” and a ‘c’ for “.com,” leaving a syntax string of “WDLDCLDc.” Other example categories of characters, which may be used separately or in combination with those above, include: alphabetical letters, vowel letters, consonant letters, non-English letters or characters, dashes, underscores, unprintable characters, and specific punctuation marks. Other subsets of the characters included in a domain name may be used to generate a syntax string for that domain name.
- The
computing device 100 executes instructions to determine, for each domain name included in the receivedDNS query packets 142, a syntactic edit distance between the domain name and each other domain name included in the received DNS query packets 142 (126). The syntactic edit distance between query domain names is determined based on the syntax strings of the corresponding domain names. For example, the syntactic edit distance between domain names may be determined based on an edit distance between the syntax strings of the corresponding domain names. Various edit distance methods may be used to determine the edit distance between two strings, e.g., Levenshtein, or LCS edit distance methods. - By way of example, a second domain name may be “home.example.net,” which has a syntax string—using the above example metacharacters—of “LLLLPLLLLLLLPLLL.” The edit distance, using Levenshtein edit distance method, between this syntax string and the first syntax string above (“LLLPDLDCLDPLLL”) is 6, e.g., four substitutions and two insertions. In some implementations, edit distance of syntax strings may be converted to a similarity value, e.g., relative to the edit distance of the domain names. For example, the edit distance between “www.1a2Bc3.com” and “home.example.net” is 13. An example similarity value for the two domain names may be calculated using the following formula: 1−(syntactic edit distance/edit distance). In this example, the resulting similarity value would be ˜0.54(1−(6/13)). Other methods may also be used for determining syntactic similarity between strings.
- In some implementations, a sorted syntax string is generated for each syntax string by sorting the metacharacters of each syntax string, and the sorted syntax string is used for determining syntactic edit distance and/or similarity. For example, the syntax string of the first domain name may be sorted into “CDDDLLLLLLLLPP,” while the second syntax string may be sorted into “LLLLLLLLLLLLLLPP.” Though shown being sorted in alphabetical order, the order in which the metacharacters are sorted may vary. The edit distance between the two example sorted syntax strings is also 6. In implementations where similarity values are used, an example formula may be: 1−(sorted syntactic edit distance/edit distance). In this example, the resulting similarity value would be ˜0.54(1−(6/13)). As with the similarity formula and calculation provided above, other methods may also be used for determining syntactic similarity between sorted syntax strings.
- The
computing device 100 executes instructions to cluster each query domain name included in the receivedDNS query packets 142 into a cluster based on the syntactic edit distances (128). In some implementations, one or more clustering methods may be used to generate clusters of domain names based on syntactic edit distances. For example, similarity measures, such as one calculated in the examples above, may be used to cluster domain names into clusters of other syntactically similar domain names. A variety of clustering and/or classification methods, or combinations thereof, may be used to cluster domain names. Example clustering methods include hierarchical clustering, centroid based clustering, distribution based clustering, and density based clustering. - The
computing device 100 executes instructions to identify theclient computing device 140 as a potential source of malicious software based on the clusters (130). In some implementations, theclient computing device 140 may be identified as a potential source of malware in response to determining that one of the clusters includes a number of domain names that exceeds a threshold number of query domain names. The threshold number may, for example, be dynamic and depend upon the number of clustered domain names, e.g., a threshold of 30 domain names in a particular cluster may be used for a total of 100 clustered domain names, but a higher threshold may be appropriate when clustering 10,000 domain names. - As noted above, a high volume of DNS queries may be indicative of potentially malicious activity occurring on a client computing device. Other anomalies, or abnormal activity, in DNS query traffic may lead to identification of a device infected with malware or otherwise performing improperly. For example, with respect to malware that uses DGAs to issue DNS queries, recognizing patterns in the randomly generated domain names may facilitate identification of a client computing device that is infected. Clustering domain names by some measure of syntactic similarity and/or sorted syntactic similarity, rather than similarity of the actual domain names, may help identify patterns in DNS queries and/or the use of a DGA.
- A variety of options may be used when performing clustering and cluster analysis to determine if DNS traffic originating from a client computing device is indicative of malware. For example, identification of a pattern need not be limited to identification of a single anomalous cluster, e.g., multiple clusters of an abnormal size—abnormally large and/or also small—may provide an indication of a DGA. In implementations where similarity between domain names is determined based on a subset of the domain name characters, the particular subset of characters used may be more or less indicative of malware or a DGA than others. For example, in some situations, using syntax strings and similarity measures are based on only sub-domains, or a sub-domain and domain name without a top level domain (TLD), may facilitate identification of DGAs or other DNS anomalies.
- In some situations, the status of a particular domain name query and/or DNS server response may be used to perform clustering and analysis. For example, non-existent domains and/or sub-domains—detected by intercepting DNS server response packets or independently querying a DNS server—may be an additional feature upon which clustering and analysis is based. In some implementations, for example, only domain names and/or sub-domains that are non-existent are clustered based on syntactic similarity. This and other implementations may facilitate identification of malware that uses a DGA to reach out to many domains, only a few of which are actually registered, e.g., to malware command and control servers. In addition, while identification of an anomalous cluster or clusters of domain names may indicate a particular client computing device is infected with malware, the clusters may also be used to facilitate identification of malicious command and control servers. For example, in a situation where all domain names generated by a particular DGA fit into a specific number of clusters, domain names within those specific clusters that are actually registered may be in use by malicious command and control servers.
- DGAs, while used by malicious software, may also be used by non-malicious entities. For example, a content serving network may use a DGA to generate random or pseudo-random sub-domains used to serve content to clients. In some situations, content serving networks may use a DGA having a regular or standard syntax. In this situation, a single cluster may include a large number of domain names and be anomalous, but not malicious. Malicious implementation of a DGA may be detected and distinguished from this type of non-malicious DGA use in situations where a malicious DGA doesn't use a standard syntax, but instead uses a DGA that results in syntax with relatively high variance. This may result, for example, in a relatively high number of dusters being formed which, in this situation, may be indicative of the use of a malicious DGA. Any of the above clustering and analysis methods, options, and features may be used in combination with other clustering and analysis methods, options, and features to facilitate identification of potential sources of malicious activity.
-
FIG. 2A is anexample data flow 200 for determining string similarity using syntactic edit distance, andFIG. 28 is a representation of an example data flow 260 for using syntactic edit distance to cluster domain names. Thedata flow 200 depicts a domainname analysis device 240, which may be implemented by a computing device, such as thecomputing device 100 described above with respect toFIG. 1 The client computing device(s) 210 and user device 250 may be any computing device suitable for network communications, such as a personal computer, mobile computer, virtual machine, server computer, or any combination thereof. For example, the client computing device(s) 210 may be virtual machines operating within a privatecloud computing network 215. The client computing device(s) 210 may be configured to perform various services and/or run various applications. By way of example, a client may rent computing resources, such as theclient computing devices 210, from the operator of a cloud computing network, such as thenetwork 215, for use in providing web services, such as an e-mail application, to end-users. - During their operation, client computing device(s) 210 may periodically communicate using various network communications protocols. DNS queries are one form of network communications that may originate from the client computing device(s) 210, e.g., in the form of
DNS query packets 212. Each of theDNS query packets 212 is addressed to a DNS server which will perform domain name resolution on a particular domain name. For example, in a situation where the client computing device(s) 210 implement an e-mail application, a DNS query packet may be issued to identify the destination for an email addressed to “user@example.com.” - Each of the
DNS query packets 212 passes through an egress point 220 of thenetwork 215. Egress point(s) 220 may be, for example, routers operating within theprivate network 215. In addition to forwarding DNS query packet(s) 212 to their destination DNS server(s) 230, the egress point(s) 220 also provideDNS query packets 212 to the domainname analysis device 240. TheDNS query packets 212 provided to the domain name analysis device 250 may include all or a subset of theDNS query packets 212 provided by the client computing device(s) 210. In situations where multiple client computing devices managed by multiple clients and user devices, the domainname analysis device 240 may be provided with all of the DNS query packets, or with a subset, e.g., only queries from a single client computing device at a time. The example data flow 260 depictsexample domain names 265 included in some example DNS queries provided to the domainname analysis device 240. - For each of the
domain names 265, the domainname analysis device 240 generates a syntax string by replacing a subset of the characters of the domain name with one of multiple metacharacters. The resulting syntax strings 270 are depicted in the example data flow 260, where the letter ‘L’ represents all lower-case alphabetic letters, the letter “C” represents all capital letters, the letter ‘D’ represents all numerical digits, the letter ‘P’ represents all periods, and the letter ‘U’ represents the underscore symbol. In the example data flow 260, the syntax strings 270 are generated without the top level domain and preceding period(s), e.g., without “.com,” or “.co.uk.” - In some implementations, as shown in the example data flow 260, the domain
name analysis device 240 generates sortedsyntax strings 272 for each syntax string. Each of the sortedsyntax strings 272 is generated by sorting the metacharacters of each syntax string. A variety of methods may be used to sort syntax strings, e.g., alphabetically, reverse alphabetically, or according to user/administrator preferences. The example sortedsyntax strings 272 are sorted alphabetically, with the exception that the ‘U’ representing an underscore symbol is first. - For each of the
domain names 265, the domainname analysis device 240 determines a syntactic edit distance between the domain name and each other domain name included in the DNS query packets 218. As noted above, the syntactic edit distance between the domain name and each other domain name is determined based on the syntax string of the domain name and each syntax string of each other domain name. In the example data flow 260, thesyntactic edit distances 280 are shown for the domain, “www.example.com.” The edit distance between the syntactic string for “www.example.com” and the syntactic string for “2b13Ca.example.com,” for example, is 4. The example data flow 260 also shows the sortedsyntactic edit distances 282 for the domain, “www.example.com.” Some of the sorted syntactic edit distances are the same as the syntactic edit distances, while others are different. Either or both distance measures may be determined by the domainname analysis device 240 for use in determining similarity between domain names. - Using the syntactic and/or sorted
syntactic edit distances name analysis device 240 clusters each of thedomain names 265 into one ofmultiple clusters 290. In the example data flow 260, domain names are clustered based on their syntactic edit distance from www.example.com E.g., each domain name in ClusterA has a syntactic edit distance of 0 or 1, each domain name in ClusterB has a syntactic edit distance of 2 or 3, each domain name in ClusterC has a syntactic edit distance of 4, and each domain name in ClusterD has a syntactic edit distance of 5-6. In implementations where sorted syntactic edit distance is used, theclusters 290 may be based on the sorted syntactic difference betweendomain names 265. - While the
example data flow 265 depictsclusters 290 formed based on syntactic edit distance from one domain name, many variations and other clustering and/or classification algorithms, or combinations thereof, may also be used to cluster the domain names. For example, syntactic edit distances may be calculated for each domain name relative to all other domain names, and a clustering algorithm may use those distance measures to cluster the domain names. - In some implementations, the domain
name analysis device 240 determines, for each domain name, measures of similarity to each other domain name. In this situation, domain names may he clustered using the similarity measures, in addition to or instead of using syntactic edit distances directly. For example, the domainname analysis device 240 may determine an edit distances between each pair of domain names and calculate the measure of similarity between them using the edit distance and the syntactic edit distance. One example formula may be -
Similarity=1=(syntactic edit distance/edit distance) (1) - In implementations where sorted syntactic edit distances are used, the sorted syntactic edit distance may be substituted for the syntactic edit distance in the above example formula. Use of a similarity formula, like the example formula above, may facilitate in the identification of domain names that appear dissimilar, but are syntactically similar, and thus potentially related by being generated from the same DGA.
- By way of example, the measure of similarity between “www.example.com” and “2b13Ca.example.com,” using formula (1) above, is ˜0.33 (1−(4/6)). The measure of similarity between “www.example.com” and “90sw5T.example.com,” using formula (1) above, is 0.2 (1−(4/5)). The foregoing similarity values would be the same, in this example, if the sorted syntactic edit distance were used instead of syntactic edit distance. When calculating similarity between “2b13Ca.example.com” and “90sw5T.example.com,” again using formula (1) above, the resulting measure of similarity is 0.5 (1−(3/6)). This indicates that, even though the edit distance (6) between these domain names is at least as high as the distance between the domain names in the previous examples, these domain names are syntactically more similar to one another. When using sorted syntactic edit distance in formula (1), this because even more apparent, as the similarity measure between “2b13Ca.example.com” and “90sw5T.example.com” is 1 (1−(0/6)). E.g., this indicates that the sorted syntax of the foregoing domain names is the same, using the above example metacharacters.
- In implementations where similarity measures are calculated between domain names, the domain names may be clustered based on similarity measures, e.g., as opposed to using edit distances directly. A variety of similarity formulas and clustering algorithms, and combinations thereof, may he used to determine similarity between and cluster domain names.
- Based on the
clusters 290, the domainname analysis device 240 may determine the use of a DGA. In some implementations, other network anomalies may also detected, e.g., high DNS traffic volume. Use of a DGA may be determined, for example, based on the number of domain names in a particular cluster or clusters, relative to the number of domain names in the other clusters. For example, ClusterC of theclusters 290 has double the number of domain names in half of the edit distance range. In some implementations, this may trigger an anomaly. As noted above, a variety of duster features may be used to detect a network anomaly, such as the use of a DGA, e.g., a relatively large number of clusters with a relatively large number of domain names in each, a single relatively large cluster, or a large number of clusters with relatively few domain names in each. In some implementations, various supervised and/or unsupervised machine learning techniques may also be used to determine cluster features that are indicative of network anomalies and/or DGAs. - In the
example data flow 200, the domainname analysis device 240 providesanomaly data 242 to a user device 250. The user device 250, while depicted internally to thenetwork 215, may in some implementations be external to thenetwork 215. The user device 250 may be a network administrator device, e.g., one that exercises control over client computing device(s) 210, or a client device used by en entity to manage the client computing device(s) 210 from outside of thenetwork 215.Anomaly data 242 may he used for a variety of purposes, e.g., for logging, for training an anomaly prediction module, or to take an action, e.g., notify a client of anomalous network activity and/or shut down a client computing device that is potentially infected with malware. - Various devices, such as the egress point(s) 220, domain
name analysis device 240, and user device 250, are depicted separately in theexample data flow 200. In some implementations, the operations described above and additional operations may be performed, in whole or in part, by additional devices. For example, one computing device may generate syntax strings, while a second computing device may perform the clustering. As another example, a temporary storage device may be used to storeDNS query packets 212 until they may be retrieved by the domainname analysis device 240. In addition, some functionality of various devices may be combined, e.g., the domainname analysis device 240 may also take action in response to identifying potential malware by directly communicating with a client computing device to shut it down. As another example, the domainname analysis device 240 may be integrated in an egress point of thenetwork 215. Other configurations of both thenetwork 215 and the devices within or outside of thenetwork 215 may also be used. -
FIG. 3 is a flowchart of anexample method 300 determining string similarity using syntactic edit distance. Themethod 300 may be performed by a computing device, such as a computing device described inFIG. 1 . Other computing devices may also be used to executemethod 300.Method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as thestorage medium 120, and/or in the form of electronic circuitry. - A first string of characters originating from a particular computing device is received (302). For example, the first string of characters may represent a domain name being queried and included in a DNS query packet received from a client computing device. An example first string may be “www.example123.com.”
- A second string of characters originating from the particular computing device is received (304). For example, the second string of characters may represent another domain name being queried in another DNS query packet from the client computing device. An example second string may be “www.uwqnci3p21.net.”
- A first syntax string is generated by replacing each character of the first string with one of a plurality of metacharacters (306). Each metacharacter represents a category of characters, and each category is different from each other category of characters represented by each other metacharacter. For example, the example first string may be represented by the syntax string “LLLPLLLLLLLDDDc,” where ‘L’ represents lower-case letters, ‘P’ represents a period, ‘D’ represents a numerical digit, and ‘c’ represents “.com.” In this example, a single metacharacter is used to represent the TLD, though in some implementations each character of the TLD may be separately represented.
- A second syntax string is generated by replacing each character of the second string with one of the plurality of metacharacters (308). Using the example metacharacters above, the second example string may be represented by the syntax string “LLLPLLLLLLDLDDn,” where the metacharacter ‘n’ represents “.net.”
- A measure of similarity is determined between the first string and the second string using a syntactic edit distance between the first string and the second string (310). The syntactic edit distance between the first and second string is determined based on the first and second syntax strings. For example, a measure of similarity may be the edit distance between syntax strings. In this situation, the measure of similarity between “www.example123.com” and “www.uwqnci3p21.net” is the edit distance between “LLLPLLLLLLLDDDc” and “LLLPLLLLLLDLDDn,” which is 3.
- In some implementations, the measure of similarity between the first and second string may be determined based on the edit distance between the first and second string. For example, using formula (1) described above with reference to
FIGS. 2A and 2B , the measure of similarity between “www.example123.com” and “www.uwqnci3p21.net” is 0.75(1−(3/12). This relative measure of similarity provides an indication of the syntactic similarity of the two strings. - In some implementations, the particular computing device from which the first and second string are received is identified as a potential source of malicious software based on the measure of similarity between the first string and the second string. For example, if the similarity measure is above a predetermined threshold, the computing device may be identified as potentially infected with malware. In some implementations, receipt of additional strings may facilitate determining similarity between strings and clustering similar strings.
- For example, additional strings of characters, such as additional domain names in DNS queries, may be received from the particular computing device. For each additional string, syntax strings may be generated, e.g., using the same metacharacters used to generate the first and second syntax strings. For each additional string, measures of similarity may be determined between the first string, second, string, and other additional strings. In this situation, the result is a set of strings that each has a measure of similarity to each other string.
- Each string may be clustered into one of multiple clusters based on the measures of similarity. Various clustering algorithms may be used to cluster the strings, and the resulting clusters will generally each include strings that are more syntactically similar to other strings in the cluster than they are to strings included in other clusters. The clusters may be used to identify the particular computing device as a potential source of malicious software. E.g., as described above, large clusters and a large number of relatively large clusters, may be indicative of the use of a DGA, which may in turn be indicative of malware.
- The foregoing disclosure describes a number of example implementations for determining string similarity using syntactic edit distance. As detailed above, examples provide a mechanism for detecting anomalies by analyzing DNS query packets and clustering syntactically similar domain names and potential applications of a system that is capable of determining string similarity using syntactic similarity.
Claims (15)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/679,757 US9479524B1 (en) | 2015-04-06 | 2015-04-06 | Determining string similarity using syntactic edit distance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/679,757 US9479524B1 (en) | 2015-04-06 | 2015-04-06 | Determining string similarity using syntactic edit distance |
Publications (2)
Publication Number | Publication Date |
---|---|
US20160294852A1 true US20160294852A1 (en) | 2016-10-06 |
US9479524B1 US9479524B1 (en) | 2016-10-25 |
Family
ID=57015635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/679,757 Active US9479524B1 (en) | 2015-04-06 | 2015-04-06 | Determining string similarity using syntactic edit distance |
Country Status (1)
Country | Link |
---|---|
US (1) | US9479524B1 (en) |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160065611A1 (en) * | 2011-07-06 | 2016-03-03 | Nominum, Inc. | Analyzing dns requests for anomaly detection |
US20170295196A1 (en) * | 2015-04-10 | 2017-10-12 | Hewlett Packard Enterprise Development Lp | Network anomaly detection |
CN107360185A (en) * | 2017-08-18 | 2017-11-17 | 中国移动通信集团海南有限公司 | A kind of assessing network method and system based on DNS behavioural characteristics |
US9882933B1 (en) * | 2017-05-17 | 2018-01-30 | Farsight Security, Inc. | Tokenization of domain names for domain name impersonation detection using matching |
CN107679029A (en) * | 2017-08-28 | 2018-02-09 | 昆明理工大学 | A kind of high accuracy English-language domain name similarity detection method |
US20180205753A1 (en) * | 2017-01-19 | 2018-07-19 | Hewlett Packard Enterprise Development Lp | Time-based detection of malware communications |
US10089448B1 (en) * | 2018-02-06 | 2018-10-02 | Didi Research America, Llc | System and method for program security protection |
CN108763271A (en) * | 2018-04-08 | 2018-11-06 | 浙江工业大学 | A kind of hospital department similarity analysis method of two subnetwork of combination and text |
WO2018213574A1 (en) * | 2017-05-17 | 2018-11-22 | Farsight Security, Inc. | System, method and domain name tokenization for domain name impersonation detection |
CN108924100A (en) * | 2018-06-20 | 2018-11-30 | 广东电网有限责任公司 | A kind of abnormal user recognition methods |
US10200405B2 (en) * | 2017-05-17 | 2019-02-05 | Farsight Security, Inc. | Tokenization of domain names for domain name impersonation detection using matching |
US20190102536A1 (en) * | 2017-09-29 | 2019-04-04 | ZenDesk, Inc. | Rate-limiting api calls for an account in a customer-relationship-management system based on predicted abusive behavior |
US10346611B1 (en) * | 2015-11-25 | 2019-07-09 | Symantec Corporation | Detecting malicious software |
US10360379B2 (en) * | 2015-12-21 | 2019-07-23 | F-Secure Corporation | Method and apparatus for detecting exploits |
US20190238562A1 (en) * | 2018-01-31 | 2019-08-01 | Entit Software Llc | Malware-infected device identifications |
US20200106791A1 (en) * | 2018-09-28 | 2020-04-02 | Fireeye, Inc. | Intelligent system for mitigating cybersecurity risk by analyzing domain name system traffic metrics |
US10701031B2 (en) * | 2015-05-27 | 2020-06-30 | Trend Micro Incorporated | Identifying algorithmically generated domains |
CN111507400A (en) * | 2020-04-16 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Application classification method and device, electronic equipment and storage medium |
US10742591B2 (en) | 2011-07-06 | 2020-08-11 | Akamai Technologies Inc. | System for domain reputation scoring |
US10878088B2 (en) * | 2015-08-18 | 2020-12-29 | Trend Micro Incorporated | Identifying randomly generated character strings |
CN112564928A (en) * | 2019-09-10 | 2021-03-26 | 华为技术有限公司 | Service classification method and equipment and Internet system |
CN112559559A (en) * | 2020-12-24 | 2021-03-26 | 中国建设银行股份有限公司 | List similarity calculation method and device, computer equipment and storage medium |
US10965697B2 (en) | 2018-01-31 | 2021-03-30 | Micro Focus Llc | Indicating malware generated domain names using digits |
CN113268972A (en) * | 2021-05-14 | 2021-08-17 | 东莞理工学院城市学院 | Intelligent calculation method, system, equipment and medium for appearance similarity of two English words |
US11108794B2 (en) | 2018-01-31 | 2021-08-31 | Micro Focus Llc | Indicating malware generated domain names using n-grams |
WO2021232243A1 (en) * | 2020-05-19 | 2021-11-25 | 深圳市欢太科技有限公司 | Cluster management method, cluster management apparatus, storage medium and electronic device |
CN113746952A (en) * | 2021-09-14 | 2021-12-03 | 京东科技信息技术有限公司 | DGA domain name detection method, device, electronic equipment and computer storage medium |
US11201848B2 (en) | 2011-07-06 | 2021-12-14 | Akamai Technologies, Inc. | DNS-based ranking of domain names |
US11210327B2 (en) * | 2017-07-28 | 2021-12-28 | Microsoft Technology Licensing, Llc | Syntactic profiling of alphanumeric strings |
US11245720B2 (en) | 2019-06-06 | 2022-02-08 | Micro Focus Llc | Determining whether domain is benign or malicious |
CN114416972A (en) * | 2021-12-10 | 2022-04-29 | 厦门市世纪网通网络服务有限公司 | DGA domain name detection method based on density improvement unbalance sample |
US11374968B1 (en) * | 2020-06-23 | 2022-06-28 | Amazon Technologies, Inc. | Detection of adversarial networks |
JP7494065B2 (en) | 2020-09-14 | 2024-06-03 | Kddi株式会社 | DETECTION DEVICE, DETECTION METHOD, AND DETECTION PROGRAM |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10362044B2 (en) * | 2017-08-08 | 2019-07-23 | International Business Machines Corporation | Identifying command and control endpoint used by domain generation algorithm (DGA) malware |
US10880319B2 (en) | 2018-04-26 | 2020-12-29 | Micro Focus Llc | Determining potentially malware generated domain names |
US10785188B2 (en) | 2018-05-22 | 2020-09-22 | Proofpoint, Inc. | Domain name processing systems and methods |
US11271963B2 (en) | 2018-12-20 | 2022-03-08 | Micro Focus Llc | Defending against domain name system based attacks |
US11562090B2 (en) | 2019-05-28 | 2023-01-24 | International Business Machines Corporation | Enforcing sensitive data protection in security systems |
US11973799B2 (en) | 2020-09-04 | 2024-04-30 | Proofpoint, Inc. | Domain name processing systems and methods |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090037399A1 (en) | 2007-07-31 | 2009-02-05 | Yahoo! Inc. | System and Method for Determining Semantically Related Terms |
US8707426B1 (en) | 2008-05-28 | 2014-04-22 | Symantec Corporation | Method and apparatus for resolving a cousin domain name to detect web-based fraud |
US8516585B2 (en) * | 2010-10-01 | 2013-08-20 | Alcatel Lucent | System and method for detection of domain-flux botnets and the like |
WO2013009713A2 (en) | 2011-07-08 | 2013-01-17 | Uab Research Foundation | Syntactical fingerprinting |
US9922190B2 (en) * | 2012-01-25 | 2018-03-20 | Damballa, Inc. | Method and system for detecting DGA-based malware |
US9166994B2 (en) * | 2012-08-31 | 2015-10-20 | Damballa, Inc. | Automation discovery to identify malicious activity |
US9245121B1 (en) * | 2013-08-09 | 2016-01-26 | Narus, Inc. | Detecting suspicious network behaviors based on domain name service failures |
WO2015138519A1 (en) * | 2014-03-11 | 2015-09-17 | Vectra Networks, Inc. | Method and system for detecting algorithm-generated domains |
US9654484B2 (en) * | 2014-07-31 | 2017-05-16 | Cisco Technology, Inc. | Detecting DGA-based malicious software using network flow information |
US10198579B2 (en) * | 2014-08-22 | 2019-02-05 | Mcafee, Llc | System and method to detect domain generation algorithm malware and systems infected by such malware |
US9560074B2 (en) * | 2014-10-07 | 2017-01-31 | Cloudmark, Inc. | Systems and methods of identifying suspicious hostnames |
-
2015
- 2015-04-06 US US14/679,757 patent/US9479524B1/en active Active
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160065611A1 (en) * | 2011-07-06 | 2016-03-03 | Nominum, Inc. | Analyzing dns requests for anomaly detection |
US9843601B2 (en) * | 2011-07-06 | 2017-12-12 | Nominum, Inc. | Analyzing DNS requests for anomaly detection |
US10742591B2 (en) | 2011-07-06 | 2020-08-11 | Akamai Technologies Inc. | System for domain reputation scoring |
US11201848B2 (en) | 2011-07-06 | 2021-12-14 | Akamai Technologies, Inc. | DNS-based ranking of domain names |
US20170295196A1 (en) * | 2015-04-10 | 2017-10-12 | Hewlett Packard Enterprise Development Lp | Network anomaly detection |
US10686814B2 (en) * | 2015-04-10 | 2020-06-16 | Hewlett Packard Enterprise Development Lp | Network anomaly detection |
US10701031B2 (en) * | 2015-05-27 | 2020-06-30 | Trend Micro Incorporated | Identifying algorithmically generated domains |
US10878088B2 (en) * | 2015-08-18 | 2020-12-29 | Trend Micro Incorporated | Identifying randomly generated character strings |
US10346611B1 (en) * | 2015-11-25 | 2019-07-09 | Symantec Corporation | Detecting malicious software |
US10360379B2 (en) * | 2015-12-21 | 2019-07-23 | F-Secure Corporation | Method and apparatus for detecting exploits |
US10681069B2 (en) * | 2017-01-19 | 2020-06-09 | Micro Focus Llc | Time-based detection of malware communications |
US20180205753A1 (en) * | 2017-01-19 | 2018-07-19 | Hewlett Packard Enterprise Development Lp | Time-based detection of malware communications |
US9882933B1 (en) * | 2017-05-17 | 2018-01-30 | Farsight Security, Inc. | Tokenization of domain names for domain name impersonation detection using matching |
US10200405B2 (en) * | 2017-05-17 | 2019-02-05 | Farsight Security, Inc. | Tokenization of domain names for domain name impersonation detection using matching |
WO2018213574A1 (en) * | 2017-05-17 | 2018-11-22 | Farsight Security, Inc. | System, method and domain name tokenization for domain name impersonation detection |
US11210327B2 (en) * | 2017-07-28 | 2021-12-28 | Microsoft Technology Licensing, Llc | Syntactic profiling of alphanumeric strings |
CN107360185A (en) * | 2017-08-18 | 2017-11-17 | 中国移动通信集团海南有限公司 | A kind of assessing network method and system based on DNS behavioural characteristics |
CN107679029A (en) * | 2017-08-28 | 2018-02-09 | 昆明理工大学 | A kind of high accuracy English-language domain name similarity detection method |
US20190102536A1 (en) * | 2017-09-29 | 2019-04-04 | ZenDesk, Inc. | Rate-limiting api calls for an account in a customer-relationship-management system based on predicted abusive behavior |
US10795987B2 (en) * | 2017-09-29 | 2020-10-06 | ZenDesk, Inc. | Rate-limiting API calls for an account in a customer-relationship-management system based on predicted abusive behavior |
US10965697B2 (en) | 2018-01-31 | 2021-03-30 | Micro Focus Llc | Indicating malware generated domain names using digits |
US20190238562A1 (en) * | 2018-01-31 | 2019-08-01 | Entit Software Llc | Malware-infected device identifications |
US11108794B2 (en) | 2018-01-31 | 2021-08-31 | Micro Focus Llc | Indicating malware generated domain names using n-grams |
US10911481B2 (en) * | 2018-01-31 | 2021-02-02 | Micro Focus Llc | Malware-infected device identifications |
WO2019156718A1 (en) * | 2018-02-06 | 2019-08-15 | Didi Research America, Llc | System and method for program security protection |
US10853457B2 (en) | 2018-02-06 | 2020-12-01 | Didi Research America, Llc | System and method for program security protection |
US10089448B1 (en) * | 2018-02-06 | 2018-10-02 | Didi Research America, Llc | System and method for program security protection |
CN108763271A (en) * | 2018-04-08 | 2018-11-06 | 浙江工业大学 | A kind of hospital department similarity analysis method of two subnetwork of combination and text |
CN108924100A (en) * | 2018-06-20 | 2018-11-30 | 广东电网有限责任公司 | A kind of abnormal user recognition methods |
US20200106791A1 (en) * | 2018-09-28 | 2020-04-02 | Fireeye, Inc. | Intelligent system for mitigating cybersecurity risk by analyzing domain name system traffic metrics |
US11245720B2 (en) | 2019-06-06 | 2022-02-08 | Micro Focus Llc | Determining whether domain is benign or malicious |
CN112564928A (en) * | 2019-09-10 | 2021-03-26 | 华为技术有限公司 | Service classification method and equipment and Internet system |
CN111507400A (en) * | 2020-04-16 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Application classification method and device, electronic equipment and storage medium |
WO2021232243A1 (en) * | 2020-05-19 | 2021-11-25 | 深圳市欢太科技有限公司 | Cluster management method, cluster management apparatus, storage medium and electronic device |
CN115517009A (en) * | 2020-05-19 | 2022-12-23 | 深圳市欢太科技有限公司 | Cluster management method, cluster management device, storage medium and electronic equipment |
US11374968B1 (en) * | 2020-06-23 | 2022-06-28 | Amazon Technologies, Inc. | Detection of adversarial networks |
JP7494065B2 (en) | 2020-09-14 | 2024-06-03 | Kddi株式会社 | DETECTION DEVICE, DETECTION METHOD, AND DETECTION PROGRAM |
CN112559559A (en) * | 2020-12-24 | 2021-03-26 | 中国建设银行股份有限公司 | List similarity calculation method and device, computer equipment and storage medium |
CN113268972A (en) * | 2021-05-14 | 2021-08-17 | 东莞理工学院城市学院 | Intelligent calculation method, system, equipment and medium for appearance similarity of two English words |
CN113746952A (en) * | 2021-09-14 | 2021-12-03 | 京东科技信息技术有限公司 | DGA domain name detection method, device, electronic equipment and computer storage medium |
CN114416972A (en) * | 2021-12-10 | 2022-04-29 | 厦门市世纪网通网络服务有限公司 | DGA domain name detection method based on density improvement unbalance sample |
Also Published As
Publication number | Publication date |
---|---|
US9479524B1 (en) | 2016-10-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9479524B1 (en) | Determining string similarity using syntactic edit distance | |
Schüppen et al. | {FANCI}: Feature-based automated {NXDomain} classification and intelligence | |
US9560063B2 (en) | Apparatus and method for detecting malicious domain cluster | |
US8260914B1 (en) | Detecting DNS fast-flux anomalies | |
Thomas et al. | Kindred domains: detecting and clustering botnet domains using DNS traffic | |
US10050986B2 (en) | Systems and methods for traffic classification | |
Schiavoni et al. | Phoenix: DGA-based botnet tracking and intelligence | |
US9781139B2 (en) | Identifying malware communications with DGA generated domains by discriminative learning | |
WO2022040698A1 (en) | Malicious traffic detection with anomaly detection modeling | |
US10673814B1 (en) | Domain name classification systems and methods | |
US9118704B2 (en) | Homoglyph monitoring | |
US10691795B2 (en) | Quantitative unified analytic neural networks | |
US10878088B2 (en) | Identifying randomly generated character strings | |
US10701031B2 (en) | Identifying algorithmically generated domains | |
US10262122B2 (en) | Analysis apparatus, analysis system, analysis method, and analysis program | |
CN110362996B (en) | Method and system for offline detection of PowerShell malicious software | |
US20180375884A1 (en) | Detecting user behavior activities of interest in a network | |
US11108794B2 (en) | Indicating malware generated domain names using n-grams | |
US10757029B2 (en) | Network traffic pattern based machine readable instruction identification | |
EP3242240A1 (en) | Malicious communication pattern extraction device, malicious communication pattern extraction system, malicious communication pattern extraction method and malicious communication pattern extraction program | |
US10965697B2 (en) | Indicating malware generated domain names using digits | |
Schiavoni et al. | Tracking and characterizing botnets using automatically generated domains | |
US20190238562A1 (en) | Malware-infected device identifications | |
US20240323223A1 (en) | Detecting visual similarity between dns fully qualified domain names | |
Lee et al. | DGA-based malware detection using DNS traffic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P., TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HAGEN, JOSIAH;REEL/FRAME:035342/0458 Effective date: 20150406 |
|
AS | Assignment |
Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P.;REEL/FRAME:037079/0001 Effective date: 20151027 |
|
AS | Assignment |
Owner name: TREND MICRO INCORPORATED, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP;REEL/FRAME:038303/0704 Effective date: 20160308 Owner name: TREND MICRO INCORPORATED, JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:TREND MICRO INCORPORATED;REEL/FRAME:038303/0950 Effective date: 20160414 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |