CN107682348A - Method and device for rapid discrimination of DGA domain names based on machine learning
- Publication number
- CN107682348A (application number CN201710976231.XA)
- Authority
- CN
- China
- Prior art keywords
- domain name
- domain
- feature
- training set
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/45—Network directories; Name-to-address mapping
- H04L61/4505—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
- H04L61/4511—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and device for rapid discrimination of DGA domain names based on machine learning, and relates to the technical field of network security. The method comprises: constructing a training set containing multiple DGA domain names and normal domain names; extracting the domain name features of each domain name in the training set; normalizing the domain name features to obtain a feature data set; and building a domain name classifier model based on the feature data set. By studying domain names, the method and device extract richer and more representative domain name features; normalizing the feature data speeds up training and testing, which improves computational efficiency; finally, a machine learning algorithm is trained on the feature data set to obtain the domain name classifier model, which improves generalization ability as well as discrimination accuracy.
Description
Technical field
The present invention relates to the technical field of network security, and in particular to a method and device for rapid discrimination of DGA domain names based on machine learning.
Background
A DGA domain name is one of a series of random domain names produced by a domain generation algorithm (DGA). The technique is common in botnets such as Conficker and Zeus: using a private random-string generation algorithm seeded with the date or another random seed, they generate a number of random-string domain names every day and then register some of them in order to commit fraud, spread malware, distribute pornographic content and carry out other malicious acts.
As techniques such as Domain-Flux and Fast-Flux are used ever more widely by attackers, network attacks that rely on DGA domain names have become stealthier and harder to trace. An infected machine in the botnet only has to generate the same random domain names with the same algorithm and hit one that has been registered; it then comes under the attacker's control and can be used to launch network attacks such as distributed denial of service and spam.
The traditional approach relies mainly on the experience of white-hat analysts; it consumes a great deal of manpower and can hardly cope with today's enormous workload. Another class of methods is based on feature construction: starting from a similarity measure, it computes a threshold from sample pairs and uses that threshold to decide whether the domain name under test is a DGA domain name. It uses a relatively simple similarity measure and considers only a narrow set of features, so it generalizes poorly and its accuracy is not high.
Summary of the invention
An object of the present invention is to provide a method and device for rapid discrimination of DGA domain names based on machine learning that can effectively alleviate the above problems.
The embodiments of the present invention are implemented as follows:
In a first aspect, an embodiment of the present invention provides a method for rapid discrimination of DGA domain names based on machine learning, the method comprising: constructing a training set containing multiple DGA domain names and normal domain names; extracting the domain name features of each domain name in the training set; normalizing the domain name features to obtain a feature data set; and building a domain name classifier model based on the feature data set.
In a second aspect, an embodiment of the present invention further provides a device for rapid discrimination of DGA domain names based on machine learning, comprising: a training set construction module, for constructing a training set containing multiple DGA domain names and normal domain names; a feature extraction module, for extracting the domain name features of each domain name in the training set; a normalization module, for normalizing the domain name features to obtain a feature data set; and a model building module, for building a domain name classifier model based on the feature data set.
The method and device for rapid discrimination of DGA domain names based on machine learning provided by the embodiments of the present invention first construct a training set containing multiple DGA domain names and normal domain names, which supplies enough samples for the subsequent building of the domain name classifier model. The domain name features of each domain name in the training set are then extracted, so that representative domain name features serve as the criterion for judging whether a domain name is a DGA domain name. The domain name features are next normalized to obtain a feature data set, which puts the features on a common scale and improves computational efficiency. Finally, a domain name classifier model is built based on the feature data set; the model obtained by machine learning training can then be used to test unknown domain names and to judge quickly and accurately whether a domain name under test is a DGA domain name. Compared with the prior art, the method and device extract richer and more representative domain name features by studying domain names; normalizing the feature data speeds up training and testing and so improves computational efficiency; and training a machine learning algorithm on the feature data set to obtain the domain name classifier model improves generalization ability as well as discrimination accuracy.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.
Fig. 1 is a structural block diagram of an electronic device that can be applied in the embodiments of the present invention;
Fig. 2 is a flow block diagram of the method for rapid discrimination of DGA domain names based on machine learning provided by the first embodiment of the present invention;
Fig. 3 is a flow block diagram of the sub-steps of step S210 in the first embodiment of the present invention;
Fig. 4 is a flow block diagram of the sub-steps of step S300 in the first embodiment of the present invention;
Fig. 5 is a flow block diagram of the sub-steps of step S310 in the first embodiment of the present invention;
Fig. 6 is a flow block diagram of the sub-steps of step S320 in the first embodiment of the present invention;
Fig. 7 is a flow block diagram of the sub-steps of step S330 in the first embodiment of the present invention;
Fig. 8 is a flow block diagram of the sub-steps of step S230 in the first embodiment of the present invention;
Fig. 9 is a flow block diagram of steps S500 and S510 provided by the first embodiment of the present invention;
Fig. 10 is a structural block diagram of the device for rapid discrimination of DGA domain names based on machine learning provided by the second embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings can be arranged and designed in a variety of configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item has been defined in one drawing, it need not be defined or explained again in subsequent drawings. In the description of the present invention, the terms "first", "second" and the like are used only to distinguish items and are not to be understood as indicating or implying relative importance.
Fig. 1 shows a structural block diagram of an electronic device 100 that can be applied in the embodiments of the present application. As shown in Fig. 1, the electronic device 100 may include a memory 110, a storage controller 120, a processor 130, a display screen 140, and the device for rapid discrimination of DGA domain names based on machine learning. For example, the electronic device 100 may be a personal computer (PC), a tablet computer, a smartphone, a personal digital assistant (PDA), or the like.
The memory 110, the storage controller 120, the processor 130 and the display screen 140 are electrically connected to one another, directly or indirectly, to enable the transmission or exchange of data. For example, these elements may be electrically connected through one or more communication buses or signal buses. The method for rapid discrimination of DGA domain names based on machine learning comprises at least one software function module that can be stored in the memory 110 in the form of software or firmware, for example the software function modules or computer programs included in the device for rapid discrimination of DGA domain names based on machine learning.
The memory 110 can store various software programs and modules, such as the program instructions and modules corresponding to the method and device for rapid discrimination of DGA domain names based on machine learning provided by the embodiments of the present application. By running the software programs and modules stored in the memory 110, the processor 130 performs various functional applications and data processing, that is, it implements the method for rapid discrimination of DGA domain names based on machine learning in the embodiments of the present application. The memory 110 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.
The processor 130 may be an integrated circuit chip with signal processing capability. It may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor.
In addition to performing the rapid discrimination of DGA domain names based on machine learning, the electronic device 100 used in the embodiments of the present invention may also have a display function: the display screen 140 can provide an interactive interface (such as a user interface) between the electronic device 100 and the user, or display image data for the user's reference. For example, it can display data such as the domain name training set built by the device for rapid discrimination of DGA domain names based on machine learning and the extracted domain name features.
Before the specific embodiments of the present invention are introduced, it should first be noted that the present invention is an application of computer technology in the field of information security. Implementing it involves a number of software function modules. The applicant considers that, after carefully reading the application documents and accurately understanding the implementation principle and purpose of the invention, and drawing on existing well-known techniques, a person skilled in the art can implement the invention entirely with the software programming skills they already have. All software function modules mentioned in the application documents fall into this category, and the applicant will not enumerate them one by one.
First embodiment
Referring to Fig. 2, this embodiment provides a method for rapid discrimination of DGA domain names based on machine learning, applied to a device for rapid discrimination of DGA domain names based on machine learning. The method includes:
Step S200: Construct a training set containing multiple DGA domain names and normal domain names;
In this embodiment, the DGA domain names may also be called positive examples; they may include DGA domain names generated by common DGA algorithms as well as malicious domain names obtained from open-source channels. The normal domain names may also be called negative examples; they may include domain names currently widely recognized as benign, for example the top-ranked domain names in the Alexa website ranking.
For example, the domain name "www.google.com" is a normal domain name.
Step S210: Extract the domain name features of each domain name in the training set;
In this embodiment, before the domain name features are extracted, each domain name in the training set may first be preprocessed to extract its representative main parts, for example its main domain name and its TLD (top-level domain) suffix, which is the last part of the domain name.
For example, for the domain name "www.google.com", the main domain name is google and the TLD suffix is com.
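The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_domain` is hypothetical, and it assumes the TLD suffix is simply the last dot-separated label and the main domain name is the label before it, so multi-label suffixes such as "co.uk" are not handled.

```python
def split_domain(domain: str) -> tuple[str, str]:
    """Return (main_domain, tld) for a dotted domain name.

    Assumption: the TLD suffix is the last label and the main domain name
    is the label immediately before it. Multi-label public suffixes such
    as "co.uk" are not handled in this simplified sketch.
    """
    labels = domain.strip().strip(".").lower().split(".")
    if len(labels) < 2:
        raise ValueError(f"expected at least two labels, got {domain!r}")
    return labels[-2], labels[-1]


main_domain, tld = split_domain("www.google.com")
print(main_domain, tld, len(main_domain))  # prints: google com 6
```

The length printed last is the main domain name length, which is the length feature used later in step S300.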
It will be understood that in this embodiment a single domain name feature may be extracted, for example discriminating DGA domain names by the main domain name alone. Several domain name features may also be extracted, for example by extracting the main domain name and TLD suffix of each domain name and deriving further features from them to refine the decision rule and improve the accuracy of DGA domain name discrimination. For example, the length of the main domain name, the linguistic characteristics of the main domain name, the character transition probability within the main domain name, and the TLD suffix of the domain name may be extracted together as the domain name features.
Step S220: Normalize the domain name features to obtain a feature data set;
In this embodiment, normalizing the domain name features extracted in the previous step puts the features on a common scale, improves computational efficiency, and facilitates the subsequent machine learning training and the building of the discrimination model.
Step S230: Build a domain name classifier model based on the feature data set.
In this embodiment, a machine learning algorithm may be trained on the feature data set to build the domain name classifier model. The domain name classifier model obtained by machine learning can identify DGA domain names quickly and accurately from their domain name features, and can be used to make predictions for unknown domain names.
Referring to Fig. 3, in this embodiment, step S210 may further include the following sub-steps:
Step S300: Extract the length feature of each domain name in the training set;
In this embodiment, the length feature of each domain name may be the length of its main domain name.
For example, for the domain name "www.google.com", the main domain name google has length 6.
Step S310: Extract the n-gram features of each domain name in the training set;
In this embodiment, n-gram refers to the n-gram language model: an n-gram is a run of n consecutive characters, and the frequency with which n-grams occur reflects the characteristics of a language.
For example, for n = 1, 2 and 3, the strings of n consecutive characters in the main domain name "google" of the domain name "www.google.com" are shown in Table 1:
Table 1
n    Strings of n consecutive characters
1    g, o, o, g, l, e
2    go, oo, og, gl, le
3    goo, oog, ogl, gle
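The strings of n consecutive characters for the example "google" can be enumerated with a few lines of code. The helper name `char_ngrams` is an illustrative assumption and does not come from the source.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("google", n))
```

A main domain name shorter than n yields an empty list, which later steps need to handle explicitly.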
Step S320: Extract the transition probability feature of each domain name in the training set;
Transition probability is a key concept in Markov chains. Suppose a Markov chain consists of m states and the historical data is converted into a sequence made up of these m states. Starting from any state, a single transition must lead to one of the states 1, 2, ..., m; the probability of such a transition between states is called the transition probability. Each character in a main domain name can be regarded as a state in a Markov chain, where the value of each state depends on a finite number of preceding states, usually taken to be one.
For example, for the main domain name "google" of the domain name "www.google.com", the transition probability is:
$$P = p(g) \times p(g \to o) \times p(o \to o) \times p(o \to g) \times p(g \to l) \times p(l \to e)$$
Step S330: Extract the TLD suffix feature of each domain name in the training set.
DGA domain names typically choose TLD suffixes that are cheap and loosely vetted, so the TLD suffix feature extracted from each domain name in this embodiment can serve as a basis for discriminating DGA domain names.
Referring to Fig. 4, in this embodiment, step S300 may further include the following sub-step:
Step S301: Extract the main domain name length of each domain name in the training set, and take the main domain name length of a particular main domain name as the length feature of that main domain name.
Because so many short domain names have already been registered, fewer and fewer short domain names remain available, so the main domain names of DGA domain names tend to be long. In this embodiment, the main domain name length of each domain name is extracted as the length feature and can be used to discriminate DGA domain names. For example, when the main domain name length of a domain name under test exceeds a certain threshold, the domain name can be considered highly likely to be a DGA domain name.
Referring to Fig. 5, in this embodiment, step S310 may further include the following sub-steps:
Step S311: Count the frequency with which each string of n consecutive characters occurs across all main domain names in the training set, and rank these frequencies from high to low;
Specifically, when n = 1, count the frequency of each single character across all main domain names in the training set and rank from high to low, giving $P_1(x_1)$;
when n = 2, count the frequency of each pair of consecutive characters across all main domain names in the training set and rank from high to low, giving $P_2(x_1x_2)$;
and so on: count the frequency of each string of n consecutive characters across all main domain names in the training set and rank from high to low, giving $P_n(x_1x_2 \ldots x_n)$.
In particular, as n grows, the association between characters weakens and the resulting frequency statistics become less informative, so n ≤ 3, with n an integer, is recommended.
Step S312: Based on the frequency ranking of the strings of n consecutive characters across all main domain names, calculate the mean and variance of the frequency rankings of the strings of n consecutive characters in a particular main domain name, and take that mean and variance as the n-gram feature of the particular main domain name.
Specifically, for a particular main domain name A = "$a_1a_2 \ldots a_{n-2}a_{n-1}a_n$", the mean and variance of the frequency rankings of its strings of n consecutive characters can be calculated as follows.
When n = 1, the mean and variance of the single-character rankings in the particular main domain name A are calculated from the ranking of single-character frequencies across all main domain names in the training set obtained in the preceding step.
When n = 2, the mean and variance of the rankings of the pairs of consecutive characters in A are calculated from the ranking of two-character frequencies across all main domain names in the training set obtained in the preceding step.
When n = 3, the mean and variance of the rankings of the triples of consecutive characters in A are calculated from the ranking of three-character frequencies across all main domain names in the training set obtained in the preceding step.
In this embodiment, the mean and variance of the frequency rankings of the strings of n consecutive characters in a main domain name serve as the n-gram feature of that main domain name. The smaller the mean ranking, the more frequently the n-grams in the main domain name occur, and the lower the probability that the domain name is a DGA domain name.
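Steps S311 and S312 can be sketched as below. The exact mean and variance formulas are not reproduced in this text, so the sketch assumes the ordinary arithmetic mean and population variance of the ranks. Two further choices are assumptions of the sketch, not statements from the source: ties in frequency are broken alphabetically, and an n-gram never seen in the training set receives a rank one past the last known rank. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_rank_table(main_domains, n):
    """Map each n-gram to its frequency rank (1 = most frequent)."""
    counts = Counter(g for d in main_domains for g in char_ngrams(d, n))
    # Descending count; ties broken alphabetically (assumption) for determinism.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: rank for rank, g in enumerate(ordered, start=1)}


def rank_mean_variance(main_domain, n, ranks):
    """Mean and population variance of the ranks of a name's n-grams."""
    unseen = len(ranks) + 1  # rank for n-grams absent from training (assumption)
    observed = [ranks.get(g, unseen) for g in char_ngrams(main_domain, n)]
    if not observed:  # name shorter than n: treat as entirely unseen
        return float(unseen), 0.0
    return mean(observed), pvariance(observed)


training = ["google", "good", "gogo"]  # toy data, for illustration only
ranks = build_rank_table(training, 1)
print(ranks)
print(rank_mean_variance("google", 1, ranks))
```

A name built from common characters gets a small mean rank, while a random-looking name built from rare or unseen n-grams gets a large one, which matches the interpretation given in the text.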
Referring to Fig. 6, in this embodiment, step S320 may further include the following sub-steps:
Step S321: Compute the Markov chain transition matrix from all main domain names in the training set;
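Specifically, the Markov chain transition matrix obtained from the statistics is:
$$
\begin{array}{c|ccccc}
 & a_1 & a_2 & \cdots & a_j & \cdots \\ \hline
a_1 & p(a_1 \to a_1) & p(a_1 \to a_2) & \cdots & p(a_1 \to a_j) & \cdots \\
a_2 & p(a_2 \to a_1) & p(a_2 \to a_2) & \cdots & p(a_2 \to a_j) & \cdots \\
\vdots & \vdots & \vdots & & \vdots & \\
a_i & p(a_i \to a_1) & p(a_i \to a_2) & \cdots & p(a_i \to a_j) & \cdots \\
\vdots & \vdots & \vdots & & \vdots &
\end{array}
$$
where $a_j$ is a character that occurs in the main domain names of the training set;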
Step S322: Based on the Markov chain transition matrix, calculate the transition probability of a particular main domain name, and take that transition probability as the transition probability feature of the particular main domain name.
Specifically, for a particular main domain name A = "$a_1a_2 \ldots a_{n-2}a_{n-1}a_n$", the transition probability can be calculated as:
$$P = p(a_1) \times p(a_1 \to a_2) \times \cdots \times p(a_{n-2} \to a_{n-1}) \times p(a_{n-1} \to a_n)$$
where each $p(\cdot)$ can be read directly from the transition matrix.
Research shows that normal domain names are common, readable and easy to remember, and their transition probability values tend to be large; DGA domain names are the opposite, and their transition probability values tend to be small. In this embodiment, the transition probability P is therefore used as a feature of the main domain name and can be used to discriminate DGA domain names.
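A first-order version of steps S321 and S322 can be sketched as follows. Several details are assumptions of this sketch rather than statements from the source: p(a1) is estimated as the fraction of training main domain names that start with that character, the transition matrix is stored as nested dictionaries, and any character or transition never seen in training gets a small floor probability instead of zero. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov(main_domains):
    """Estimate start probabilities and first-order transition probabilities."""
    start_counts = Counter()
    pair_counts = defaultdict(Counter)
    for name in main_domains:
        if not name:
            continue
        start_counts[name[0]] += 1
        for a, b in zip(name, name[1:]):
            pair_counts[a][b] += 1
    total = sum(start_counts.values())
    start = {c: k / total for c, k in start_counts.items()}
    # Each row of the transition matrix is normalized to sum to 1.
    trans = {
        a: {b: k / sum(nxt.values()) for b, k in nxt.items()}
        for a, nxt in pair_counts.items()
    }
    return start, trans


def transition_probability(name, start, trans, floor=1e-6):
    """P = p(a1) * p(a1->a2) * ... * p(a_{n-1}->a_n), with a floor for unseen events."""
    if not name:
        raise ValueError("main domain name must not be empty")
    p = start.get(name[0], floor)
    for a, b in zip(name, name[1:]):
        p *= trans.get(a, {}).get(b, floor)
    return p


start, trans = fit_markov(["google", "good"])  # toy data, for illustration only
p_google = transition_probability("google", start, trans)
p_random = transition_probability("glx", start, trans)
print(p_google, p_random)
```

Because the product shrinks with every additional character, a practical implementation might work with log-probabilities or normalize by length; the source does not say whether it does either.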
Referring to Fig. 7, in this embodiment, step S330 may further include the following sub-steps:
Step S331: Extract all the distinct TLD suffixes of the domain names in the training set and construct a TLD vector;
In this embodiment, one-hot encoding (OneHotEncoder) may be used for the TLD suffix.
Specifically, construct the TLD vector $(TLD_1, TLD_2, \ldots, TLD_N)$.
Step S332: For each sample TLD, set the value to 1 in the dimension of the TLD vector that corresponds to it and to 0 in the remaining dimensions, obtaining the TLD matrix;
In this embodiment, each sample TLD extracted from the training set takes the value 1 in its corresponding dimension and 0 in the remaining dimensions, which yields the TLD matrix.
Step S333: Based on the TLD matrix, obtain the TLD suffix feature of a particular domain name.
In this embodiment, if the domain name under test uses a TLD suffix that is obscure, non-mainstream, cheap and easy to get approved, the probability that it is a DGA domain name can be considered high.
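The construction of the TLD vector and TLD matrix in steps S331 to S333 can be illustrated with a small standard-library sketch. The source mentions OneHotEncoder; the hand-written encoder below is a stand-in used only to keep the example self-contained, and the function names, the sorted vocabulary order, and the all-zero row for an unseen TLD are assumptions.

```python
def build_tld_vocabulary(tlds):
    """Collect the distinct TLD suffixes; sorted so the column order is stable."""
    return sorted(set(tlds))


def one_hot(tld, vocabulary):
    """1 in the dimension matching this TLD, 0 elsewhere (all zeros if unseen)."""
    return [1 if tld == known else 0 for known in vocabulary]


sample_tlds = ["com", "net", "com", "top"]  # toy data, for illustration only
vocabulary = build_tld_vocabulary(sample_tlds)
tld_matrix = [one_hot(t, vocabulary) for t in sample_tlds]

print(vocabulary)
for row in tld_matrix:
    print(row)
```

Each row of `tld_matrix` is the TLD suffix feature of one sample, and the same `vocabulary` has to be reused when a domain name under test is encoded.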
In this embodiment, after steps S300, S310, S320 and S330 have been carried out, step S220 can normalize each of the domain name features.
Specifically, the length feature is normalized; the n-gram features are normalized for n = 1, n = 2 and n = 3 respectively; and the transition probability feature is normalized.
When the TLD suffix feature is normalized, the largest element of the TLD matrix obtained in step S332 is 1 and the smallest is 0, so normalizing each element leaves its value unchanged. The TLD suffix feature after normalization is therefore identical to the feature before normalization.
In particular, if the maximum and minimum of some column $TLD_i$ of the TLD matrix are equal, that column can be removed, because it provides no discriminating information.
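The normalization formulas themselves are not reproduced in this text. The remark that a matrix whose elements already range from 0 to 1 is left unchanged is consistent with min-max scaling, x' = (x - min) / (max - min), so the sketch below assumes that form. Dropping every constant column, rather than only constant TLD columns, is a further assumption that also avoids division by zero. The function name is hypothetical.

```python
def min_max_normalize(rows):
    """Scale each column to [0, 1]; drop columns whose minimum equals their maximum."""
    kept_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        if hi == lo:
            continue  # constant column carries no information (assumption: drop it)
        kept_columns.append([(v - lo) / (hi - lo) for v in column])
    if not kept_columns:
        return [[] for _ in rows]
    return [list(row) for row in zip(*kept_columns)]


# Toy rows: [main domain name length, a constant column, one TLD dimension]
features = [[6, 1, 0], [12, 1, 1], [9, 1, 0]]
normalized = min_max_normalize(features)
print(normalized)
```

The minimum and maximum of each column should be taken from the training set and reused unchanged when a domain name under test is normalized.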
Referring to Fig. 8, in this embodiment, step S230 may further include the following sub-steps:
Step S400: Perform feature dimensionality reduction on the feature data set to obtain a reduced feature data set;
In this embodiment, steps S210 and S220 above convert the sample data into a feature data set in a high-dimensional space. Performing dimensionality reduction on it greatly lowers the computational complexity, reduces the recognition errors caused by redundant information, and improves recognition accuracy.
In particular, the present invention uses the PCA dimensionality reduction method on the feature data set; this method can greatly reduce the number of feature dimensions while retaining most of the information.
Step S410: Train a GBDT classifier on the reduced feature data set to build the domain name classifier model.
In this embodiment, the reduced feature data set obtained in step S400 is used to train a GBDT (Gradient Boosting Decision Tree) classifier; once training is complete, the domain name classifier model is established. The domain name classifier model can classify the features of a domain name under test, so that DGA domain names are discriminated quickly and effectively and predictions can be made for unknown domain names.
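Steps S220, S400 and S410 can be chained as in the sketch below. The source names PCA and GBDT but no library; scikit-learn is used here as an assumption, and the feature values, the number of principal components and the number of trees are made up for illustration. Label 1 stands for a DGA domain name (positive example) and 0 for a normal domain name.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy rows: [length, mean n-gram rank, transition probability, TLD=com, TLD=top]
# All values are invented for illustration; they are not data from the source.
X_train = [
    [6, 2.1, 0.030, 1, 0],
    [8, 2.5, 0.020, 1, 0],
    [5, 1.9, 0.040, 1, 0],
    [7, 2.3, 0.025, 1, 0],
    [18, 9.4, 1e-9, 0, 1],
    [22, 10.2, 1e-11, 0, 1],
    [16, 8.8, 1e-8, 0, 1],
    [20, 9.9, 1e-10, 0, 1],
]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

model = Pipeline([
    ("scale", MinMaxScaler()),        # normalization (step S220)
    ("pca", PCA(n_components=2)),     # dimensionality reduction (step S400)
    ("gbdt", GradientBoostingClassifier(n_estimators=20, random_state=0)),  # step S410
])
model.fit(X_train, y_train)

# Domain names under test pass through the same scaling and PCA before prediction.
X_test = [[19, 9.6, 1e-9, 0, 1], [6, 2.0, 0.030, 1, 0]]
pred = model.predict(X_test)
print(list(pred))
```

Wrapping the three stages in one pipeline means the scaling ranges and principal components learned from the training set are reused automatically for domain names under test.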
Referring to Fig. 9, in this embodiment, the following steps may further be performed after step S230:
Step S500: Perform feature extraction, feature normalization and feature dimensionality reduction in turn on the domain name to be detected, obtaining the reduced feature data to be detected;
In this embodiment, feature extraction, feature normalization and feature dimensionality reduction may be performed in turn on the domain name to be detected using methods similar to steps S210, S220 and S400.
Step S510: Load the feature data to be detected into the domain name classifier model and judge whether the domain name to be detected is a DGA domain name.
The method for rapid discrimination of DGA domain names based on machine learning provided by this embodiment first extracts richer and more representative features by studying domain names, then reduces the feature dimensions with the principal component analysis (PCA) algorithm, which speeds up training and testing and so improves computational efficiency, and finally uses a machine learning algorithm to improve generalization ability as well as the accuracy of domain name discrimination.
Second embodiment
Referring to Fig. 10, this embodiment provides a device 600 for rapid discrimination of DGA domain names based on machine learning, which includes:
a training set construction module 610, for constructing a training set containing multiple DGA domain names and normal domain names;
a feature extraction module 620, for extracting the domain name features of each domain name in the training set;
a normalization module 630, for normalizing the domain name features to obtain a feature data set;
a model building module 640, for building a domain name classifier model based on the feature data set.
In summary, the machine-learning-based DGA domain name fast discrimination method and device provided by the embodiments of the present invention first build a training set containing multiple DGA domain names and normal domain names, which provides enough samples for subsequently establishing the domain name classifier model. The domain name features of each domain name in the training set are then extracted, so that representative domain name features serve as the criterion for judging whether a domain name is a DGA domain name. Next, the domain name features are normalized to obtain a feature set, which puts all feature dimensions on a common scale and improves computational efficiency. Finally, a domain name classifier model is established based on the feature set, so that the trained classifier can be used to check unknown domain names and to judge quickly and accurately whether a domain name under test is a DGA domain name. Compared with the prior art, the method and device extract richer and more representative domain name features through the study of domain names; normalizing the feature data speeds up training and testing and so improves computational efficiency; and training on the feature set with a machine learning algorithm to obtain the domain name classifier model improves generalization ability while improving discrimination accuracy.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
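The normalization step mentioned in the summary can be sketched as follows. This passage does not say which normalization formula is used, so min-max scaling to the range [0, 1] is an assumption here, as are the function name `fit_min_max` and the decision to clip values that fall outside the range seen during training.

```python
def fit_min_max(rows):
    """Learn per-dimension minimum and maximum from the training feature rows."""
    lows = [min(column) for column in zip(*rows)]
    highs = [max(column) for column in zip(*rows)]

    def transform(row):
        scaled_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant dimension carries no information; map it to 0.
            scaled = 0.0 if span == 0 else (value - low) / span
            # Clip values outside the training range (an illustrative choice).
            scaled_row.append(min(1.0, max(0.0, scaled)))
        return scaled_row

    return transform
```

For example, after fitting on the rows `[4.0, 100.0]` and `[12.0, 300.0]`, the row `[8.0, 200.0]` maps to `[0.5, 0.5]`, so a length-like feature and a rank-like feature end up on the same scale.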
Claims (10)
- 1. A machine-learning-based method for fast discrimination of DGA domain names, characterized in that the method comprises: building a training set containing multiple DGA domain names and normal domain names; extracting the domain name features of each domain name in the training set; normalizing the domain name features to obtain a feature set; and establishing a domain name classifier model based on the feature set.
- 2. The method according to claim 1, characterized in that extracting the domain name features of each domain name in the training set comprises: extracting the length feature of each domain name in the training set; extracting the n-gram feature of each domain name in the training set; extracting the transition probability feature of each domain name in the training set; and extracting the TLD suffix feature of each domain name in the training set.
- 3. The method according to claim 2, characterized in that extracting the length feature of each domain name in the training set comprises: extracting the main domain length of each domain name in the training set, and using the main domain length of a specific main domain as the length feature of that specific main domain.
- 4. The method according to claim 2, characterized in that extracting the n-gram feature of each domain name in the training set comprises: counting the frequency with which each sequence of n consecutive characters occurs across all main domains in the training set, and ranking these frequencies from high to low; based on this frequency ranking, calculating the mean and variance of the frequency ranks of the n-consecutive-character sequences that occur in a specific main domain, and using that mean and variance as the n-gram feature of the specific main domain.
- 5. The method according to claim 2, characterized in that extracting the transition probability feature of each domain name in the training set comprises: computing a Markov chain transition matrix from statistics over all main domains in the training set; based on the Markov chain transition matrix, calculating the transition probability of a specific main domain, and using that transition probability as the transition probability feature of the specific main domain.
- 6. The method according to claim 2, characterized in that extracting the TLD suffix feature of each domain name in the training set comprises: extracting all distinct TLD suffixes of the domain names in the training set to construct a TLD vector; for each sample TLD in the TLD vector, setting the value to 1 in the corresponding dimension and 0 in the remaining dimensions, to obtain a TLD matrix; and, based on the TLD matrix, obtaining the TLD suffix feature of a specific domain name.
- 7. The method according to claim 1, characterized in that establishing a domain name classifier model based on the feature set comprises: performing feature dimensionality reduction on the feature set to obtain a reduced feature set; and training on the reduced feature set with a GBDT classifier algorithm to establish the domain name classifier model.
- 8. The method according to claim 7, characterized in that performing feature dimensionality reduction on the feature set to obtain a reduced feature set comprises: performing feature dimensionality reduction on the feature set using the PCA dimensionality reduction method to obtain the reduced feature set.
- 9. The method according to claim 1, characterized in that the method further comprises: performing feature extraction, feature normalization, and feature dimensionality reduction in sequence on a domain name to be detected, to obtain reduced feature data to be detected; and loading the feature data to be detected into the domain name classifier model to judge whether the domain name to be detected is a DGA domain name.
- 10. A machine-learning-based device for fast discrimination of DGA domain names, characterized in that it comprises: a training set construction module, configured to build a training set containing multiple DGA domain names and normal domain names; a feature extraction module, configured to extract the domain name features of each domain name in the training set; a normalization module, configured to normalize the domain name features to obtain a feature set; and a model building module, configured to establish a domain name classifier model based on the feature set.
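The four feature types recited in claims 3 to 6 can be illustrated with a short sketch. Several details are assumptions rather than statements from the claims: the main domain is taken naively as the label just left of the TLD, unseen n-grams are given the rank after the last seen one, the transition probability of a main domain is summarized as an average log-probability with a small floor, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pvariance


def split_domain(domain):
    """Naive split into (main domain, TLD); ignores multi-label suffixes such as co.uk."""
    labels = domain.lower().strip(".").split(".")
    return (labels[-2] if len(labels) > 1 else labels[0]), labels[-1]


def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_ngram_ranks(mains, n):
    """Rank every n-gram seen in the training main domains; rank 1 is the most frequent."""
    counts = Counter(g for m in mains for g in ngrams(m, n))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(), start=1)}


def ngram_feature(main, ranks, n):
    """Mean and variance of the frequency ranks of the n-grams in one main domain."""
    unseen = len(ranks) + 1  # assumption: unseen n-grams rank just below the last seen one
    values = [ranks.get(g, unseen) for g in ngrams(main, n)] or [unseen]
    return mean(values), pvariance(values)


def fit_transitions(mains):
    """Estimate first-order Markov transition probabilities P(next char | current char)."""
    pair_counts = defaultdict(Counter)
    for m in mains:
        for a, b in zip(m, m[1:]):
            pair_counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in pair_counts.items()}


def transition_feature(main, probs, floor=1e-6):
    """Average log transition probability; the averaging and the floor are assumptions."""
    logs = [math.log(probs.get(a, {}).get(b, floor)) for a, b in zip(main, main[1:])]
    return sum(logs) / len(logs) if logs else math.log(floor)


def tld_one_hot(tld, vocabulary):
    """1 in the dimension of this TLD, 0 elsewhere (all zeros for an unknown TLD)."""
    return [1 if tld == known else 0 for known in vocabulary]


def extract_features(domain, ranks, probs, vocabulary, n=2):
    """Concatenate length, n-gram rank mean and variance, transition score, and TLD one-hot."""
    main, tld = split_domain(domain)
    rank_mean, rank_var = ngram_feature(main, ranks, n)
    return [len(main), rank_mean, rank_var, transition_feature(main, probs)] \
        + tld_one_hot(tld, vocabulary)
```

The statistics (`ranks`, `probs`, `vocabulary`) are fitted once on the training set and then reused for every domain to be detected, which matches the order of steps in claim 9. A production system would likely replace `split_domain` with a public-suffix-aware parser.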
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710976231.XA CN107682348A (en) | 2017-10-19 | 2017-10-19 | DGA domain name Quick method and devices based on machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710976231.XA CN107682348A (en) | 2017-10-19 | 2017-10-19 | DGA domain name Quick method and devices based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107682348A true CN107682348A (en) | 2018-02-09 |
Family
ID=61141747
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710976231.XA Pending CN107682348A (en) | 2017-10-19 | 2017-10-19 | DGA domain name Quick method and devices based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107682348A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104486461A (en) * | 2014-12-29 | 2015-04-01 | 北京奇虎科技有限公司 | Domain name classification method and device and domain name recognition method and system |
CN105072214A (en) * | 2015-08-28 | 2015-11-18 | 携程计算机技术(上海)有限公司 | C&C domain name identification method based on domain name feature |
CN105610830A (en) * | 2015-12-30 | 2016-05-25 | 山石网科通信技术有限公司 | Method and device for detecting domain name |
CN105871619A (en) * | 2016-04-18 | 2016-08-17 | 中国科学院信息工程研究所 | Method for n-gram-based multi-feature flow load type detection |
CN105897714A (en) * | 2016-04-11 | 2016-08-24 | 天津大学 | Botnet detection method based on DNS (Domain Name System) flow characteristics |
CN106713312A (en) * | 2016-12-21 | 2017-05-24 | 深圳市深信服电子科技有限公司 | Method and device for detecting illegal domain name |
Non-Patent Citations (4)
Title |
---|
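Anonymous: "Detecting DGA (Domain Generation Algorithm) Using Deep Learning", 《HTTPS://WWW.FREEBUF.COM/ARTICLES/NETWORK/139697.HTML》 * |
Zhou Min: "Manufacturing Informatization Engineering", 31 January 2017 * |
Zhao Yue: "Research on Botnet Detection Methods Based on DNS Traffic Features", Wanfang Database * |
Chen Min: "Introduction to Cognitive Computing", 31 May 2017 * |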
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110324273A (en) * | 2018-03-28 | 2019-10-11 | 蓝盾信息安全技术有限公司 | A kind of Botnet detection method combined based on DNS request behavior with domain name constitutive characteristic |
WO2019223587A1 (en) * | 2018-05-21 | 2019-11-28 | 新华三信息安全技术有限公司 | Domain name identification |
CN112771523A (en) * | 2018-08-14 | 2021-05-07 | 北京嘀嘀无限科技发展有限公司 | System and method for detecting a generated domain |
CN109450842A (en) * | 2018-09-06 | 2019-03-08 | 南京聚铭网络科技有限公司 | A kind of network malicious act recognition methods neural network based |
CN109450842B (en) * | 2018-09-06 | 2023-06-13 | 南京聚铭网络科技有限公司 | Network malicious behavior recognition method based on neural network |
CN111200576A (en) * | 2018-11-16 | 2020-05-26 | 慧盾信息安全科技(苏州)股份有限公司 | Method for realizing malicious domain name recognition based on machine learning |
CN109714356A (en) * | 2019-01-08 | 2019-05-03 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of abnormal domain name, device and electronic equipment |
CN110012122A (en) * | 2019-03-21 | 2019-07-12 | 东南大学 | A kind of domain name similarity analysis method of word-based embedded technology |
CN110266647A (en) * | 2019-05-22 | 2019-09-20 | 北京金睛云华科技有限公司 | It is a kind of to order and control communication check method and system |
CN110798481A (en) * | 2019-11-08 | 2020-02-14 | 杭州安恒信息技术股份有限公司 | Malicious domain name detection method and device based on deep learning |
CN110830607A (en) * | 2019-11-08 | 2020-02-21 | 杭州安恒信息技术股份有限公司 | Domain name analysis method and device and electronic equipment |
CN110830607B (en) * | 2019-11-08 | 2022-07-08 | 杭州安恒信息技术股份有限公司 | Domain name analysis method and device and electronic equipment |
CN112839012A (en) * | 2019-11-22 | 2021-05-25 | 中国移动通信有限公司研究院 | Zombie program domain name identification method, device, equipment and storage medium |
CN112839012B (en) * | 2019-11-22 | 2023-05-09 | 中国移动通信有限公司研究院 | Bot domain name identification method, device, equipment and storage medium |
CN111147459A (en) * | 2019-12-12 | 2020-05-12 | 北京网思科平科技有限公司 | C & C domain name detection method and device based on DNS request data |
CN111147459B (en) * | 2019-12-12 | 2021-11-30 | 北京网思科平科技有限公司 | C & C domain name detection method and device based on DNS request data |
CN113542202B (en) * | 2020-04-21 | 2022-09-30 | 深信服科技股份有限公司 | Domain name identification method, device, equipment and computer readable storage medium |
CN113542202A (en) * | 2020-04-21 | 2021-10-22 | 深信服科技股份有限公司 | Domain name identification method, device, equipment and computer readable storage medium |
CN113645173A (en) * | 2020-04-27 | 2021-11-12 | 北京观成科技有限公司 | Malicious domain name identification method, system and equipment |
CN113691489A (en) * | 2020-05-19 | 2021-11-23 | 北京观成科技有限公司 | Malicious domain name detection feature processing method and device and electronic equipment |
US20220417261A1 (en) * | 2021-06-23 | 2022-12-29 | Comcast Cable Communications, Llc | Methods, systems, and apparatuses for query analysis and classification |
CN115065567A (en) * | 2022-08-19 | 2022-09-16 | 北京金睛云华科技有限公司 | Plug-in execution method for DGA domain name studying and judging inference machine |
CN115065567B (en) * | 2022-08-19 | 2022-11-11 | 北京金睛云华科技有限公司 | Plug-in execution method for DGA domain name study and judgment inference machine |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107682348A (en) | DGA domain name Quick method and devices based on machine learning | |
CN111897970B (en) | Text comparison method, device, equipment and storage medium based on knowledge graph | |
CN111371806B (en) | Web attack detection method and device | |
CN109005145B (en) | Malicious URL detection system and method based on automatic feature extraction | |
WO2022104540A1 (en) | Cross-modal hash retrieval method, terminal device, and storage medium | |
CN113596007B (en) | Vulnerability attack detection method and device based on deep learning | |
CN110309304A (en) | A kind of file classification method, device, equipment and storage medium | |
CN106709345A (en) | Deep learning method-based method and system for deducing malicious code rules and equipment | |
CN109471938A (en) | A kind of file classification method and terminal | |
CN110602113B (en) | Hierarchical phishing website detection method based on deep learning | |
CN107341143A (en) | A kind of sentence continuity determination methods and device and electronic equipment | |
CN113691542A (en) | Web attack detection method based on HTTP request text and related equipment | |
CN112183672A (en) | Image classification method, and training method and device of feature extraction network | |
CN116150747A (en) | Intrusion detection method and device based on CNN and SLTM | |
CN110362995A (en) | It is a kind of based on inversely with the malware detection of machine learning and analysis system | |
CN113591077B (en) | Network attack behavior prediction method and device, electronic equipment and storage medium | |
CN112328657A (en) | Feature derivation method, feature derivation device, computer equipment and medium | |
CN111460783A (en) | Data processing method and device, computer equipment and storage medium | |
CN110958244A (en) | Method and device for detecting counterfeit domain name based on deep learning | |
CN112364198A (en) | Cross-modal Hash retrieval method, terminal device and storage medium | |
CN111625636A (en) | Man-machine conversation refusal identification method, device, equipment and medium | |
CN114826681A (en) | DGA domain name detection method, system, medium, equipment and terminal | |
CN112417886B (en) | Method, device, computer equipment and storage medium for extracting intention entity information | |
CN117235137B (en) | Professional information query method and device based on vector database | |
CN113657443A (en) | Online Internet of things equipment identification method based on SOINN network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180209 |