CN111209497B - DGA domain name detection method based on GAN and Char-CNN - Google Patents
- Publication number: CN111209497B
- Application number: CN202010007697.0A
- Authority: CN (China)
- Prior art keywords: domain name, layer, char, cnn, equal
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
Abstract
The invention provides a DGA domain name detection method based on GAN and Char-CNN, which addresses the low detection recall of low-randomness DGA domain names in the prior art. The method comprises the following steps: acquiring a training sample set and a verification sample set; constructing a generative adversarial network (GAN) and a character-level convolutional neural network (Char-CNN); iteratively training the GAN; acquiring an augmented training set; iteratively training the Char-CNN; and detecting domain names with the trained character-level convolutional neural network Char-CNN'. The method uses the GAN to generate adversarial domain names that augment the data set, which improves the richness of the training sample set; the residual block structure reduces the error rate of the detection model, which improves the detection recall of low-randomness DGA domain names; and the Char-CNN has few hyper-parameters to compute, which shortens the training time of the detection model.
Description
Technical Field
The invention belongs to the technical field of network security and relates to a DGA domain name detection method, in particular to a DGA domain name detection method based on GAN and Char-CNN, which can be used for locating infected hosts, taking down botnets, and defending against network attacks.
Background
A DGA domain name is a domain name generated periodically by a domain generation algorithm (DGA) from random seeds such as numbers, dates, and Twitter trending topics. Network attackers register DGA domain names as the medium through which bots communicate with command-and-control servers, and the large number of potential DGA domain names makes it difficult for law enforcement to shut down a botnet effectively. DGA domain names seriously threaten the security of network hosts; the emerging low-randomness DGA domain names in particular are highly concealed and pose a greater threat, so detecting DGA domain names effectively is of great significance. The DGA domain name detection task is to extract features of a domain name, compute on the extracted features, output a prediction probability, and thereby decide whether the domain name is a DGA domain name. Many metrics evaluate the detection effect, such as the receiver operating characteristic (ROC) curve, the F1 score, and the detection recall. The detection recall is the ratio of detected DGA domain names to all DGA domain names and is therefore an important evaluation metric.
DGA domain name detection methods can be classified into blacklist-based methods, machine-learning-based methods, and deep-learning-based methods. A blacklist-based method decides whether a domain name is a DGA domain name by checking whether it appears in a preset blacklist; because the blacklist must be updated continuously, the method has poor real-time performance. A machine-learning-based method manually extracts features of a domain name such as its length, information entropy, proportion of vowel characters, and number of repeated characters, and then detects DGA domain names with machine learning algorithms such as support vector machines and random forests, which allows real-time detection. A deep-learning-based method automatically extracts latent features of a domain name through a neural network model and outputs a prediction probability after neuron computation, thereby deciding whether the domain name is a DGA domain name.
To address the shortcomings of these methods, approaches that extract multidimensional features of domain names through an integrated neural network and then detect DGA domain names have been proposed in recent years. For example, the article "Integrated DGA domain name detection method based on deep learning", published in 2018 in volume 37, issue 10 of "Information Technology and Network Security" by Ralla Yun et al. (Middle Electric Great Wall Internet System Application Co., Ltd.), proposes an integrated DGA domain name detection method based on deep learning. The method integrates a recurrent neural network (RNN) and a convolutional neural network (CNN) and constructs an integrated detection model consisting of a character embedding layer, a feature extraction layer, and a classification layer. The feature extraction layer uses a CNN model and an RNN model to automatically extract features of the input characters along the spatial and temporal dimensions respectively, which effectively improves the detection recall of DGA domain names. However, this method still has disadvantages. The training sample set contains too few low-randomness DGA domain names and is low in richness, and gradient vanishing occurs when the network is too deep, which increases the error rate, so the detection recall of low-randomness DGA domain names is low. In addition, the computation at each time step of the RNN depends on the computation and output of the previous time step, so more hyper-parameters must be computed and the training time of the detection model increases.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a DGA domain name detection method based on GAN and Char-CNN that solves the problem of low detection recall of low-randomness DGA domain names in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set and a verification sample set:
(1a) selecting, in order, the first L popular domain names from the popular domain name set Alexa to form a training sample set A, where $L \ge 600000$;
(1b) randomly selecting M benign domain names from the benign domain name set TRANCO and labeling each with class 0, and randomly selecting N DGA domain names from the DGA domain name set DGArchive and labeling each with class 1; then combining $\alpha M$ benign domain names, $\alpha N$ DGA domain names, and their labels into a training sample set B, and combining the remaining $M - \alpha M$ benign domain names, the remaining $N - \alpha N$ DGA domain names, and their labels into a verification sample set, where $M \ge 100000$, $N \ge 100000$, and $0.6 \le \alpha \le 0.8$;
(2) constructing a generative adversarial network GAN and a character-level convolutional neural network Char-CNN:
constructing a generative adversarial network GAN comprising a generator network and a discriminator network, wherein the generator network comprises a fully-connected layer, a plurality of residual blocks, a one-dimensional convolutional layer, and an activation layer; the discriminator network comprises a one-dimensional convolutional layer, a plurality of residual blocks, and a fully-connected layer;
constructing a character-level convolutional neural network Char-CNN comprising an embedding layer, a plurality of one-dimensional convolutional layers, a plurality of activation layers, a plurality of one-dimensional max-pooling layers, a plurality of residual blocks, a Dropout layer, and a plurality of fully-connected layers;
(3) iteratively training the generative adversarial network GAN:
(3a) letting the iteration counter be $q_1$ and the maximum number of iterations be $Q_1$, where $Q_1 \ge 2000$, and setting $q_1 = 0$;
(3b) feeding random $noise_1$ to the generator network to obtain m adversarial domain name vectors, and at the same time encoding m popular domain names randomly selected from the training sample set A to obtain m popular domain name vectors, where $64 \le m \le L$;
(3c) feeding the m adversarial domain name vectors and the m popular domain name vectors to the discriminator network to obtain a probability set $\{\hat{d}_1, \ldots, \hat{d}_i, \ldots, \hat{d}_m, d_1, \ldots, d_j, \ldots, d_m\}$, where $\hat{d}_i$ is the probability that the i-th adversarial domain name vector originates from the training sample set A, $d_j$ is the probability that the j-th popular domain name vector originates from the training sample set A, $1 \le i \le m$, and $1 \le j \le m$;
(3d) computing the loss $loss_g$ of the generator network and the loss $loss_d$ of the discriminator network from the probability set;
(3e) training the GAN with the Adam algorithm using $loss_g$ and $loss_d$, and judging whether $q_1 = Q_1$ holds; if so, obtaining the trained generative adversarial network GAN'; otherwise letting $q_1 = q_1 + 1$ and returning to step (3b);
(4) obtaining an augmented training set:
(4a) feeding random $noise_2$ to the trained generative adversarial network GAN' to obtain P adversarial domain name vectors, and decoding each adversarial domain name vector to obtain P adversarial domain names of class 1, where $20000 \le P \le L$;
(4b) labeling the class of each adversarial domain name, and adding the P adversarial domain names and their labels to the training sample set B to obtain the augmented training set;
(5) iteratively training the character-level convolutional neural network Char-CNN:
(5a) letting the iteration counter be $q_2$ and the maximum number of iterations be $Q_2$, where $Q_2 \ge 1000$, and setting $q_2 = 0$;
(5b) encoding n domain names randomly selected from the augmented training set to obtain n domain name vectors, and feeding the n domain name vectors to Char-CNN to obtain a probability set $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, where $p_k$ is the probability that the class of the k-th domain name is 1, $1 \le k \le n$, and $32 \le n \le (\alpha M + \alpha N + P)$;
(5c) computing the loss $loss$ of Char-CNN from $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$;
(5d) training Char-CNN with the RMSprop algorithm using the value of $loss$ to obtain the trained model $\text{Char-CNN}_{q_2}$;
(5e) encoding c verification domain names randomly selected from the verification sample set to obtain c verification domain name vectors, and feeding them to $\text{Char-CNN}_{q_2}$ to obtain a probability set $\{\hat{p}_1, \ldots, \hat{p}_v, \ldots, \hat{p}_c\}$, where $\hat{p}_v$ is the probability that the class of the v-th verification domain name is 1, $1 \le v \le c$, and $32 \le c \le (M - \alpha M + N - \alpha N)$;
(5f) computing the detection Accuracy of the c verification samples from the probability set;
(5g) judging whether $q_2 = Q_2$ holds or whether Accuracy no longer increases; if so, obtaining the trained character-level convolutional neural network Char-CNN'; otherwise letting $q_2 = q_2 + 1$ and returning to step (5b);
(6) detecting the domain name based on the trained character-level convolutional neural network Char-CNN':
(6a) letting the number of domain names to be detected be t, and encoding each of them to obtain t domain name vectors to be detected, where $t \ge 1$;
(6b) feeding the t domain name vectors to be detected to the trained character-level convolutional neural network Char-CNN' to obtain a probability set $\{\tilde{p}_1, \ldots, \tilde{p}_u, \ldots, \tilde{p}_t\}$, and judging whether $\tilde{p}_u > 0.5$ holds; if so, the u-th domain name to be detected is a DGA domain name, otherwise it is a non-DGA domain name, where $\tilde{p}_u$ is the probability that the class of the u-th domain name to be detected is 1 and $1 \le u \le t$.
Compared with the prior art, the invention has the following advantages:
First, the invention generates adversarial domain names with the generative adversarial network GAN. The generator network and the discriminator network of the GAN are trained together and play against each other, so the generated adversarial domain names closely imitate low-randomness popular domain names. At the same time, the residual block alleviates the gradient-vanishing problem of deep networks by transforming the learned target function and reduces the error rate of the detection model, which further improves the detection recall of low-randomness DGA domain names. Simulation results show that the detection recall is improved by 28.3 percent compared with the prior art.
Second, the invention detects DGA domain names with the character-level convolutional neural network Char-CNN. Char-CNN learns local features through convolution and aggregates them into overall features, so it has fewer hyper-parameters to compute than a recurrent neural network (RNN). In addition, the residual block in Char-CNN is structurally simple and learns quickly, which shortens the training time of the detection model.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a structural diagram of the residual block used in the generative adversarial network GAN and the character-level convolutional neural network Char-CNN of the present invention;
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
(1) acquiring a training sample set and a verification sample set:
(1a) selecting, in order, the first L popular domain names from the popular domain name set Alexa to form a training sample set A, where $L \ge 600000$;
(1b) randomly selecting M benign domain names from the benign domain name set TRANCO and labeling each with class 0, and randomly selecting N DGA domain names from the DGA domain name set DGArchive and labeling each with class 1; then combining $\alpha M$ benign domain names, $\alpha N$ DGA domain names, and their labels into a training sample set B, and combining the remaining $M - \alpha M$ benign domain names, the remaining $N - \alpha N$ DGA domain names, and their labels into a verification sample set, where $M \ge 100000$, $N \ge 100000$, and $0.6 \le \alpha \le 0.8$;
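The split in step (1b) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_samples`, the default `alpha=0.7`, and the seeded shuffle are assumptions, and the inputs stand in for samples already drawn from TRANCO and DGArchive.

```python
import random


def split_samples(benign, dga, alpha=0.7, seed=0):
    """Label samples and split them into training set B and a verification set.

    benign, dga: lists of domain-name strings (stand-ins for TRANCO / DGArchive draws).
    alpha: fraction of each class placed in training set B (0.6 <= alpha <= 0.8).
    """
    if not 0.6 <= alpha <= 0.8:
        raise ValueError("alpha is expected to lie in [0.6, 0.8]")
    rng = random.Random(seed)                 # hypothetical fixed seed for repeatability
    benign = [(d, 0) for d in benign]         # class 0: benign
    dga = [(d, 1) for d in dga]               # class 1: DGA
    rng.shuffle(benign)
    rng.shuffle(dga)
    cut_b = round(alpha * len(benign))        # alpha * M
    cut_d = round(alpha * len(dga))           # alpha * N
    train_b = benign[:cut_b] + dga[:cut_d]
    verify = benign[cut_b:] + dga[cut_d:]     # remaining M - alpha*M and N - alpha*N
    return train_b, verify
```

Splitting each class separately keeps the benign/DGA ratio the same in both sets, which matches the per-class counts given in the step.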
(2) constructing a generative adversarial network GAN and a character-level convolutional neural network Char-CNN:
constructing a generative adversarial network GAN comprising a generator network and a discriminator network, wherein the generator network comprises a fully-connected layer, a plurality of residual blocks, a one-dimensional convolutional layer, and an activation layer; the discriminator network comprises a one-dimensional convolutional layer, a plurality of residual blocks, and a fully-connected layer;
constructing a character-level convolutional neural network Char-CNN comprising an embedding layer, a plurality of one-dimensional convolutional layers, a plurality of activation layers, a plurality of one-dimensional max-pooling layers, a plurality of residual blocks, a Dropout layer, and a plurality of fully-connected layers;
Referring to FIG. 2, each residual block in the generator network, the discriminator network, and the character-level convolutional neural network Char-CNN comprises 2 activation layers and 2 one-dimensional convolutional layers: first activation layer → first one-dimensional convolutional layer → second activation layer → second one-dimensional convolutional layer, where the activation function of the activation layers is ReLU; the output space dimension of the one-dimensional convolutional layers is 128, the convolution kernel size is 5, and the stride of the convolution kernel is 1 character. The input x of the first activation layer and the output f(x) of the second one-dimensional convolutional layer are added through a skip connection, so the target function finally learned by the residual block is $h(x) = \eta f(x) + x$, where $\eta$ is a weight coefficient and $0 \le \eta \le 1$.
An ordinary deep network learns the target function $h(x) = f(x)$ directly, and its gradient can vanish during back-propagation as the network deepens. The residual block transforms the learned target function into $h(x) = \eta f(x) + x$, whose derivative with respect to x contains the constant term 1, so the gradient-vanishing problem of the deep network is alleviated. This reduces the error rate of the detection model and improves the detection recall of low-randomness DGA domain names; in addition, the residual block is structurally simple and learns quickly, which shortens the training time of the detection model.
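The residual block described above can be illustrated with a small pure-Python sketch. It is deliberately simplified and rests on assumptions: it uses a single channel instead of 128, zero "same" padding (implied by the skip addition, which needs input and output of equal length), and hypothetical helper names `relu`, `conv1d_same`, and `residual_block`.

```python
def relu(xs):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in xs]


def conv1d_same(xs, kernel):
    """Single-channel 1-D convolution, stride 1, zero 'same' padding (odd kernel size)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(xs))
    ]


def residual_block(xs, kernel1, kernel2, eta=0.3):
    """h(x) = eta * f(x) + x, with f = conv2(relu(conv1(relu(x))))."""
    f = conv1d_same(relu(conv1d_same(relu(xs), kernel1)), kernel2)
    return [eta * fv + xv for fv, xv in zip(f, xs)]
```

With all-zero kernels the block reduces to the identity, `residual_block(x, [0.0] * 5, [0.0] * 5) == x`, which shows the skip path that carries the signal and its gradient even when the convolutional path contributes nothing.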
The generator network and the discriminator network of the generative adversarial network GAN each contain 5 residual blocks, where:
the specific structure of the generator network is: fully-connected layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → one-dimensional convolutional layer → activation layer, where the fully-connected layer has an input space dimension of 128 and an output space dimension of 128 × 63; the weight coefficients of all the residual blocks are 0.3; the output space dimension of the one-dimensional convolutional layer is 38, the convolution kernel size is 1, and the stride of the convolution kernel is 1 character; the activation function of the activation layer is Softmax;
the specific structure of the discriminator network is: one-dimensional convolutional layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → fully-connected layer, where the input space dimension of the one-dimensional convolutional layer is 38, the output space dimension is 128, the convolution kernel size is 1, and the stride of the convolution kernel is 1 character; the weight coefficients of all the residual blocks are 0.3; the output space dimension of the fully-connected layer is 1;
the character-level convolutional neural network Char-CNN contains 2 one-dimensional convolutional layers, 4 activation layers, 2 one-dimensional max-pooling layers, 3 residual blocks, and 2 fully-connected layers, and its specific structure is: embedding layer → first one-dimensional convolutional layer → first activation layer → first one-dimensional max-pooling layer → second one-dimensional convolutional layer → second activation layer → second one-dimensional max-pooling layer → first fully-connected layer → first residual block → second residual block → third residual block → third activation layer → Dropout layer → second fully-connected layer → fourth activation layer, where the embedding layer has an input space dimension of 38, an output space dimension of 128, and a sequence length of 63; the output space dimension of all the one-dimensional convolutional layers is 128 and the stride of the convolution kernel is 1 character, the convolution kernel size of the first one-dimensional convolutional layer is 3, and the convolution kernel size of the second one-dimensional convolutional layer is 2; the activation functions of the first, second, and third activation layers are all ThresholdReLU, and the activation function of the fourth activation layer is Sigmoid; all the one-dimensional max-pooling layers use "same" padding with a pooling window size of 2; the weight coefficients of all the residual blocks are 0.3; the drop rate of the Dropout layer is 0.5; the output space dimension of the first fully-connected layer is 64 and the output space dimension of the second fully-connected layer is 1.
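To make the generator dimensions above easier to follow, the sketch below traces tensor shapes through the generator. It is illustrative only and makes assumptions the text does not state: the 128 × 63 output of the fully-connected layer is reshaped to 63 positions × 128 channels, and the residual-block convolutions use "same" padding so that the skip addition is shape-compatible. The function names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride=1, padding="same"):
    """Output length of a 1-D convolution for 'same' or 'valid' padding."""
    if padding == "same":
        return -(-length // stride)            # ceil(length / stride)
    return (length - kernel) // stride + 1     # 'valid': no padding


def generator_shapes(noise_dim=128, seq_len=63, channels=128, vocab=38, n_blocks=5):
    """Return (layer name, output shape) pairs for the generator."""
    shapes = [("noise", (noise_dim,))]
    shapes.append(("fully-connected", (channels * seq_len,)))
    shapes.append(("reshape (assumed)", (seq_len, channels)))
    length = seq_len
    for b in range(1, n_blocks + 1):
        # two kernel-5, stride-1 convolutions per residual block
        length = conv1d_out_len(length, kernel=5)
        length = conv1d_out_len(length, kernel=5)
        shapes.append((f"residual block {b}", (length, channels)))
    length = conv1d_out_len(length, kernel=1)
    shapes.append(("conv1d, kernel 1", (length, vocab)))
    shapes.append(("softmax", (length, vocab)))   # one distribution over 38 symbols per position
    return shapes
```

Under these assumptions the generator emits a 63 × 38 array, one probability distribution over the 38-symbol vocabulary for each of the 63 character positions, which is the same shape the discriminator's first convolutional layer (input dimension 38) expects. With "valid" padding a single kernel-5 convolution would shrink the sequence from 63 to 59 and the skip addition would no longer line up.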
(3) Iteratively training the generative adversarial network GAN:
(3a) letting the iteration counter be $q_1$ and the maximum number of iterations be $Q_1$, where $Q_1 \ge 2000$, and setting $q_1 = 0$;
(3b) generating random $noise_1$ with the random.normal function of the third-party Python library NumPy, feeding $noise_1$ to the generator network to obtain m adversarial domain name vectors, and at the same time encoding m popular domain names randomly selected from the training sample set A to obtain m popular domain name vectors, where $64 \le m \le L$;
(3c) feeding the m adversarial domain name vectors and the m popular domain name vectors to the discriminator network to obtain a probability set $\{\hat{d}_1, \ldots, \hat{d}_i, \ldots, \hat{d}_m, d_1, \ldots, d_j, \ldots, d_m\}$, where $\hat{d}_i$ is the probability that the i-th adversarial domain name vector originates from the training sample set A, $d_j$ is the probability that the j-th popular domain name vector originates from the training sample set A, $1 \le i \le m$, and $1 \le j \le m$;
(3d) computing the loss $loss_g$ of the generator network and the loss $loss_d$ of the discriminator network from $\{\hat{d}_1, \ldots, \hat{d}_m, d_1, \ldots, d_m\}$ (the calculation formulas are not reproduced in this text);
(3e) training the GAN with the Adam algorithm using $loss_g$ and $loss_d$, and judging whether $q_1 = Q_1$ holds; if so, obtaining the trained generative adversarial network GAN'; otherwise letting $q_1 = q_1 + 1$ and returning to step (3b);
(4) obtaining an augmented training set:
(4a) generating random $noise_2$ with the random.normal function of the third-party Python library NumPy, feeding $noise_2$ to the trained generative adversarial network GAN' to obtain P adversarial domain name vectors, and decoding each adversarial domain name vector to obtain P adversarial domain names of class 1, where $20000 \le P \le L$;
(4b) labeling the class of each adversarial domain name, and adding the P adversarial domain names and their labels to the training sample set B to obtain the augmented training set;
The adversarial domain names produced by the game between the generator network and the discriminator network of the GAN closely imitate low-randomness popular domain names. Because they are generated by an algorithm and have low randomness, they can be regarded as low-randomness DGA domain names; adding them to the training sample set improves its richness and effectively improves the detection recall of low-randomness DGA domain names.
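Step (4) can be sketched as below for the part that turns generator output into labelled samples. The sketch is hedged: it assumes the generator emits one probability row over 38 symbols per character position, that the most probable symbol is taken at each position, and that the vocabulary is the 37 characters a-z, 0-9, and "-" with index 0 reserved for padding. None of these details is stated explicitly in the text, and the function names are hypothetical.

```python
import string

# Assumed vocabulary: 37 valid characters, index 0 reserved for padding (38 symbols total).
ALPHABET = string.ascii_lowercase + string.digits + "-"
INDEX_TO_CHAR = {i + 1: ch for i, ch in enumerate(ALPHABET)}


def decode_generator_output(prob_rows):
    """Map per-position probability rows to a domain string, skipping padding (index 0)."""
    indices = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return "".join(INDEX_TO_CHAR[i] for i in indices if i != 0)


def label_adversarial(domains):
    """Attach class 1 to every generated adversarial domain name, as in step (4b)."""
    return [(d, 1) for d in domains]


def augment(training_set_b, adversarial_domains):
    """Augmented training set = training sample set B + labelled adversarial samples."""
    return list(training_set_b) + label_adversarial(adversarial_domains)
```

The labelled pairs use the same `(domain, class)` form as training sample set B, so the augmented set can be sampled uniformly in step (5b).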
(5) Performing iterative training on a character-level convolutional neural network Char-CNN:
(5a) let the iteration count be $q_2$ and the maximum number of iterations be $Q_2$, where $Q_2 \geq 1000$, and set $q_2 = 0$;
(5b) encoding n domain names randomly selected from the augmented training set to obtain n domain name vectors, and predicting with the n domain name vectors as the input of the character-level convolutional neural network Char-CNN to obtain a probability set $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, where $p_k$ is the probability that the category of the k-th domain name is 1, $1 \leq k \leq n$, and $32 \leq n \leq (\alpha M + \alpha N + P)$;
(5c) calculating the loss of the character-level convolutional neural network Char-CNN according to $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, wherein the calculation formula is as follows:
wherein $y_k$ is the true category of the k-th domain name;
(5d) training the character-level convolutional neural network Char-CNN using the RMSprop algorithm and the loss value to obtain a trained Char-CNN model $\text{Char-CNN}_{q_2}$;
(5e) encoding c verification domain names randomly selected from the verification sample set to obtain c verification domain name vectors, and predicting with the c verification domain name vectors as the input of $\text{Char-CNN}_{q_2}$ to obtain a probability set, in which the v-th element is the probability that the category of the v-th verification domain name is 1, $1 \leq v \leq c$, and $32 \leq c \leq (M - \alpha M + N - \alpha N)$;
(5f) calculating the detection Accuracy of the c verification samples according to this probability set, wherein the calculation formula is as follows:
$$\mathrm{Accuracy} = \frac{tp + tn}{c}$$
wherein tp is the number of samples, among the c verification samples, whose true category is 1 and whose predicted probability of category 1 is greater than 0.5; tn is the number of samples, among the c verification samples, whose true category is 0 and whose predicted probability of category 1 is not greater than 0.5;
(5g) judging whether $q_2 = Q_2$ holds or whether Accuracy no longer increases; if so, obtaining the trained character-level convolutional neural network Char-CNN'; otherwise, letting $q_2 = q_2 + 1$ and returning to step (5b);
The character-level convolutional neural network Char-CNN is a feedforward neural network with a deep structure that includes convolution computations. It combines locally learned features into global features and can therefore fully extract latent features. Compared with a recurrent neural network (RNN), it has fewer hyper-parameters to compute, and the residual blocks in the convolutional neural network have a simple structure and learn quickly, which shortens the training time of the detection model.
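The quantities used in steps (5c), (5f) and (5g) can be sketched in plain Python. This is only an illustration under stated assumptions: the loss formula is not reproduced in this text, so mean binary cross-entropy is assumed here because the network outputs a single probability per domain name; the `patience` rule is a hypothetical way of deciding that Accuracy "no longer increases"; all function names are illustrative.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over a batch (ASSUMED loss, not confirmed by the source)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)

def accuracy(probs, labels, threshold=0.5):
    """Accuracy = (tp + tn) / c with the 0.5 threshold described in step (5f)."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p > threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p <= threshold)
    return (tp + tn) / len(probs)

def should_stop(q2, max_iters, history, patience=3):
    """Step (5g): stop when q2 reaches Q2 or Accuracy has stopped increasing.

    `history` holds the Accuracy of each finished iteration; `patience` is a
    hypothetical parameter (the source does not define the stopping window).
    """
    if q2 >= max_iters:
        return True
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    return max(history[-patience:]) <= best_before
```

In a training loop, `accuracy` would be evaluated on the c verification samples after each RMSprop update, appended to `history`, and `should_stop` consulted before returning to step (5b).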
(6) Detecting the domain name based on the trained character-level convolutional neural network Char-CNN':
(6a) setting the number of the domain names to be detected as t, and coding each domain name to be detected to obtain t domain name vectors to be detected, wherein t is more than or equal to 1;
(6b) predicting with the t domain name vectors to be detected as the input of the trained character-level convolutional neural network Char-CNN' to obtain a probability set, in which the u-th element is the probability that the category of the u-th domain name to be detected is 1, $1 \leq u \leq t$, and judging whether the decision condition on the u-th probability holds; if so, the u-th domain name to be detected is a DGA domain name; otherwise, the u-th domain name to be detected is a non-DGA domain name.
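Step (6b) amounts to thresholding each predicted probability. A minimal sketch follows; the 0.5 threshold is an assumption borrowed from the Accuracy definition in step (5f), since the decision condition itself is not legible in this text, and `label_domains` is a hypothetical helper name.

```python
def label_domains(domains, probs, threshold=0.5):
    """Pair each domain name with "DGA" or "non-DGA" from its predicted probability.

    The threshold of 0.5 is assumed (taken from step (5f)), not stated for step (6b).
    """
    return [
        (domain, "DGA" if p > threshold else "non-DGA")
        for domain, p in zip(domains, probs)
    ]
```

For example, `label_domains(["a.com", "b.com"], [0.9, 0.5])` marks the first name as DGA and the second as non-DGA, because a probability of exactly 0.5 does not exceed the threshold.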
The process of domain name coding involved in the above steps is: firstly, establishing mapping from characters to numbers according to an effective character set in a domain name, then traversing the characters in the domain name in sequence, converting the characters into corresponding numbers one by one, and finally filling 0 to obtain domain name vectors with the same length; the process of domain name decoding is as follows: firstly, mapping from numbers to characters is established according to an effective character set in a domain name, then, numbers in a vector are traversed in sequence, non-0 numbers are converted into corresponding characters one by one, and finally, the domain name is obtained.
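The encoding and decoding procedure described above can be sketched as follows. The character set (lowercase letters, digits and the hyphen) is an assumption, since the effective character set is not listed in this passage; the fixed length of 63 follows the sequence length given for the embedding layer; the names `VALID_CHARS`, `encode` and `decode` are illustrative.

```python
import string

# ASSUMED valid character set; the source only says "effective character set".
VALID_CHARS = string.ascii_lowercase + string.digits + "-"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(VALID_CHARS)}  # 0 is reserved for padding
ID_TO_CHAR = {i: ch for ch, i in CHAR_TO_ID.items()}
MAX_LEN = 63  # sequence length used by the embedding layer

def encode(domain, max_len=MAX_LEN):
    """Map each character to its number, then pad with 0 up to max_len."""
    ids = [CHAR_TO_ID[ch] for ch in domain.lower()]  # KeyError on characters outside the set
    if len(ids) > max_len:
        raise ValueError("domain name longer than max_len")
    return ids + [0] * (max_len - len(ids))

def decode(vector):
    """Map every non-zero number back to its character and join the result."""
    return "".join(ID_TO_CHAR[i] for i in vector if i != 0)
```

Because 0 is used only for padding, `decode(encode(name))` returns the original (lowercased) name for any name built from the assumed character set.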
The technical effects of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions and contents:
During the simulation experiments, training sample set A consists of the first 600000 popular domain names selected in order from the popular domain name set Alexa. Training sample set B consists of 80000 benign domain names randomly selected from the benign domain name set TRANCO, 80000 DGA domain names randomly selected from the DGA domain name set DGArchive, and the labels corresponding to these domain names. The verification sample set consists of 20000 benign domain names randomly selected from TRANCO, 20000 DGA domain names randomly selected from DGArchive, and the corresponding labels. The number of training iterations is 2000. The domain names to be detected comprise 1000 low-randomness DGA domain names and 1000 high-randomness DGA domain names. The hardware platform is an Intel Core i7-7700K @4.50GHz CPU, 8GB of RAM and an NVIDIA GeForce GTX2080 GPU, and the operating system is Ubuntu 16.04 LTS. The simulation software platform is Python 3.6.5, TensorFlow 1.3 and Keras 2.2.1.
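The sample-set sizes above are consistent with M = N = 100000 and α = 0.8 in step (1b); this reading is inferred from the counts rather than stated in this passage. A minimal sketch of such a labeled split is given below, with `split_labeled` and its `seed` parameter as hypothetical names.

```python
import random

def split_labeled(benign, dga, alpha, seed=0):
    """Split benign (label 0) and DGA (label 1) names into training set B and a verification set.

    A fraction alpha of each class goes to training set B and the rest to the
    verification set, mirroring step (1b). The seed only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    benign, dga = list(benign), list(dga)
    rng.shuffle(benign)
    rng.shuffle(dga)
    n_benign = round(alpha * len(benign))
    n_dga = round(alpha * len(dga))
    train = [(d, 0) for d in benign[:n_benign]] + [(d, 1) for d in dga[:n_dga]]
    verify = [(d, 0) for d in benign[n_benign:]] + [(d, 1) for d in dga[n_dga:]]
    return train, verify
```

With 100000 names per class and `alpha=0.8`, this yields 80000 + 80000 training samples and 20000 + 20000 verification samples.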
Simulation 1: the detection recall rate for low-randomness DGA domain names achieved by the present invention is compared with that of the existing integrated DGA domain name detection method based on deep learning; the results are shown in Table 1.
Simulation 2: the training time of the detection model of the present invention is compared with that of the existing integrated DGA domain name detection method based on deep learning; the results are shown in Table 2.
2. Simulation result analysis:
TABLE 1
TABLE 2
Training time of the prior-art detection model | Training time of the detection model of the invention |
---|---|
724 min | 482 min |
As can be seen from Table 1, compared with the existing integrated DGA domain name detection method based on deep learning, the DGA domain name detection method based on GAN and Char-CNN provided by the invention improves the detection recall rate for low-randomness DGA domain names by 28.3% while maintaining the detection recall rate for conventional high-randomness DGA domain names. This indicates that the proposed method extracts features well, enriches the training sample set and reduces the error rate of the detection model, thereby improving the detection recall rate for low-randomness DGA domain names, which is of practical significance.
As can be seen from Table 2, compared with the existing integrated DGA domain name detection method based on deep learning, the DGA domain name detection method based on GAN and Char-CNN provided by the invention shortens the training time of the detection model by 242 minutes (from 724 min to 482 min). This indicates that the proposed method has fewer hyper-parameters to compute and that the residual blocks in Char-CNN, with their simple structure, learn quickly, which shortens the training time of the detection model.
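As a quick check of the Table 2 figures, the absolute saving and the corresponding relative reduction can be worked out as follows; the relative figure is derived here from the two reported times and is not stated in the source.

```latex
724\ \text{min} - 482\ \text{min} = 242\ \text{min},
\qquad
\frac{242}{724} \approx 0.334 \quad (\text{about } 33.4\%)
```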
The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way, and it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the principles and arrangements of the invention, but such changes and modifications are within the scope of the invention as defined by the appended claims.
Claims (7)
1. A DGA domain name detection method based on GAN and Char-CNN is characterized by comprising the following steps:
(1) acquiring a training sample set and a verification sample set:
(1a) sequentially selecting the first L popular domain names from the popular domain name set Alexa to form a training sample set A, where $L \geq 600000$;
(1b) randomly selecting M benign domain names of category 0 from the benign domain name set TRANCO and labeling the category of each benign domain name, randomly selecting N DGA domain names of category 1 from the DGA domain name set DGArchive and labeling the category of each DGA domain name, then combining $\alpha M$ benign domain names, $\alpha N$ DGA domain names and the labels corresponding to these domain names into a training sample set B, and combining the remaining $M - \alpha M$ benign domain names, the remaining $N - \alpha N$ DGA domain names and the corresponding labels into a verification sample set, where $M \geq 100000$, $N \geq 100000$, and $0.6 \leq \alpha \leq 0.8$;
(2) constructing a generative adversarial network GAN and a character-level convolutional neural network Char-CNN:
constructing a generative adversarial network GAN comprising a generator network and a discriminator network, wherein the generator network comprises a fully-connected layer, a plurality of residual blocks, a one-dimensional convolutional layer and an activation layer; the discriminator network comprises a one-dimensional convolutional layer, a plurality of residual blocks and a fully-connected layer;
constructing a character-level convolutional neural network Char-CNN comprising an embedding layer, a plurality of one-dimensional convolutional layers, a plurality of activation layers, a plurality of one-dimensional max pooling layers, a plurality of residual blocks, a Dropout layer and a plurality of fully-connected layers;
(3) performing iterative training on the generative adversarial network GAN:
(3a) let the iteration count be $q_1$ and the maximum number of iterations be $Q_1$, where $Q_1 \geq 2000$, and set $q_1 = 0$;
(3b) taking random noise $noise_1$ as the input of the generator network and computing m adversarial domain name vectors, and at the same time encoding m popular domain names randomly selected from the training sample set A to obtain m popular domain name vectors, where $64 \leq m \leq L$;
(3c) predicting with the m adversarial domain name vectors and the m popular domain name vectors as the input of the discriminator network to obtain a probability set containing, for the i-th adversarial domain name vector, the probability that it originates from the training sample set A and, for the j-th popular domain name vector, the probability $d_j$ that it originates from the training sample set A, where $1 \leq i \leq m$ and $1 \leq j \leq m$;
(3e) training the generative adversarial network GAN using the Adam algorithm with $loss_g$ and $loss_d$, and judging whether $q_1 = Q_1$ holds; if so, obtaining the trained generative adversarial network GAN'; otherwise, letting $q_1 = q_1 + 1$ and returning to step (3b);
(4) obtaining an augmented training set:
(4a) taking random noise $noise_2$ as the input of the trained generative adversarial network GAN' and computing P adversarial domain name vectors, and decoding each adversarial domain name vector to obtain P adversarial domain names of category 1, where $20000 \leq P \leq L$;
(4b) labeling the category of each adversarial domain name, and adding the P adversarial domain names and their labels to the training sample set B to obtain the augmented training set;
(5) performing iterative training on a character-level convolutional neural network Char-CNN:
(5a) let the iteration count be $q_2$ and the maximum number of iterations be $Q_2$, where $Q_2 \geq 1000$, and set $q_2 = 0$;
(5b) encoding n domain names randomly selected from the augmented training set to obtain n domain name vectors, and predicting with the n domain name vectors as the input of the character-level convolutional neural network Char-CNN to obtain a probability set $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, where $p_k$ is the probability that the category of the k-th domain name is 1, $1 \leq k \leq n$, and $32 \leq n \leq (\alpha M + \alpha N + P)$;
(5c) calculating the loss of the character-level convolutional neural network Char-CNN according to $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$;
(5d) training the character-level convolutional neural network Char-CNN using the RMSprop algorithm and the loss value to obtain a trained Char-CNN model $\text{Char-CNN}_{q_2}$;
(5e) encoding c verification domain names randomly selected from the verification sample set to obtain c verification domain name vectors, and predicting with the c verification domain name vectors as the input of $\text{Char-CNN}_{q_2}$ to obtain a probability set, in which the v-th element is the probability that the category of the v-th verification domain name is 1, $1 \leq v \leq c$, and $32 \leq c \leq (M - \alpha M + N - \alpha N)$;
(5g) judging whether $q_2 = Q_2$ holds or whether Accuracy no longer increases; if so, obtaining the trained character-level convolutional neural network Char-CNN'; otherwise, letting $q_2 = q_2 + 1$ and returning to step (5b);
(6) detecting the domain name based on the trained character-level convolutional neural network Char-CNN':
(6a) setting the number of the domain names to be detected as t, and coding each domain name to be detected to obtain t domain name vectors to be detected, wherein t is more than or equal to 1;
(6b) predicting with the t domain name vectors to be detected as the input of the trained character-level convolutional neural network Char-CNN' to obtain a probability set, in which the u-th element is the probability that the category of the u-th domain name to be detected is 1, $1 \leq u \leq t$, and judging whether the decision condition on the u-th probability holds; if so, the u-th domain name to be detected is a DGA domain name; otherwise, the u-th domain name to be detected is a non-DGA domain name.
2. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the generator network, the discriminator network and the character-level convolutional neural network Char-CNN in step (2) each comprise residual blocks, each residual block comprising 2 activation layers and 2 one-dimensional convolutional layers arranged as: first activation layer → first one-dimensional convolutional layer → second activation layer → second one-dimensional convolutional layer, wherein the activation function of the activation layers is ReLU; the output space dimension of the one-dimensional convolutional layers is 128, the convolution kernel size is 5, and the stride of the convolution kernel is 1 character; the input x of the first activation layer and the output f(x) of the second one-dimensional convolutional layer are added through a skip connection, so that the objective function finally learned by the residual block is h(x), with $h(x) = \eta f(x) + x$, where $\eta$ is a weight coefficient and $0 \leq \eta \leq 1$.
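The residual block recited in claim 2 can be illustrated with a small pure-Python sketch of $h(x) = \eta f(x) + x$. It is a sketch under assumptions, not the patented implementation: zero "same" padding is assumed so that f(x) keeps the sequence length of x, the input and output channel counts must match for the addition, and the helper names and weight layout `[kernel][c_in][c_out]` are illustrative.

```python
def relu(x):
    """Element-wise ReLU on a [length][channels] nested list."""
    return [[max(0.0, v) for v in row] for row in x]

def conv1d_same(x, weights, bias):
    """1-D convolution with stride 1 and zero 'same' padding (padding mode is ASSUMED).

    x: [length][c_in]; weights: [kernel][c_in][c_out]; bias: [c_out].
    """
    length, kernel = len(x), len(weights)
    c_in, c_out = len(weights[0]), len(weights[0][0])
    pad = kernel // 2
    out = []
    for t in range(length):
        row = list(bias)
        for k in range(kernel):
            s = t + k - pad  # input position seen by kernel tap k
            if 0 <= s < length:
                for i in range(c_in):
                    xi = x[s][i]
                    for o in range(c_out):
                        row[o] += xi * weights[k][i][o]
        out.append(row)
    return out

def residual_block(x, conv1, conv2, eta=0.3):
    """h(x) = eta * f(x) + x, with f = conv2(relu(conv1(relu(x)))).

    conv1 and conv2 are (weights, bias) pairs; eta is the weight coefficient.
    """
    f = conv1d_same(relu(x), *conv1)
    f = conv1d_same(relu(f), *conv2)
    return [[eta * fv + xv for fv, xv in zip(frow, xrow)] for frow, xrow in zip(f, x)]
```

With all convolution weights set to zero the block reduces to the identity, which is the property that lets the skip connection pass the input through unchanged when f(x) contributes nothing.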
3. The DGA domain name detection method based on GAN and Char-CNN as claimed in claim 1, wherein the generative adversarial network GAN and the character-level convolutional neural network Char-CNN in step (2) have the following specific structures:
in the generative adversarial network GAN, the generator network and the discriminator network each include 5 residual blocks, where:
the specific structure of the generator network is: fully-connected layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → one-dimensional convolutional layer → activation layer, wherein the fully-connected layer has an input space dimension of 128 and an output space dimension of 128 × 63; the weight coefficients of all the residual blocks are 0.3; the output space dimension of the one-dimensional convolutional layer is 38, the convolution kernel size is 1, and the stride of the convolution kernel is 1 character; the activation function of the activation layer is Softmax;
the specific structure of the discriminator network is: one-dimensional convolutional layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → fully-connected layer, wherein the input space dimension of the one-dimensional convolutional layer is 38, the output space dimension is 128, the convolution kernel size is 1, and the stride of the convolution kernel is 1 character; the weight coefficients of all the residual blocks are 0.3; the output space dimension of the fully-connected layer is 1;
the character-level convolutional neural network Char-CNN contains 2 one-dimensional convolutional layers, 4 activation layers, 2 one-dimensional max pooling layers, 3 residual blocks and 2 fully-connected layers, and the specific structure of the Char-CNN is: embedding layer → first one-dimensional convolutional layer → first activation layer → first one-dimensional max pooling layer → second one-dimensional convolutional layer → second activation layer → second one-dimensional max pooling layer → first fully-connected layer → first residual block → second residual block → third activation layer → Dropout layer → second fully-connected layer → fourth activation layer, wherein the embedding layer has an input space dimension of 38, an output space dimension of 128, and a sequence length of 63; the output space dimension of all the one-dimensional convolutional layers is 128, the stride of the convolution kernel is 1 character, the convolution kernel size of the first one-dimensional convolutional layer is 3, and the convolution kernel size of the second one-dimensional convolutional layer is 2; the activation functions of the first, second and third activation layers are all ThresholdReLU, and the activation function of the fourth activation layer is Sigmoid; all the one-dimensional max pooling layers use "same" padding with a pooling window size of 2; the weight coefficients of all the residual blocks are 0.3; the drop rate of the Dropout layer is 0.5; the output space dimension of the first fully-connected layer is 64 and the output space dimension of the second fully-connected layer is 1.
6. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the detection Accuracy of the c verification samples in step (5f) is calculated by the following formula:
$$\mathrm{Accuracy} = \frac{tp + tn}{c}$$
wherein tp is the number of samples, among the c verification samples, whose true category is 1 and whose predicted probability of category 1 is greater than 0.5; tn is the number of samples, among the c verification samples, whose true category is 0 and whose predicted probability of category 1 is not greater than 0.5.
7. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the random noise $noise_1$ in step (3b) and the random noise $noise_2$ in step (4a) are both generated by using the random_normal function contained in the third-party library NumPy in the Python language.
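Drawing the generator's noise input can be sketched as follows. NumPy exposes normal sampling as `numpy.random.normal`; to stay free of third-party packages, this sketch swaps in the standard library's `random.gauss` instead. The zero mean and unit standard deviation are assumptions (the claim gives no distribution parameters), the dimension of 128 follows the input dimension of the generator's fully-connected layer, and `make_noise` is a hypothetical name.

```python
import random

def make_noise(batch_size, dim=128, seed=None):
    """Return batch_size noise vectors of length dim drawn from N(0, 1).

    Mean 0 and standard deviation 1 are ASSUMED; random.gauss stands in for
    NumPy's normal sampler so the sketch needs only the standard library.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]
```

In steps (3b) and (4a), `batch_size` would correspond to m and P respectively.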
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010007697.0A CN111209497B (en) | 2020-01-05 | 2020-01-05 | DGA domain name detection method based on GAN and Char-CNN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111209497A (en) | 2020-05-29 |
CN111209497B (en) | 2022-03-04 |
Family
ID=70788417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010007697.0A Active CN111209497B (en) | 2020-01-05 | 2020-01-05 | DGA domain name detection method based on GAN and Char-CNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111209497B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112116601B (en) * | 2020-08-18 | 2023-04-28 | 河南大学 | Compressed sensing sampling reconstruction method and system based on generation of countermeasure residual error network |
CN112019651B (en) * | 2020-08-26 | 2021-11-23 | 重庆理工大学 | DGA domain name detection method using depth residual error network and character-level sliding window |
CN112101464B (en) * | 2020-09-17 | 2024-03-15 | 西安锐思数智科技股份有限公司 | Deep learning-based image sample data acquisition method and device |
CN112104674B (en) * | 2020-11-17 | 2021-05-11 | 鹏城实验室 | Attack detection recall rate automatic test method, device and storage medium |
CN112527547B (en) * | 2020-12-17 | 2022-05-17 | 中国地质大学(武汉) | Mechanical intelligent fault prediction method based on automatic convolution neural network |
CN112765319B (en) * | 2021-01-20 | 2021-09-03 | 中国电子信息产业集团有限公司第六研究所 | Text processing method and device, electronic equipment and storage medium |
CN112953914A (en) * | 2021-01-29 | 2021-06-11 | 浙江大学 | DGA domain name detection and classification method and device |
CN113673680B (en) * | 2021-08-20 | 2023-09-15 | 上海大学 | Model verification method and system for automatically generating verification properties through an antagonism network |
CN113709152B (en) * | 2021-08-26 | 2022-11-25 | 东南大学 | Antagonistic domain name generation model with high-resistance detection capability |
CN114006752A (en) * | 2021-10-29 | 2022-02-01 | 中电福富信息科技有限公司 | DGA domain name threat detection system based on GAN compression algorithm and training method thereof |
CN114021698A (en) * | 2021-10-30 | 2022-02-08 | 河南省鼎信信息安全等级测评有限公司 | Malicious domain name training sample expansion method and device based on capsule generation countermeasure network |
CN113806338B (en) * | 2021-11-18 | 2022-02-18 | 深圳索信达数据技术有限公司 | Data discrimination method and system based on data sample imaging |
CN114782961B (en) * | 2022-03-23 | 2023-04-18 | 华南理工大学 | Character image augmentation method based on shape transformation |
CN115913764A (en) * | 2022-12-14 | 2023-04-04 | 国家计算机网络与信息安全管理中心甘肃分中心 | Malicious domain name training data generation method based on generation of countermeasure network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109391602A (en) * | 2017-08-11 | 2019-02-26 | 北京金睛云华科技有限公司 | A kind of zombie host detection method |
CN110113327A (en) * | 2019-04-26 | 2019-08-09 | 北京奇安信科技有限公司 | A kind of method and device detecting DGA domain name |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2754097C (en) * | 2002-01-28 | 2013-12-10 | Nichia Corporation | Nitride semiconductor device having support substrate and its manufacturing method |
Non-Patent Citations (2)
Title |
---|
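MaskDGA: a black-box evasion technique against DGA classifiers and adversarial defenses; Lior Sidi et al.; 《arXiv preprint arXiv》; 20191231; full text * |
Malicious domain name training data generation based on generative adversarial networks; Yuan Chen (袁辰) et al.; 《Application Research of Computers》; 20191231; full text * |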
Also Published As
Publication number | Publication date |
---|---|
CN111209497A (en) | 2020-05-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||