CN112712163B - Coverage rate-based neural network effective data enhancement method - Google Patents
- Publication number: CN112712163B (application CN202011562234.7A)
- Authority: CN (China)
- Prior art keywords: data set, neural network, coverage rate, network model, coverage
- Legal status: Active (the status listed is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G — PHYSICS; G06 — COMPUTING, CALCULATING OR COUNTING; G06N — Computing arrangements based on specific computational models
  - G06N3/00 — Computing arrangements based on biological models
    - G06N3/02 — Neural networks
      - G06N3/04 — Architecture, e.g. interconnection topology; G06N3/045 — Combinations of networks
      - G06N3/08 — Learning methods
Abstract
The invention discloses a coverage rate-based neural network effective data enhancement method, which comprises the following steps: 1) selecting a training data set according to the neural network model to be trained, and selecting a plurality of coverage rate indexes for that model; 2) training the model on the training data set while counting the number of activated neurons corresponding to each coverage rate index; 3) calculating each coverage rate index value of the training data set from those neuron counts, then selecting the coverage rate index most correlated with the model's accuracy as the evaluation index; 4) expanding the training data set to obtain an expansion data set; 5) using the model trained in step 2), evaluating the evaluation index on both the training data set and the expansion data set, and thereby determining the effective data set.
Description
Technical Field
The invention relates to a coverage rate-based neural network effective data enhancement method, and belongs to the technical field of computer software.
Background
The best way to improve the generalization ability of a deep learning model (i.e., a neural network) is to train it on more data. In practice, however, the network to be trained is often large while the available data is insufficient, which can open a large gap between training and testing performance and leave the trained network impractical. The main current remedy for insufficient training data is data enhancement: expanding the data set and adding the new samples to the training set. For image data, enhancement methods include image flipping, rotation, translation, contrast change, saturation change, blurring, generative adversarial networks (GAN), and the like; for text data, they include text mutation, synonym replacement, and the like.
Data enhancement increases the number of training samples, effectively relieves overfitting, and gives the model stronger generalization ability. However, enlarging the data set introduces new problems: the generated data is uncertain and contains much noisy data, which may degrade model performance. Data enhancement therefore needs a data set screening and judgment mechanism to obtain a more effective data set.
Disclosure of Invention
A data set expanded by data enhancement usually contains noisy data, so a data set screening mechanism must be introduced to quickly obtain an effective data set; training the neural network on this enhanced effective data then improves the accuracy of the pre-trained model.
To this end, the invention organically combines coverage rate characteristics with a data enhancement mechanism: after the training data set is enhanced, effective data are screened out using a coverage rate index, improving the accuracy of the pre-trained model.
Neural network coverage rate is mainly used to evaluate how thoroughly a network has been exercised: at a given accuracy, the higher a model's coverage, the more fully the network has been verified. At the data-set level, coverage shows whether the nodes of the network are activated, and activated nodes indicate a more sufficient test. At the single-sample level, the more neurons a sample activates, the more features it contains and the more complex it is; such samples usually lie near the classification boundary and are very important for training neural networks. Coverage rate characteristics and a data enhancement mechanism can therefore be effectively combined into a coverage rate-based neural network effective data enhancement method.
The coverage rate-based neural network effective data enhancement method of the invention has a flow as shown in fig. 1, and comprises the following steps:
step 1: selecting a training data set according to a neural network model to be trained; training parameters of the neural network model (referred to as neural network 1) using a training data set; wherein, the neural network 1 is a pre-trained neural network model;
and 2, step: expanding the training data set to obtain an expanded data set;
and 3, step 3: screening a plurality of coverage rate indexes, selecting the coverage rate index most relevant to the accuracy of the neural network model as a judgment basis, comparing the coverage rate of a training data set with the coverage rate of an expansion data set, and screening the coverage rate indexes to obtain an effective data set, namely taking the data set with high coverage rate as the effective data set; preferably, the coverage rate index of the text data comprises coverage rate of a hidden unit, coverage rate of a positive sequence and coverage rate of a negative sequence;
and 4, step 4: the pre-trained model is retrained using the active data set, resulting in the neural network 2.
Further, expanding the data set in step 2 specifically comprises:
For image data, the expansion methods comprise geometric transformations (flipping, rotation, cropping, deformation, scaling, etc.), color transformations (noise addition, blurring, color change, erasing, filling, etc.), and generative methods (GAN, etc.); for text data, the expansion methods comprise text mutation, synonym substitution, and the like.
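As an illustration only (the function names and parameters below are our own, not part of the patent), a minimal numpy sketch of a few of the geometric and color transformations listed above:

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_horizontal(img: np.ndarray) -> np.ndarray:
    # Geometric transformation: mirror the image left-right.
    return img[:, ::-1]

def rotate_90(img: np.ndarray) -> np.ndarray:
    # Geometric transformation: rotate the image by 90 degrees.
    return np.rot90(img)

def add_noise(img: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    # Color transformation: add Gaussian noise, clipping back to [0, 1].
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

img = rng.random((4, 4))                       # stand-in for a real image
expanded = [flip_horizontal(img), rotate_90(img), add_noise(img)]
```

Each transformed array would be added to the expansion data set and later screened by coverage rate.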
Further, step 3 specifically comprises:
Step 3.1: preferably, the invention provides three coverage rate indexes for a pre-trained language model: positive sequence coverage rate, negative sequence coverage rate, and hidden unit coverage rate;
the hidden layer formula based on the pre-training language model is shown as formula (1), wherein the pre-training language model is based on a Transformer model (Transformer-XL);
wherein Attention () is a formal representation of the content stream in the dual stream self-Attention mechanism in the model, letIs X Z<t Where m is the number of encoder layers, Z t Represents a sequence [ 1.,. T., T of text length T]All possible orders of (i.e. Z) t Set of ordering methods for sequences of text length T), and Z is one of the ordering methods, Z belonging to Z t ,z t Is Z t The t-th element (i.e. Z) t T ordering method) in (1), z <t Including all elements preceding the t-th element.
Wherein h is Zt Is a representation of a hidden layer, h Zt (i) Indicating a hidden layer h Zt The ith constituent element of (1); hidden unit coverage refers to the observable state change of the hidden layer, when the change is larger than a threshold value, the neuron is considered to be activated, and the coverage is used for calculating the ratio of the activated neurons; the sequence coverage range reflects information about the state of the successive hidden layers, which is represented by the element z t Number of forward sequences consisting of neurons activated during forward propagationAnd the element z t Negative sequence number of neurons activated in reverse propagationAnd (4) forming.
The hidden unit coverage rate formula is formula (2):

$$\mathrm{HUC} = \frac{\mathrm{NUM\_hidden\_activated}}{\mathrm{NUM\_hidden}} \tag{2}$$

where NUM_hidden_activated is the number of activated hidden units and NUM_hidden is the total number of hidden units of the model.
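A minimal sketch of formula (2), assuming the per-unit state changes have already been collected into an array (variable names are ours):

```python
import numpy as np

def hidden_unit_coverage(state_changes: np.ndarray, threshold: float) -> float:
    # A hidden unit counts as activated when its observed state change
    # exceeds the threshold; the coverage rate is the activated ratio,
    # i.e. NUM_hidden_activated / NUM_hidden.
    activated = np.abs(state_changes) > threshold
    return float(activated.sum()) / state_changes.size

changes = np.array([0.9, 0.1, 0.7, 0.05])  # one value per hidden unit
print(hidden_unit_coverage(changes, threshold=0.5))  # 0.5 (2 of 4 units)
```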
The Positive Sequence Coverage (PSC) and Negative Sequence Coverage (NSC) obtained from formula (2) are:

$$\mathrm{PSC} = \frac{\sum_{t} \mathrm{pos}(z_t)}{\mathrm{NUM\_pos}} \tag{3}$$

$$\mathrm{NSC} = \frac{\sum_{t} \mathrm{neg}(z_t)}{\mathrm{NUM\_neg}} \tag{4}$$

where NUM_pos is the total number of forward sequences (obtained by summing over all $z_t$) and NUM_neg is the total number of negative sequences (likewise summed over all $z_t$).
Step 3.2: a comparison of coverage is shown in fig. 2;
Step 3.2.1: calculate the correlation coefficient between each coverage rate index of the training data set and the model accuracy rate; the Pearson correlation coefficient can be used.
The model accuracy rate is the proportion of samples the model predicts correctly among all samples.
Step 3.2.2: select the coverage rate index with the highest correlation coefficient with the model accuracy rate as the judgment condition for finding effective data in the next step;
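Steps 3.2.1 and 3.2.2 can be sketched as follows, assuming coverage index values and accuracies have been measured over several runs (all names and numbers here are illustrative, not from the patent):

```python
import numpy as np

def select_coverage_index(index_values: dict, accuracies: list) -> str:
    # Pick the coverage rate index whose measured values correlate most
    # strongly (by absolute Pearson coefficient) with model accuracy.
    acc = np.asarray(accuracies, dtype=float)
    best, best_r = None, -1.0
    for name, values in index_values.items():
        r = abs(np.corrcoef(np.asarray(values, dtype=float), acc)[0, 1])
        if r > best_r:
            best, best_r = name, r
    return best

measurements = {                          # illustrative numbers only
    "hidden_unit": [0.30, 0.45, 0.50, 0.62],   # tracks accuracy closely
    "positive_seq": [0.80, 0.20, 0.70, 0.10],  # weakly related
}
accs = [0.61, 0.70, 0.74, 0.82]
print(select_coverage_index(measurements, accs))  # hidden_unit
```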
Step 3.2.3: with the trained neural network 1, compare the coverage rate of the training data with that of the expansion data; if the expansion data's coverage rate is greater, keep the expansion data as effective data; otherwise it is failure data. The method thus filters the expansion data and selects more effective data for data enhancement. Experiments verified that the screened data yield a better enhancement effect than the raw expansion data.
Here the coverage rate index is the one selected in the screening step as most correlated with model accuracy; the training data are the data used to initially train neural network 1; and the expansion data are drawn from the expansion data set generated from the training data. For expansion data generated from a single training sample, such as by geometric transformation or color transformation of an image or by synonym replacement in text, the coverage rate comparison is between the coverage rate of that single training sample and the coverage rate of the expansion sample generated from it. For expansion data produced by a generative method such as a GAN, the comparison is between the average coverage rate of the training samples of the same class as the expansion data and the coverage rate of a single expansion sample.
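The per-sample screening rule can be sketched as below; `toy_coverage` is only a stand-in for the real coverage metric evaluated on trained neural network 1:

```python
def filter_expansion_data(coverage_fn, pairs):
    # Keep an expansion sample only when its coverage rate exceeds that of
    # the training sample it was generated from (the step 3.2.3 rule).
    return [aug for orig, aug in pairs if coverage_fn(aug) > coverage_fn(orig)]

def toy_coverage(sample: str) -> float:
    # Stand-in for the real metric evaluated on trained neural network 1:
    # here, samples with more distinct words score higher.
    return len(set(sample.split())) / 10.0

pairs = [
    ("a cat sat", "a big cat sat down"),  # richer expansion -> kept
    ("a dog ran", "a dog ran"),           # no coverage gain -> discarded
]
print(filter_expansion_data(toy_coverage, pairs))  # ['a big cat sat down']
```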
The invention has the following positive effects:
(1) Using coverage rate to screen the expansion data set quickly yields effective data for model training and improves the accuracy of the pre-trained model;
(2) Screening a plurality of coverage rate indexes by their correlation coefficient with model accuracy increases the relevance of the effective data to the model and the efficiency of data set screening;
(3) Three coverage rate indexes are proposed for pre-trained language models, which can improve the accuracy of such models.
Drawings
FIG. 1 is a flow chart of a coverage-based neural network effective data enhancement method;
fig. 2 is a flowchart of a comparison of coverage indicators.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The technical scheme of the invention organically combines coverage rate characteristics with a data enhancement mechanism: after the data set is enhanced, an effective data set is screened out using a coverage rate index, improving the accuracy of the pre-trained model.
The invention provides a coverage rate-based neural network effective data enhancement method, and the specific technical scheme is as shown in figure 1:
step 1: training parameters of the neural network 1 using a training data set; wherein the model of the neural network 1 is a pre-training model;
step 2: expanding the training data set to obtain an expanded data set;
and 3, step 3: screening a plurality of coverage rate indexes, selecting the coverage rate index most relevant to the accuracy of the model as a judgment basis, comparing the coverage rates of the training data set and the expansion data set, and obtaining an effective data set after screening the coverage rate indexes; preferably, the coverage rate index of the text data comprises coverage rate of a hidden unit, coverage rate of a positive sequence and coverage rate of a negative sequence;
and 4, step 4: the pre-trained model is retrained using the active data set, resulting in the neural network 2.
In one embodiment, the story-ending prediction task (Story Cloze Test, SCT) uses a pre-trained language model to predict the ending of a story: given the context of a four-sentence story, the model must select the correct ending from two candidates (one wrong, one correct). In the text field, text mutation and synonym replacement are commonly used to generate discrete data. The story-ending prediction task was proposed as an evaluation task for automatic event-chain construction: given a series of event chains with one segment deleted, the trained model selects a prediction from a candidate data set.
Further, the pre-training model in step 1 specifically includes:
the pre-training language model may be selected from a Long Short Term Memory (LSTM) model, a transformer (transformer) model, an XLNet model, etc., and in one embodiment, the XLNet model is selected during a model training phase. The biggest advantage of XLNET is that it can learn context information through various permutations of input sequences, the algorithm uses permutation language model to adjust model parameters, and uses a dual-stream auto-attention mechanism to achieve target-sensitive representation to obtain better results; some parameters may be saved to select the best coverage criteria and model in preparation for subsequent calculations after training.
Further, the data set and data expansion in step 2 specifically include:
For image data, the expansion methods comprise geometric transformations (flipping, rotation, cropping, deformation, scaling, etc.), color transformations (noise addition, blurring, color change, erasing, filling, etc.), and generative methods (GAN, etc.); for text data, the expansion methods comprise text mutation, synonym substitution, and the like.
In one embodiment, the method is illustrated with text data, using the story-ending prediction task data set (SCT v1.0) as the initial training data set. Each sample contains four short sentences describing a story, two candidate answers as story endings, and a label indicating the correct answer. The data format is shown in Table 1.
Table 1 story-ending prediction data example
| Context | Answer 1 | Answer 2 | Label |
|---|---|---|---|
| C1;C2;C3;C4. | A1 | B1 | 1 |
| C5;C6;C7;C8. | A2 | B2 | 2 |
A variety of data-generation techniques may be used in the data-generation phase. For discrete text, we use two techniques to generate enough data from the training data: text mutation and synonym substitution. Text mutation alters the original training data by random insertion, random swap, and random deletion, though in this way the meaning of the story may change. The other approach is synonym substitution, i.e., using Paragram-SL999 to generate close paraphrases of words. In total, the generated data is five times the size of the original training data set.
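The three text-mutation operations (random insertion, random swap, random deletion) can be sketched as follows; this is an illustrative sketch, not the patent's implementation:

```python
import random

def random_swap(words, rng):
    # Text mutation: exchange two randomly chosen word positions.
    w = list(words)
    i, j = rng.sample(range(len(w)), 2)
    w[i], w[j] = w[j], w[i]
    return w

def random_deletion(words, rng, p=0.2):
    # Text mutation: drop each word with probability p, keeping at least one.
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(list(words))]

def random_insertion(words, rng):
    # Text mutation: duplicate a random word at a random position.
    w = list(words)
    w.insert(rng.randrange(len(w) + 1), rng.choice(w))
    return w

rng = random.Random(42)
story = "the cat sat on the mat".split()
for mutate in (random_swap, random_deletion, random_insertion):
    print(" ".join(mutate(story, rng)))
```

As the description notes, such mutations may change the story's meaning, which is exactly why the coverage-based screening step is needed afterwards.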
Further, step 3 specifically comprises:
Step 3.1: preferably, the invention provides three coverage rate indexes for the pre-trained language model: positive sequence coverage rate, negative sequence coverage rate, and hidden unit coverage rate;
the hidden layer formula based on the pre-training model is shown as formula (1), wherein the pre-training model is based on a Transformer model (Transformer-XL);
order toIs X Z<t Where m is the number of encoder layers, Z t Representing a sequence [1,. Ang., T ] of text length T]And Z is one of the ordering methods, Z belongs to Z t ,z t Is the t-th element, z <t Including all previous tuples.
Hidden unit coverage refers to the observable state change of the hidden layer, when the change is larger than a threshold value, the neuron is considered to be activated, and the coverage is used for calculating the ratio of the activated neurons; the sequence coverage range reflects the information about the state of the successive hidden layers, which is covered by the forward sequence coverageAnd negative sequence coverageAnd (4) forming.
The hidden unit coverage rate formula is formula (2):

$$\mathrm{HUC} = \frac{\mathrm{NUM\_hidden\_activated}}{\mathrm{NUM\_hidden}} \tag{2}$$

where NUM_hidden_activated is the number of activated hidden units and NUM_hidden is the total number of hidden units of the model.
The Positive Sequence Coverage (PSC) and Negative Sequence Coverage (NSC) obtained from formula (2) are:

$$\mathrm{PSC} = \frac{\sum_{t} \mathrm{pos}(z_t)}{\mathrm{NUM\_pos}}, \qquad \mathrm{NSC} = \frac{\sum_{t} \mathrm{neg}(z_t)}{\mathrm{NUM\_neg}} \tag{3}$$
step 3.2: comparing the coverage rate;
Step 3.2.1: calculate the correlation coefficient between each coverage rate index of the training data set and the model accuracy rate; the Pearson correlation coefficient can be used.
The model accuracy rate is the proportion of samples the model predicts correctly among all samples:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of samples predicted positive that are actually positive, TN the number predicted negative that are actually negative, FP the number predicted positive that are actually negative, and FN the number predicted negative that are actually positive.
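The accuracy formula as a one-liner:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Accuracy = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=40, tn=35, fp=15, fn=10))  # 0.75
```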
Step 3.2.2: selecting a coverage rate index with the highest coefficient related to the model accuracy rate as a judgment condition for searching effective data in the next step;
step 3.3.3: comparing the coverage rate of the training data and the coverage rate of the expansion data by using the trained neural network 1, and if the coverage rate of the expansion data is greater than that of the training data, keeping the expansion data as an effective data set; otherwise, the expansion data is failure data;
the coverage rate index is the coverage rate index which is selected by the coverage rate screening step and has the highest coefficient related to the model accuracy, the training data is the data of the initial training neural network 1, and the expansion data is the data selected from the expansion data set generated by the training data;
It is worth noting that for image expansion data generated by geometric or color transformation, or text expansion data generated by synonym replacement, the coverage rate comparison is between a single training sample and the corresponding generated expansion sample. Taking text data as an example: if the single training sample is a sentence C1 and C2 is the expansion data obtained from C1, the comparison is between the coverage rates of C1 and C2.
For data produced by a generative method, the comparison is between the average coverage rate of the training-set samples of the same class as the expansion data and the coverage rate of a single expansion sample. Taking a generated image of a cat as an example: all cat images in the training data set form a subset D1, whose average coverage rate is computed; the single expansion sample is a cat image I1 generated by the GAN; and the comparison is between the average coverage rate over D1 and the coverage rate of I1.
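This generative-class comparison can be sketched as below; the precomputed coverage values are illustrative stand-ins, not patent data:

```python
import numpy as np

def keep_generated_sample(coverage_of, same_class_samples, generated) -> bool:
    # GAN-style expansion: keep the generated sample when its coverage rate
    # exceeds the average coverage of same-class training samples (subset D1).
    avg = float(np.mean([coverage_of(s) for s in same_class_samples]))
    return bool(coverage_of(generated) > avg)

# Illustrative precomputed coverage values (assumed, not from the patent).
coverage = {"cat_1": 0.40, "cat_2": 0.50, "gan_cat": 0.52}.get
print(keep_generated_sample(coverage, ["cat_1", "cat_2"], "gan_cat"))  # True
```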
The above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the same, and those skilled in the art can make modifications or equivalent substitutions to the technical solutions of the present invention without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.
Claims (3)
1. A coverage rate-based neural network effective data enhancement method comprises the following steps:
1) Selecting a training data set according to a neural network model to be trained and selecting a plurality of coverage rate indexes for the neural network model; the neural network model to be trained is a language model, the sample data in the training data set are text data, and the coverage rate indexes comprise hidden unit coverage rate, positive sequence coverage rate, and negative sequence coverage rate; the hidden unit coverage rate is $\mathrm{HUC} = \mathrm{NUM\_hidden\_activated} / \mathrm{NUM\_hidden}$, where NUM_hidden_activated represents the number of activated hidden units in the neural network model and NUM_hidden represents the total number of hidden units in the neural network model; the sequence coverage reflects state information of successive hidden layers and is composed of the number $\mathrm{pos}(z_t)$ of forward sequences of neurons activated during forward propagation for element $z_t$ and the number $\mathrm{neg}(z_t)$ of negative sequences of neurons activated during backward propagation, the element $z_t$ being the t-th element of one of all possible orders of the text of length T; the positive sequence coverage rate is $\mathrm{PSC} = \sum_t \mathrm{pos}(z_t) / \mathrm{NUM\_pos}$ and the negative sequence coverage rate is $\mathrm{NSC} = \sum_t \mathrm{neg}(z_t) / \mathrm{NUM\_neg}$, where NUM_pos represents the total number of forward sequences and NUM_neg represents the total number of negative sequences;
2) Training the neural network model to be trained by using the training data set, and counting the number of activated neurons corresponding to different coverage rate indexes in the neural network model during training;
3) Calculating each coverage index value of the training data set according to the number of activated neurons corresponding to each coverage index; then selecting a coverage rate index which is most related to the accuracy of the neural network model as an evaluation index according to each coverage rate index value;
4) Expanding the training data set to obtain an expanded data set;
5) Respectively testing the evaluation index value of the training data set and the evaluation index value of the expansion data set by using the neural network model trained in step 2); if the evaluation index value of the expansion data set is larger than that of the training data set, taking the expansion data set as an effective data set; otherwise, regarding the expansion data set as an invalid data set.
2. The method of claim 1, wherein the coverage index most correlated to the accuracy of the neural network model is selected as the evaluation index according to a pearson correlation coefficient between each coverage index and the accuracy of the neural network model.
3. A method for training a neural network model, wherein the effective data set obtained by the method of claim 1 is used to train the neural network model to be trained.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011562234.7A | 2020-12-25 | 2020-12-25 | Coverage rate-based neural network effective data enhancement method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112712163A | 2021-04-27 |
| CN112712163B | 2022-10-14 |
Family
ID=75546509
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073586B (en) * | 2010-12-23 | 2012-05-16 | 北京航空航天大学 | Gray generalized regression neural network-based small sample software reliability prediction method |
US11568307B2 (en) * | 2019-05-20 | 2023-01-31 | International Business Machines Corporation | Data augmentation for text-based AI applications |
CN111753985B (en) * | 2020-06-28 | 2024-02-23 | 浙江工业大学 | Image deep learning model testing method and device based on neuron coverage rate |
Non-Patent Citations (1)
Title |
---|
Fuzz Testing based Data Augmentation to Improve Robustness of Deep Neural Networks; Xiang Gao et al.; 《ACM》; 2020-10-01; pp. 1-12 * |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |