CN114817546A

CN114817546A - Label noise learning method for taxpayer industry classification

Info

Publication number: CN114817546A
Application number: CN202210498954.4A
Authority: CN
Inventors: 郑庆华; 曹书植; 阮建飞; 赵锐; 董博; 师斌
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-05-09
Filing date: 2022-05-09
Publication date: 2022-07-29
Anticipated expiration: 2042-05-09
Also published as: CN114817546B

Abstract

The invention discloses a label noise learning method for taxpayer industry classification, which comprises the following steps: firstly, extracting text information and non-text information in taxpayer business information, and performing text embedding and non-text coding processing respectively based on an XLNET text pre-training network and a coding technology to obtain characteristic information; secondly, constructing a TextCNN network for taxpayer industry classification, determining the number of layers of the network, the shape of a convolution kernel and the input and output dimensions of each layer according to the characteristic information and the target classification number, connecting an XLNet text pre-training network and the TextCNN network in series, and constructing an end-to-end training device by combining noisy taxpayer industry label data as supervision; thirdly, estimating a conditional transition matrix based on an improved mixed proportion estimation method; and finally, learning network parameters in the training device, and taking the conditional transfer matrix as a linear layer behind the TextCNN network, so as to realize the conversion from noise label prediction to real taxpayer industry label prediction and carry out taxpayer industry classification.

Description

Label noise learning method for taxpayer industry classification

Technical Field

The invention belongs to the technical field of text classification with labeled noise, and particularly relates to a label noise learning method for taxpayer industry classification.

Background

In recent years, market economy continues to prosper, the number of enterprises is increasing, and division of labor of the enterprises is refining. Concomitantly, the upgrade and further construction of tax systems has become an urgent need.

The taxpayer industry classification is a precondition for determining the policy of a taxpayer main body and preferential benefit, and is an important link for tax collection. At present, China mainly divides the taxpayer industry into 20 categories and 97 categories. Due to numerous types, the traditional manual classification method needs to consume a large amount of human resources and is limited by professional knowledge and experience of classifiers, and classification errors, namely label noise of taxpayer industry classification, are inevitably introduced, so that a series of adverse effects are caused to national statistics, tax collection and industrial and commercial management.

In recent years, with the acceleration of the "intelligence +" era, the artificial intelligence industry is rapidly developing and applied to various fields, and possibility is provided for the exploration and development of the wisdom tax. Research on enterprise taxpayer industry classification is the fundamental work of tax source classification management and is a key premise of intelligent tax informatization. Therefore, how to train a classifier based on the existing label noise data by a machine learning means to correctly classify the taxpayer industry becomes a problem to be solved urgently.

The invention relates to a related technical scheme of taxpayer industry classification problems, and the related invention patents comprise the following patents:

document 1: tax payer industry two-level classification method (201910024324.1) based on MIMO recurrent neural network

Document 2: taxpayer industry classification method based on noise label learning (202110201214.5)

Document 1 designs a GRU-based multi-input multi-output neural network structure, establishes a mapping relationship from industry major categories to industry details, and constructs a two-layer classification structure for realizing industry classification of taxpayers. However, this method relies on a strict labeling of the data and lacks practical value in the presence of tag noise.

Document 2 designs a BERT-CNN network for text classification, and constructs classifiers with consistent classifications by using tag noise data based on a semantic clustering method, but the performance limitation of the semantic clustering method introduces new errors into the classifiers.

Aiming at the defects of the technical scheme, the invention aims to construct a classifier based on the consistent risks of the tag noise data without depending on additional manual labeling, overcome the classification deviation caused by adopting a semantic clustering method in the prior art, and ensure that the classifier constructed based on the tag noise data and the classifier constructed by adopting real labeling data have consistent classification risks in a statistical sense.

The core of constructing a risk-consistent classifier based on tag noise data is as follows: a statistically consistent classifier is constructed by evaluating a conditional transition matrix (a matrix made up of the conditional probabilities of true labels given noisy labels). The invention creatively converts the problem of the conditional transition matrix estimation into the problem of the mixed proportion estimation, and obtains an approximate conditional transition matrix by estimating the mixed proportion coefficient. However, the conventional mixed proportion estimation method is only suitable for the two-classification scene and depends on an anchor point (a sample definitely belongs to a certain class), and the taxpayer industry classification problem has a plurality of industry categories and belongs to a multi-classification problem, and the anchor point is difficult to label and acquire. Therefore, it is a major challenge of the present invention to scale the mixture ratio estimation problem from the binary analogy to the multi-classification and overcome the anchor point dependency problem.

Disclosure of Invention

The invention aims to provide a label noise learning method for taxpayer industry classification, which is used for constructing a risk consistency classifier by estimating a conditional transition matrix (a matrix formed by the conditional probability of a real label under the condition of giving a noise label) based on label noise data.

The invention is realized by adopting the following technical scheme:

a label noise learning method for taxpayer industry classification comprises the following steps:

firstly, extracting text information and non-text information in taxpayer business information, and performing text embedding and non-text coding processing respectively based on an XLNET text pre-training network and a coding technology to obtain characteristic information; secondly, constructing a TextCNN network for taxpayer industry classification, determining the number of layers of the network, the shape of a convolution kernel and the input and output dimensions of each layer according to the characteristic information and the target classification number, connecting an XLNet text pre-training network and the TextCNN network in series, and constructing an end-to-end training device by combining noisy taxpayer industry label data as supervision; thirdly, estimating a conditional transition matrix based on an improved mixed proportion estimation method; and finally, learning network parameters in the training device, and taking the conditional transfer matrix as a linear layer behind the TextCNN network, so as to realize the conversion from noise label prediction to real taxpayer industry label prediction and carry out taxpayer industry classification.

A further improvement of the invention is that the method comprises in particular the steps of:

1) taxpayer industry information processing

The taxpayer industry information processing comprises text information processing and non-text information processing, firstly, word segmentation and word embedding are carried out on taxpayer text information based on an XLNET text pre-training network to form corresponding word vectors, then text features are generated by splicing, secondly, numerical value features and category features in the taxpayer non-text information are preprocessed by respectively using a standardization process and a one-hot coding technology, then a linear network layer is established for feature mapping to generate non-text features consistent with text feature dimensions, and finally, the text features and the non-text features are spliced to form feature information;

2) taxpayer industry classification network construction and training device initialization

Constructing a TextCNN network for taxpayer industry classification, wherein the network comprises three layers of a convolutional layer, a pooling layer and a full-connection layer, sequentially determining the layer number of the TextCNN network, the shape of a convolutional core and the input and output dimensions of each layer based on the characteristic information and the target classification number obtained in the step 1), then connecting an XLNet pre-training network with the TextCNN network in series, and constructing an end-to-end training device by taking a noisy taxpayer industry information label as supervision;

3) conditional branch matrix estimation

Estimating a probability density function according to noisy taxpayer industry information data based on a kernel density estimation method, converting a conditional transfer matrix estimation problem into a mixed proportion estimation problem, and solving a corresponding mixed proportion coefficient based on an improved mixed proportion estimation method to obtain a conditional transfer matrix;

4) training device network parameter learning and taxpayer industry classification

And learning network parameters of the training device based on the label noise data, and after the training is finished, adding the estimated conditional transition matrix as a linear conversion layer to the training device to finish the conversion from the noise label prediction to the real label prediction, thereby realizing the tax payer industry classification.

The further improvement of the invention is that in the step 1), the taxpayer industry information processing specifically comprises the following steps:

step 1: taxpayer industry text information preprocessing

Extracting text information of taxpayer industry, deleting special symbols, numbers and meaningless symbols of quantifier words in the text information, and finishing preprocessing the text information of taxpayers;

step 2: text word embedding based on XLNET pre-training network

The method comprises the steps that a text is coded based on an XLNT pre-training network to generate word vectors, an XLNT pre-training model is designed based on a transform, and meanwhile, the relation between two-way contexts is captured, so that the problem that a pre-training stage is inconsistent with a fine-tuning stage caused by a mask mechanism of a bert model is solved, and a double-current self-attention mechanism is used, so that the pre-training effect is more obvious; the XLNET model applied to Chinese uses a 24-layer network structure, and adopts sentencepiec to perform word segmentation; encoding the text characteristics obtained in Step1 by using XLNET of a Chinese version, thereby obtaining a slave word vector;

step 3: taxpayer industry text feature generation

Assuming that the taxpayer has k text features in total, a word element is mapped into a t-dimensional word vector by an XLNT pre-training network, and the ith text feature is recorded to have h _i For each word element, the ith text feature is mapped to an h _i A matrix of x t; splicing the feature matrixes of the text feature mappings, so that the text features of the samples are mapped into one

Generating a taxpayer text feature matrix;

step 4: tax payer industry value feature processing

Standardizing the numerical characteristics of the nontext characteristics of the taxpayers, assuming that n training samples and m numerical characteristics are shared, and recording the value of the jth numerical characteristic of the ith sample as X _ij The mean value of the jth numerical characteristic is mu _j To satisfy

The standard deviation of the jth numerical characteristic is sigma _j Satisfy the following requirements

The numerical characteristic after normalization is

Step 5: tax payer industry category feature processing

Coding the category characteristics in the nontextual characteristics of the taxpayer, and coding and representing the category characteristics by adopting an N-dimensional vector if the category characteristics have N possible values; specifically, the corresponding position of the category feature value is set to be 1, and the rest positions are set to be 0, namely, a one-hot coding method is adopted, after all the category features are coded, the longest coding length in the category features is selected for completion, and all vectors after completion are spliced to form a category feature matrix;

step 6: taxpayer industry non-text feature generation

M normalized numerical features and a shape of v × N were obtained after Step4 and Step5, respectively _max Class feature matrix of (1), where N _max Expressing the longest class code length, then establishing two linear network layers for feature mapping, wherein the first linear network layer is 1 × t in shape and is used for converting the normalized numerical features into a m × t numerical feature matrix, and the second linear network layer is N in shape _max The xt is used for mapping the category characteristics into a upsilon xt category characteristic matrix, and splicing the two mapped characteristic matrices to obtain a final non-text characteristic matrix with the shape of (v + m) xt;

step 7: taxpayer characteristic information generation

Splicing the text feature matrix generated at Step3 and the non-text feature matrix generated at Step6 to generate a shape of

As final characteristic information.

The further improvement of the invention is that in the step 2), the taxpayer industry classification network construction and training device is initialized: establishing a TextCNN network for text classification, wherein the TextCNN network comprises three layers: (1) a convolutional layer, (2) a maximum pooling layer and (3) a full connection layer, wherein an XLNET pre-training network in the step 1) is connected with a TextCNN network in series to construct a training device, and end-to-end training is carried out by taking taxpayer label noise data as supervision; the specific implementation details are as follows:

step 1: taxpayer industry classification network construction

Constructing a TextCNN network for taxpayer industry classification, wherein the TextCNN network comprises three layers, namely a convolution layer, a pooling layer and a full-connection layer; specifically, the convolution layer of the TextCNN uses a convolution kernel with the shape of n × t to perform convolution operation for extracting line features, wherein the values of n are {2, 3,4, 5, 6}, the TextCNN adopts a maximum pooling layer as a pooling layer for maximum value extraction of a feature map after convolution, further compression is performed to extract features, then a full connection layer is established, assuming that the total number of categories to be classified of taxpayer industry classification is c, and if the number of features is s after passing through the maximum pooling layer, the full connection layer with the shape of s × c is established for mapping feature information into a c-dimensional vector, and then taxpayer industry classification is performed;

step 2: training device initialization

Connecting the XLNET text pre-training network in the step 1) with the constructed TextCNN network in series to form a training device; and (3) taking the label noise data of the taxpayer industry as input, predicting the noise label, forming an end-to-end device for training, and initializing the network parameters of the training device.

The invention further improves the method in Step2 in Step 2), wherein the network parameter is alpha, the sample is X, and the noise label is

The set of network parameters is w, and the memory sample X is output as

For

And

making cross entropy loss and adding a regularization term to prevent overfitting, where λ is the regularization term control coefficient, mostThe loss function is minimized and the optimization objective is as follows:

a further improvement of the invention is that, in step 3), the conditional transition matrix estimation: converting a conditional transfer matrix estimation problem in the label noise learning problem into a mixing proportion estimation problem, and solving a mixing proportion coefficient based on an improved mixing proportion estimation method to further obtain a conditional transfer matrix; the specific implementation details are as follows:

step 1: hybrid ratio estimation problem construction

Assume that the noise label in the taxpayer registration information is

The true label of the sample is Y, assuming sample X and noise label

Independently of each other, for any class C ∈ C there is:

note the book

P _i ＝P(X|Y＝i)、

Where Q represents the conditional transition probability of a noisy tag to a true tag, the above equation is expressed in matrix form as follows:

further decomposing the matrix to obtain the following form; where H is a c x c matrix and satisfies that the diagonal element is 0, and G is a real diagonal matrix shaped as c x c;

according to the nature of matrix transformation, it can be seen that matrix H, matrix G, and matrix Q satisfy the following relationships:

(i-H) ^-1 G＝Q ^T

here Q is ^T The matrix is a conditional transition matrix in label noise learning, and the above relationship indicates that if the matrix H is solved, the conditional transition matrix is further solved, and the decomposition of the matrix is equivalent to the following c equations:

the equation is further expressed in the form:

wherein the following are satisfied:

the standard mixing ratio estimation problem is expressed in the form: f ═ kH + (1-k) G (k ≧ 0), where fh G is the probability distribution function, and samples sampled at distribution F, H are assumed to be known, where F is the mixture and H, G is the composition; the equation obtained by the above matrix decomposition:

it is the standard mixing ratio estimation problem, the mixing ratio coefficient H estimated by which is the mixing ratio estimation problem _ij It is the elements of matrix H; therefore, by solving a series of mixed proportion estimation problems, the H matrix can be solved, and the conditional transition matrix Q is estimated according to the matrix relation ^T Therefore, a classifier with consistent risk is constructed based on the tag noise data, and taxpayer industry classification is carried out;

step 2: regeneration of compositions

The solution of the mixed scaling problem relies on the labeling of the anchor point, specifically, the maximum estimator of the mixed scaling coefficients if the anchor point samples are present and known

Is an unbiased estimate of the true mixture scaling factor k;

specifically, first, the mixture F sample is labeled as a positive sample class Y ═ 1, the labeled composition component H sample is labeled as a negative sample class Y ═ 1, an MLP network is constructed to perform binary prediction, and the output of the network is assumed to be F _η (X), wherein X is sample characteristics, eta is network parameters, the MLP network is supervised trained by using noisy positive and negative samples, after training, the posterior probability prediction is carried out on the samples of the positive sample class by using the network, a threshold value tau is selected, and the sample set of the positive sample class is recorded as

Set of negative sample class samples as

Inputting samples of the positive sample class into the network for prediction, wherein the sample set with a prediction value smaller than a selected threshold value is recorded as

Then there is

Bringing the samples with the posterior probability ratio smaller than the threshold value into a negative sample set to respectively obtainPositive and negative sample sets after reconstruction:

and

satisfy the requirement of

And

therefore, the regeneration of the composition sample is completed, and the problem of dependence of the traditional mixed proportion estimation method on the anchor point is solved;

step 3: probability density estimation based on kernel density estimation

On the basis of reconstructing the composition at Step2, estimating a probability density function of sample distribution based on a kernel density estimation method; specifically, a kernel function is established for representing the probability density estimation of the existing sample to any point in the feature space, wherein x is a point in the feature space, and x is _i Is a known sample; and μ is the sample mean, and Σ is ρ ² Q is the covariance matrix of the sample, then sample x is the case using a Gaussian kernel _i The contribution to the probability density at x represents the form of the kernel function as follows:

then over the entire sample set, the probability density function estimator is:

wherein

For sets of samples, from the positive and negative sample sets already obtained

The probability density function of the reconstructed positive and negative samples is estimated as follows:

step 4: conditional branch matrix estimation

Sequentially solving c mixing proportion estimation problems constructed in Step1, solving corresponding c-1 mixing proportion coefficients for any mixing proportion estimation problem, and setting the noise label of the mixture as

Noise signatures of compositions are

Collecting original samples

Set of positive and negative samples, respectively, as among the mixture ratio estimation problem

Method based on Step2 generates new positive and negative sample sets

And

and carrying out probability density estimation according to the kernel density estimation method of Step3 to respectively obtain

And

then, the maximum estimation quantity of the mixing proportion coefficient is estimated by adopting the method of maximum estimation of the mixing proportion problem in Step1

Where G is a legal probability density function, estimator

I.e. element H _ij (i ≠ j) through circulation and repetition of the steps 2,3 and 4, all elements of the H matrix can be solved, and then the G matrix and the conditional transition matrix Q can be solved according to the following properties ^T ；

(I-H) ^-1 G＝Q ^T 。

The further improvement of the invention is that in the step 4), the training device network parameter learning and taxpayer industry classification comprises the following specific steps:

step 1: training device learning based on tag noise data

Assume that the network parameter in the training apparatus is η and the noise sample is

The network parameter set is combined as w, the network parameter in the training device is learned by using the label noise data as supervision, and the output of the memory sample X under the mapping of the training device is g _η (X) for g _η (X) and

making cross entropy loss, and adding a regularization term to prevent overfitting, wherein lambda is a regularization term control coefficient, minimizing a loss function, and an optimization objective is as follows:

under control of the optimization objective, a network of training devices is used to predict noise signatures of input samples

Output result g _η (X) performing operation through a softmax layer, wherein the sotfmax operation performs exponential normalization processing on the original output, and the original output is expressed as a predicted value of posterior probability; specifically, assume that the original network output is

Performing exponential operation on the output vector and performing normalization processing by softmax, wherein the output is in the following form;

step 2: constructing a conditional transition matrix layer

After finishing learning the network parameters of the training device, outputting g of the network _η (X) outputting the posterior probability of the sample through softmax operation

The method is used for predicting the noise label, a conditional branch layer is added behind a softmax layer to serve as a branch layer, and the conversion from noise label prediction to real label prediction is realized;

step 3: taxpayer industry classification

On the basis of constructing the conditional transition layer, for a newly input sample X, the output of the TextCNN network is q (X), and the output is calculated

And obtaining a subscript r corresponding to the maximum component q (X), namely the trade classification corresponding to the taxpayer.

The further improvement of the invention is that in Step2 of Step 4), the specific method is as follows: let the noise label be

The true sample label is Y, the total number of classes is C, and sample characteristics X and noise labels are assumed

Independent of each other, for any class

Comprises the following steps:

the original network output g is then _η (X) passing through conditional transition matrix Q ^T The conversion can convert the original output into a new output Q (x) satisfying Q (x) ═ Q ^T g (X), where the new output q (X) is the posterior probability of the genuine tag

Wherein q is ⁱ (X) (i ═ 1, 2., C) is the i-th component of q (X), representing that X is the probability prediction value P of the i-th class of true tags (Y ═ i | X).

The invention has at least the following beneficial technical effects:

the invention provides a label noise learning method facing taxpayer industry classification, compared with the prior art, the invention has the advantages that:

(1) the invention creatively converts the conditional transition matrix estimation problem in label noise learning into a mixed proportion estimation problem, and constructs classifiers with consistent risks based on label noise data by solving the mixed proportion estimation problem. The method is different from the method that the prior art scheme depends on semantic clustering, and the method does not depend on an additional clustering method, so that a new error caused by the performance limitation of the clustering method is avoided.

(2) The invention expands the traditional mixed proportion estimation method from dichotomy to multi-classification scenes, is different from the situation that the traditional method is limited to two classifications, and the improved mixed proportion estimation method can be applied to the multi-classification situation and has wider application scenes.

(3) The invention overcomes the problem of dependence of the traditional mixed proportion estimation method on the anchor point, is different from the requirement of the traditional method on the anchor point marking, constructs a completely new mixed proportion estimation problem based on the method for regenerating the composition, and realizes the direct estimation of the mixed proportion coefficient under the condition of not depending on the anchor point marking.

Drawings

FIG. 1 is an overall framework flow diagram.

Fig. 2 is a flow chart of taxpayer business information processing.

FIG. 3 is a flowchart of the taxpayer industry classification network construction and training device initialization.

Fig. 4 is a flow chart of conditional branch matrix estimation.

FIG. 5 is a flow chart of network parameter learning and taxpayer industry classification for the training device.

FIG. 6 is a schematic diagram of a tag noise learning network.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

As shown in fig. 1, in an embodiment of the present invention, the label noise learning method for taxpayer industry classification according to the present invention includes the following steps:

step 1. taxpayer industry information processing

As shown in fig. 2, the method for extracting the text information and the non-text information of the taxpayer and performing information processing includes the following steps:

s101, taxpayer industry text information preprocessing

Illegal characters such as special symbols, numbers, quantifiers, etc. in the taxpayer text information are deleted (fig. 2S 101). In an embodiment, a total of 3 text features are extracted as the text information features of the taxpayer, including: { taxpayer name, registered address, and operation scope }, wherein the taxpayer name is ' xi ' an city xin Yao ceramic SI limited science and technology company ', a special symbol SI (fig. 2S101) is deleted firstly, and sequence segmentation is performed according to words to obtain { xi, an, city, xin, Yao, pottery, porcelain, limited, department, skill, public, department }.

S102, embedding text words based on XLNET pre-training network

Word embedding is performed on the text based on a text pre-training network XLNET (FIG. 2S102), forming a word vector. In this embodiment, assuming the encoding length is t, the XLNet text pre-training network embeds the original lemmas into word vectors with length t. If the original text sequence length is 13, the XLNet pre-training network may map the text to a 13 × t text feature, specifically, in the embodiment, if t is 528, a 13 × 528 text feature may be obtained (fig. 2S 102).

S103, taxpayer industry text feature generation

Based on an XLNET text pre-training network, repeating the process of S102, performing word embedding on all text characteristic sequences, and further splicing the embedded word vectors to form taxpayer text characteristics (FIG. 2S 103).

In particular, in the embodiment, it is assumed that there are 3 items in total for the taxpayer industry text feature, including: { taxpayer name, registration address, business range }, and 3 text features are respectively mapped to 13 × 528, 7 × 528, and 10 × 528 text features, and the text features are spliced to obtain an overall taxpayer text feature (as shown in fig. 2S103) with a shape of 30 × 528.

S104, tax payer industry value feature processing

And extracting numerical characteristics of the taxpayer industry, including 4 numerical characteristics in total, namely { registered fund, total investment, total asset amount and interest liability } and carrying out standardized operation.

Specifically, in the present embodiment, the sample mean μ of the 4 columns of features is first calculated ₁ ，μ ₂ ，...，μ ₄ And the sample variance σ ₁ ，σ ₂ ，...，σ ₄ Let X be written _i The value of the ith numerical characteristic of the sample X is obtained through a z-score formula

The numerical feature is normalized (fig. 2S 104).

S105, taxpayer industry category feature processing

The category information is encoded based on a one-hot encoding technique. In this embodiment, the selecting 2 category features for encoding specifically includes: { unit property, accounting method }, wherein one item of the unit property includes five cases of enterprises, civil non-enterprise units, public institutions, social groups, and the like. The corresponding one-hot codes are {10000, 01000, 00100, 00010, 00001}, respectively, and the one-hot codes are performed on all the category feature information (fig. 2S 105).

S106, non-text characteristic generation of taxpayer industry

And constructing a linear network mapping layer, mapping the obtained numerical characteristics and the category characteristics into the dimension which is the same as the dimension of the text characteristics, and then splicing the numerical characteristics and the category characteristics to form the nontext characteristics of the taxpayer industry.

Specifically, in an embodiment, linear network mapping layers with shapes of 1 × 528 and 5 × 528 are established, respectively. The method is used for mapping the numerical features and the category features to the same dimension of the text features, and then splicing is carried out to form a non-text feature matrix (fig. 2S 106).

S107, taxpayer characteristic information generation

And splicing the taxpayer text characteristics obtained in the step S103 and the taxpayer non-text characteristics obtained in the step S106 to finally form taxpayer industry characteristic information.

In an embodiment, the text feature with the shape of 30 × 528 and the non-text feature with the shape of 6 × 258 are spliced to form the final taxpayer industry feature information, which has the shape of 36 × 528 (fig. 2S 107).

Step2, taxpayer industry classification network construction and training device initialization

As shown in fig. 3, the TextCNN network is established for taxpayer industry classification, and the shape of the TextCNN convolution kernel and the input and output dimensions are sequentially determined according to the generated taxpayer industry characteristics and the target total number to be classified. And connecting the XLNET text pre-training network and the TextCNN network in series to form a training device, and performing end-to-end training on the training device based on the label noise data for initializing the network parameters of the training device.

S201. taxpayer industry classification network construction

Constructing a TextCNN network for taxpayer industry classification, wherein the TextCNN network comprises three layers: a convolutional layer, a pooling layer, and a fully-connected layer.

Specifically, in the embodiment, according to the situation of the taxpayer text feature, a convolution kernel is established, the line feature of the feature map is extracted, in the embodiment, a convolution kernel with the shape of n × 528 is used, where n is {2, 3,4, 5, 6}, a maximum pooling layer is established, further feature compression and extraction are performed on the features after convolution, a full connection layer is finally established, and it is assumed that the number of the total features of the feature map output after the pooling layer is n ₁ If the total number of categories is c, the shape is n ₁ And x c, in this embodiment, c is 97 (fig. 3S 201).

S204. training device initialization

Connecting the XLNET text pre-training network in the step 1) and the constructed TextCNN network in series to form a training device. And performing end-to-end training based on the label noise data, and initializing network parameters of the training device.

In an embodiment, taxpayer industry label noise data is used as input, noise label is predicted, and end-to-end training is performed for initializing network parameters (as in fig. 3S 202). Assume a network parameter of α and a noise sample of

The set of network parameters is w, and the memory sample X is output as

For the

And

step3, solving conditional transition matrix

As shown in fig. 4, firstly, a mixture ratio estimation problem is constructed, so that the original conditional transition matrix estimation problem is converted into a mixture ratio estimation problem, secondly, a brand-new mixture ratio estimation problem is constructed based on a composition regeneration method, probability density is estimated according to a kernel density estimation method, and then a mixture ratio coefficient is solved, and a conditional transition matrix is estimated. The specific steps are as follows:

s301, construction of mixed proportion estimation problem

In the present embodiment, it is assumed that the noisy label in the taxpayer registration information is

The sample is X, the true label of the sample is Y, if the sample is X and the label with noise is Y

Independently of each other, the following relationships are given:

meanwhile, the above relationship may be converted into the following form:

it follows that the above c equations are equivalent to the c standard mixing ratio problem. In the embodiment, the total number of the objects to be classified is c, which is 97, and if the matrix H and the matrix G can be obtained, the original equation can be obtained

Further, the overall conditional branch matrix is obtained, and therefore the original conditional branch matrix estimation problem is converted into the mixture ratio estimation problem (fig. 4S 301).

S302, regeneration of composition

In an embodiment, the hypothesis corresponds to a noise label class

Respectively is

And

using the { i, j } classes as positive and negative sample sets respectively

And

designing a two-class network for prediction, assuming the output of the network is f _η (X), wherein X is the sample feature after dimensionality reduction of the input. η is a parameter of the network. Using positive and negative samples to perform sensor networkSupervised training, after training of the network is completed. And carrying out posterior probability prediction on the samples of the positive sample class by using the network. Selecting a threshold value tau, and correcting a sample class sample set into

Set of negative examples class examples as

The sample set of positive sample class whose output is less than the selected threshold value through network prediction is recorded as

Then there is

With a posterior probability ratio less than a threshold

Copying the sample set to a negative sample set to obtain a reconstructed positive sample set and a reconstructed negative sample set:

and

and satisfy

And

thereby completing the regeneration of the sample (as in fig. 4S 302).

S303. probability density function estimation

For the new sample obtained in S302

And

probability density function estimation is performed on the sample set, and the estimated functions (as shown in fig. 4S303) obtained by using the kernel density estimation method are respectively as follows:

s304. solving conditional transition matrix

A dual cycle structure is established, and the outer layer and the inner layer are sequentially traversed through cycle

And

and if i ≠ j, the processes of S302 and S303 are executed circularly, and the mixing ratio coefficient is further calculated

The mixing ratio coefficient is H _ij Then obtaining a G matrix according to the following relation;

based on the obtained H matrix and G matrix, the following relationships may be followed: (I-H) ^-1 G＝Q ^T Obtaining a conditional transition matrix Q ^T (as in fig. 4S 304).

Step4, training device network parameter learning and taxpayer industry classification

As shown in fig. 5, training the training device based on the tag noise data for learning the network parameters of the training device, and adding a conditional transition layer after the training device to complete taxpayer industry classification, which includes the following steps:

s401, learning of a training device based on tag noise data

In this embodiment, it is assumed that the input to the training apparatus is a noisy data sample

Wherein X is 36 × 528 input feature vector, and is mapped to g through network _η The 97-dimensional output vector of (X). For noise tag

And network output g _η (X) making cross entropy loss, training network parameters according to the loss function, and marking the trained network parameters as eta (FIG. 5S401)

S402, constructing a conditional transfer matrix layer

The conditional branch matrix layer is added after the training device and prediction is performed for the new sample.

Specifically, in the present embodiment, the calculated 97 × 97 conditional transition matrix Q is used ^T As a conditional transfer layer. Outputs g of the original _η (X) is converted to Q (X), i.e. Q (X) ═ Q ^T g _η (x) Here, q (X) denotes the prediction of the true label for sample X. Wherein q is ⁱ (X) is the ith component of q (X), representing the probability that sample X belongs to class i (fig. 5S 402).

S403. taxpayer industry classification

As shown in fig. 6, text information and non-text feature information of the taxpayer are respectively extracted, taxpayer industry features are extracted through a feature extraction module, a condition transfer matrix is estimated based on the extracted features to serve as a final condition transfer layer of the training device, and taxpayer industry classification is performed based on the training device. Specifically, in the embodiment, assuming that the taxpayer characteristic information is X, the output of the training device is q (X), where q (X) is the true label prediction of the sample X, and q (X) is recorded ⁱ (X) (i ═ 1, 2.., 97) for the ith component of q (X), choosing the subscript corresponding to the largest component

As classifications for taxpayer industry(fig. 5S 403).

It will be understood by those skilled in the art that the foregoing is only exemplary of the method of the present invention and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A label noise learning method for taxpayer industry classification is characterized by comprising the following steps:

2. The label noise learning method for taxpayer industry classification as claimed in claim 1, wherein the method specifically comprises the following steps:

1) taxpayer industry information processing

3) conditional branch matrix estimation

3. The label noise learning method for taxpayer industry classification as claimed in claim 2, wherein in the step 1), the taxpayer industry information processing specifically comprises the following steps:

step 1: taxpayer industry text information preprocessing

step 2: text word embedding based on XLNET pre-training network

step 3: taxpayer industry text feature generation

Generating a taxpayer text feature matrix;

step 4: tax payer industry value feature processing

Standardizing the numerical characteristics of the nontext characteristics of the taxpayers, assuming that n training samples and m numerical characteristics are shared, and recording the value of the jth numerical characteristic of the ith sample as X _ij The mean value of the jth numerical characteristic is mu _j Satisfy the following requirements

The numerical characteristic after normalization is

Step 5: tax payer industry category feature processing

step 6: taxpayer industry non-text feature generation

M normalized numerical features and a shape of v × N were obtained after Step4 and Step5, respectively _max Class feature matrix of (1), where N _max Expressing the longest class code length, then establishing two linear network layers for feature mapping, wherein the shape of the first linear network layer is 1 × t and is used for converting the numerical features after standardization into a numerical feature matrix of m × t, and the shape of the second linear network layer is N _max The xt is used for mapping the category characteristics into a v xt category characteristic matrix, and splicing the two mapped characteristic matrices to obtain a final (v + m) xt non-text characteristic matrix;

step 7: taxpayer characteristic information generation

As final characteristic information.

4. The label noise learning method for taxpayer industry classification as claimed in claim 3, wherein in the step 2), the taxpayer industry classification network construction and training device is initialized: establishing a TextCNN network for text classification, wherein the TextCNN network comprises three layers: (1) a convolutional layer, (2) a maximum pooling layer and (3) a full connection layer, wherein an XLNET pre-training network in the step 1) is connected with a TextCNN network in series to construct a training device, and end-to-end training is carried out by taking taxpayer label noise data as supervision; specific implementation details are as follows:

step 1: taxpayer industry classification network construction

step 2: training device initialization

5. The method as claimed in claim 4, wherein in Step2 of Step 2), let the net parameter be α, the sample be X, and the noise label be

The set of network parameters is w, and the memory sample X is output as

For

And

6. the label noise learning method for taxpayer industry classification as claimed in claim 5, wherein in step 3), the conditional transition matrix estimation: converting a conditional transfer matrix estimation problem in the label noise learning problem into a mixing proportion estimation problem, and solving a mixing proportion coefficient based on an improved mixing proportion estimation method to further obtain a conditional transfer matrix; the specific implementation details are as follows:

step 1: hybrid ratio estimation problem construction

Assume that the noise label in the taxpayer registration information is

The true label of the sample is Y, assuming sample X and noise label

Independently of each other, for any class C ∈ C there is:

note the book

(I-H) ^-1 G＝Q ^T

the equation is further expressed in the form:

wherein the following are satisfied:

the standard mixing ratio estimation problem is expressed in the form: f ═ κ H + (1-k) G (k ≧ 0), where FHG is the probability distribution function, and samples sampled at distribution F, H are assumed to be known, where F is the mixture and H, G is the composition; the equation obtained by the above matrix decomposition:

step 2: regeneration of compositions

The solution of the mixed scaling problem relies on the labeling of the anchor point, specifically, the maximum estimator of the mixed scaling factor if the anchor point sample is present and known

Is an unbiased estimate of the true mixture scaling factor k;

Set of negative sample class samples as

Then there is

Bringing the samples with the posterior probability ratio smaller than the threshold value into a negative sample set, and respectively obtaining a positive sample set and a negative sample set after reconstruction:

and

satisfy the requirements of

And

step 3: probability density estimation based on kernel density estimation

On the basis of reconstructing the composition at Step2, estimating a probability density function of sample distribution based on a kernel density estimation method; specifically, a kernel function is established for representing the probability density estimation of the existing sample to any point in the feature space, wherein x is a point in the feature space, and x is _i Is a known sample; and μ is the sample mean, and Σ is ρ ² Q is the covariance matrix of the sample, then sample x using a Gaussian kernel function _i For x place probability densityThe equation for the kernel is as follows:

then over the entire sample set, the probability density function estimator is:

wherein

step 4: conditional branch matrix estimation

Noise signatures of compositions are

Collecting the original samples

Respectively as sets of positive and negative samples in a mixture ratio estimation problem

Method based on Step2 generates new positive and negative sample sets

And

And

Where G is a legal probability density function, estimator

I.e. the element H _ij And (i ≠ j) estimating values, repeating the processes Step2,3 and 4 to solve all elements of the H matrix through circulation, then solving the G matrix according to the following properties, and further solving the conditional transition matrix Q ^T ；

(I-H) ^-1 G＝Q ^T 。

7. The label noise learning method for taxpayer industry classification as claimed in claim 6, wherein in the step 4), the training device network parameter learning and taxpayer industry classification comprises the following specific steps:

step 1: training device learning based on tag noise data

under the control of the optimization objective, a network of training devices is used to predict the noise signature of the input samples

Output result g _η (X) performing operation through a softmax layer, wherein the sotfmax operation performs exponential normalization processing on the original output, and the original output is expressed as a predicted value of posterior probability; in particular, assume that the original network output is

step 2: constructing a conditional transition matrix layer

step 3: taxpayer industry classification

8. The label noise learning method for taxpayer industry classification as claimed in claim 7, wherein in Step2 of Step 4), the specific method is as follows: let the noise label be

Independently of each other, for any class C ∈ C there is:

the original network output g is then _η (X) passing through conditional transition matrixQ ^T The conversion can convert the original output into a new output Q (X) which satisfies Q (X) Q ^T g (X), where the new output q (X) is the posterior probability of the genuine tag