CN114817546B

CN114817546B - Label noise learning method for taxpayer industry classification

Info

Publication number: CN114817546B
Application number: CN202210498954.4A
Authority: CN
Inventors: 郑庆华; 曹书植; 阮建飞; 赵锐; 董博; 师斌
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-05-09
Filing date: 2022-05-09
Publication date: 2024-09-10
Anticipated expiration: 2042-05-09
Also published as: CN114817546A

Abstract

The invention discloses a label noise learning method for taxpayer industry classification, comprising the following steps: first, text information and non-text information are extracted from the taxpayer industry information, and text embedding and non-text encoding are performed using an XLNet text pre-training network and encoding techniques, respectively, to obtain feature information; second, a TextCNN network for taxpayer industry classification is constructed, with the number of layers, the convolution kernel shape and the input and output dimensions of each layer determined from the feature information and the number of target classes, and the XLNet text pre-training network and the TextCNN network are connected in series and supervised with noisy taxpayer industry labels to form an end-to-end training device; third, a conditional transfer matrix is estimated with an improved mixture proportion estimation method; finally, the network parameters of the training device are learned, and the conditional transfer matrix is appended as a linear layer after the TextCNN network to convert noisy-label predictions into real taxpayer industry label predictions, thereby performing taxpayer industry classification.

Description

Label noise learning method for taxpayer industry classification

Technical Field

The invention belongs to the technical field of text classification with label noise, and particularly relates to a label noise learning method for taxpayer industry classification.

Background

In recent years, the market economy has continued to flourish, the number of enterprises has kept growing, and the division of industries has become increasingly fine-grained. As a result, upgrading and further developing the tax system has become urgent.

Taxpayer industry classification is a precondition for determining the applicable tax policies and preferential treatment, and is an important link in tax collection. China currently divides taxpayer industries into 20 categories and 97 major classes. Because there are so many classes, traditional manual classification consumes a great deal of human resources and is limited by the expertise and experience of the classifier; it inevitably introduces classification errors, that is, label noise in taxpayer industry classification, which adversely affects national statistics, taxation and business administration.

In recent years, with the arrival of the "intelligence+" era, the artificial intelligence industry has developed rapidly and been applied in many fields, making intelligent taxation possible. Taxpayer industry classification for enterprises is the basic work of classified tax-source management and a key prerequisite for intelligent tax informatization. How to train a classifier on the existing label-noise data by means of machine learning, so that it classifies taxpayer industries correctly, has therefore become an urgent problem.

The related invention patents addressing the taxpayer industry classification problem are as follows:

Document 1: Two-level taxpayer industry classification method based on a MIMO recurrent neural network (201910024324.1)

Document 2: Taxpayer industry classification method based on noisy label learning (202110201214.5)

Document 1 designs a GRU-based multi-input multi-output neural network structure, establishes a mapping from major industry classes to detailed industry classes, and builds a two-level classification structure for taxpayer industry classification. However, this method relies on accurately labeled data and lacks practical value in the presence of label noise.

Document 2 designs a BERT-CNN network for text classification and, using a semantic clustering-based method, constructs a classification-consistent classifier from label-noise data. However, the limited performance of the semantic clustering method introduces new errors into the classifier.

To address these shortcomings, the invention aims to overcome the classification bias caused by semantic clustering without relying on additional manual labeling, and to construct a classifier from label-noise data alone, such that it has, in a statistical sense, the same classification risk as a classifier built from correctly labeled data.

The core of constructing a risk-consistent classifier from label-noise data is to estimate a conditional transition matrix (a matrix of conditional probabilities of real labels given noisy labels). The invention converts the problem of estimating the conditional transfer matrix into a mixture proportion estimation problem and obtains an approximate conditional transfer matrix by estimating the mixing proportion coefficients. However, traditional mixture proportion estimation methods apply only to binary classification and depend on anchor points (samples that definitely belong to a given class), whereas taxpayer industry classification involves many industry classes and is therefore a multi-class problem, and anchor points are difficult to label and obtain. Extending mixture proportion estimation from binary to multi-class classification and removing the dependence on anchor points are therefore the main challenges addressed by the invention.

Disclosure of Invention

The invention aims to provide a label noise learning method for taxpayer industry classification that estimates a conditional transfer matrix (a matrix formed by the conditional probabilities of real labels given noisy labels) from label-noise data and uses it to construct a risk-consistent classifier.

The invention is realized by adopting the following technical scheme:

A label noise learning method for taxpayer industry classification comprises the following steps:

First, text information and non-text information are extracted from the taxpayer industry information, and text embedding and non-text encoding are performed using an XLNet text pre-training network and encoding techniques, respectively, to obtain feature information; second, a TextCNN network for taxpayer industry classification is constructed, with the number of layers, the convolution kernel shape and the input and output dimensions of each layer determined from the feature information and the number of target classes, and the XLNet text pre-training network and the TextCNN network are connected in series and supervised with noisy taxpayer industry labels to form an end-to-end training device; third, a conditional transfer matrix is estimated with an improved mixture proportion estimation method; finally, the network parameters of the training device are learned, and the conditional transfer matrix is appended as a linear layer after the TextCNN network to convert noisy-label predictions into real taxpayer industry label predictions, thereby performing taxpayer industry classification.

The invention is further improved in that the method specifically comprises the following steps:

1) Taxpayer industry information processing

Taxpayer information processing comprises text information processing and non-text information processing. First, word segmentation and word embedding are performed on the taxpayer text information with the XLNet text pre-training network to form word vectors, which are concatenated to generate the text features. Second, the numerical features and category features in the taxpayer non-text information are preprocessed with standardization and one-hot encoding, respectively, and a linear network layer then maps them to non-text features whose dimension matches that of the text features. Finally, the text features and non-text features are concatenated to form the feature information;

2) Taxpayer industry classification network construction and training device initialization

A TextCNN network for taxpayer industry classification is constructed, comprising a convolution layer, a pooling layer and a fully connected layer. The number of layers, the convolution kernel shape and the input and output dimensions of each layer of the TextCNN network are determined in turn from the feature information obtained in step 1) and the number of target classes. The XLNet pre-training network and the TextCNN network are connected in series and supervised with noisy taxpayer industry labels to form an end-to-end training device;

3) Conditional transition matrix estimation

Using kernel density estimation, probability density functions are estimated from the noisy taxpayer industry data; the conditional transfer matrix estimation problem is converted into a mixture proportion estimation problem; the corresponding mixing proportion coefficients are solved with the improved mixture proportion estimation method; and the conditional transfer matrix is obtained from them;

4) Training device network parameter learning and taxpayer industry classification

The network parameters of the training device are learned from the label-noise data. After training is completed, the estimated conditional transfer matrix is added after the training device as a linear conversion layer, which converts noisy-label predictions into real-label predictions and thereby realizes taxpayer industry classification.

The invention is further improved in that, in step 1), the taxpayer industry information processing specifically comprises the following steps:

Step1: Taxpayer industry text information preprocessing

The text information of the taxpayer is extracted, and special symbols, numbers, measure words and other meaningless symbols are deleted from it, which completes the preprocessing of the taxpayer text information;

Step2: Text word embedding based on the XLNet pre-training network

The text is encoded with the XLNet pre-training network to generate word vectors. The XLNet pre-trained model is based on the Transformer design and captures context in both directions; it mitigates the inconsistency between the pre-training stage and the fine-tuning stage caused by the mask mechanism of the BERT model, and it uses a two-stream self-attention mechanism, which makes pre-training more effective. The XLNet model applied to Chinese uses a 24-layer network structure and SentencePiece for word segmentation. The text features obtained in Step1 are encoded with the Chinese version of XLNet to obtain word vectors;

Step3: Taxpayer industry text feature generation

Assume that the taxpayer has k text features in total and that the XLNet pre-training network maps each token to a t-dimensional word vector. If the i-th text feature has $h_i$ tokens, it is mapped to an $h_i \times t$ matrix. The feature matrices of all text features are concatenated, so that the text features of the sample are mapped to a $\left(\sum_{i=1}^{k} h_i\right) \times t$ matrix, which is the taxpayer text feature matrix;

Step4: Taxpayer industry numerical feature processing

The numerical features among the taxpayer's non-text features are standardized. Assume there are n training samples and m numerical features, and let $X_{ij}$ denote the value of the j-th numerical feature of the i-th sample. The mean of the j-th numerical feature is $\mu_j$, which satisfies $\mu_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$; the standard deviation of the j-th numerical feature is $\sigma_j$, which satisfies $\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{ij}-\mu_j\right)^2}$; the standardized numerical feature is $X'_{ij} = \frac{X_{ij}-\mu_j}{\sigma_j}$;
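The standardization in Step4 can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `standardize_columns` and the sample values are hypothetical, and the use of the population standard deviation (divisor n) is an assumption rather than something the text fixes.

```python
import math


def standardize_columns(rows):
    """Z-score every numerical feature (column) over the n training samples."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    # Assumption: population standard deviation (divisor n).
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
        for j in range(m)
    ]
    # A constant column has zero spread; keep it centred instead of dividing by 0.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in rows]


# Hypothetical values: 3 taxpayers, 2 numerical features.
samples = [[100.0, 5.0], [200.0, 7.0], [300.0, 9.0]]
standardized = standardize_columns(samples)
```

After this step every column has zero mean and unit spread, so features measured on very different scales contribute comparably to the later linear mapping.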

Step5: Taxpayer industry category feature processing

The category features among the taxpayer's non-text features are encoded. If a category feature has N possible values, it is encoded as an N-dimensional vector: the position corresponding to the feature's value is set to 1 and all other positions are set to 0, i.e. one-hot encoding. After all category features have been encoded, each code is padded to the longest encoding length among the category features, and the padded vectors are stacked to form the category feature matrix;
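As a minimal sketch of Step5, the snippet below one-hot encodes each category feature and zero-pads every code to the longest encoding length. The vocabularies, the field names and the choice to pad on the right are assumptions made for illustration only.

```python
def one_hot(value, vocabulary, padded_length):
    """One-hot encode `value`, zero-padded on the right to `padded_length`."""
    vec = [0] * padded_length
    vec[vocabulary.index(value)] = 1
    return vec


# Hypothetical category vocabularies; names and values are illustrative only.
vocabularies = {
    "unit_property": ["enterprise", "public institution", "social group", "other"],
    "accounting_method": ["independent", "non-independent"],
}
n_max = max(len(vocab) for vocab in vocabularies.values())  # longest code length

taxpayer = {"unit_property": "social group", "accounting_method": "independent"}
# One row per category feature: a v x N_max category feature matrix.
category_matrix = [
    one_hot(taxpayer[name], vocab, n_max) for name, vocab in vocabularies.items()
]
```

Padding gives every category feature the same row length, which is what allows the rows to be stacked into a single matrix.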

Step6: Taxpayer industry non-text feature generation

Step4 and Step5 yield m standardized numerical features and a category feature matrix of shape $v \times N_{max}$, where v is the number of category features and $N_{max}$ is the longest category encoding length. Two linear network layers are then established for feature mapping: the first has shape $1 \times t$ and converts the standardized numerical features into an $m \times t$ numerical feature matrix; the second has shape $N_{max} \times t$ and maps the category features into a $v \times t$ category feature matrix. The two mapped matrices are concatenated to obtain the final non-text feature matrix of shape $(v+m) \times t$;
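The shape bookkeeping of Step6 can be checked with a small pure-Python sketch. The weights are random placeholders, bias terms are omitted, the width t = 4 is chosen only to keep the example small, and the order in which the two blocks are stacked is an assumption.

```python
import random


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


random.seed(0)
t = 4      # illustrative width; the embodiment uses a much larger t
n_max = 5  # longest one-hot code length

w_num = [[random.gauss(0, 0.1) for _ in range(t)]]                        # 1 x t
w_cat = [[random.gauss(0, 0.1) for _ in range(t)] for _ in range(n_max)]  # N_max x t

numeric = [[0.3], [-1.2], [0.8]]                    # m = 3 standardized values, m x 1
categories = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]     # v = 2 one-hot rows, v x N_max

numeric_features = matmul(numeric, w_num)        # m x t
category_features = matmul(categories, w_cat)    # v x t
# Stacking order is an assumption; the result has (v + m) rows of width t.
non_text = category_features + numeric_features
```

Because each category row is one-hot, the second mapping simply selects one row of `w_cat`, i.e. it behaves like an embedding lookup.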

Step7: Taxpayer feature information generation

The text feature matrix generated in Step3 and the non-text feature matrix generated in Step6 are concatenated to generate a matrix of shape $\left(\sum_{i=1}^{k} h_i + v + m\right) \times t$ as the final feature information.

The invention is further improved in that, in step 2), the taxpayer industry classification network is constructed and the training device is initialized as follows: a TextCNN network is built for text classification, comprising three layers: (1) a convolution layer, (2) a max pooling layer and (3) a fully connected layer; the XLNet pre-training network of step 1) is connected in series with the TextCNN network to form the training device, which is trained end to end with the taxpayer label-noise data as supervision. The specific implementation details are as follows:

Step1: Taxpayer industry classification network construction

A TextCNN network is constructed for taxpayer industry classification, comprising a convolution layer, a pooling layer and a fully connected layer. Specifically, the TextCNN convolution layer uses convolution kernels of shape $n \times t$ to perform convolution and extract row-wise features. TextCNN uses a max pooling layer as its pooling layer, which takes the maximum of each feature map after convolution and thereby further compresses and extracts the features. A fully connected layer is then established: if the total number of classes in the taxpayer industry classification is c and the number of features after the max pooling layer is s, a fully connected layer of shape $s \times c$ maps the feature information to a c-dimensional vector, on which the taxpayer industry classification is based;
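The three layers described above can be traced with a toy forward pass in plain Python. This is a sketch under simplifying assumptions, not the patented network: biases and activation functions are omitted, each kernel spans the full width t, and all numbers are hypothetical.

```python
def conv_valid(features, kernel):
    """Slide an n x t kernel down an L x t feature matrix (L - n + 1 outputs)."""
    n = len(kernel)
    out = []
    for start in range(len(features) - n + 1):
        window = features[start:start + n]
        out.append(sum(w * x
                       for k_row, f_row in zip(kernel, window)
                       for w, x in zip(k_row, f_row)))
    return out


def textcnn_forward(features, kernels, fc_weights):
    # Max pooling keeps one value per kernel, so s == len(kernels).
    pooled = [max(conv_valid(features, kernel)) for kernel in kernels]
    # Fully connected layer of shape s x c maps pooled features to c class scores.
    c = len(fc_weights[0])
    return [sum(pooled[i] * fc_weights[i][j] for i in range(len(pooled)))
            for j in range(c)]


# Hypothetical 4 x 2 feature matrix, two 2 x 2 kernels, and a 2 x 3 dense layer.
features = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernels = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
fc_weights = [[1, 0, -1], [0, 1, 1]]
scores = textcnn_forward(features, kernels, fc_weights)
```

The number of pooled features s equals the number of kernels, which is why the shape of the fully connected layer depends only on the kernel count and the class count c.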

Step2: Training device initialization

The XLNet text pre-training network of step 1) and the constructed TextCNN network are connected in series to form the training device. The taxpayer label-noise data are taken as input and the noisy label is predicted, forming an end-to-end device for training, and the network parameters of the training device are initialized.

The invention is further improved in that, in Step2 of step 2), the network parameter is $\alpha$, the sample is $X$, the noisy label is $\tilde{Y}$, and the network parameter set is w. The output of sample $X$ under the mapping of the training device is denoted $g_\alpha(X)$. The cross-entropy loss between $g_\alpha(X)$ and $\tilde{Y}$ is computed and a regularization term is added to prevent overfitting, where $\lambda$ is the regularization coefficient; the optimization objective is to minimize, with respect to w, the cross-entropy loss plus $\lambda$ times the regularization term.

The invention is further improved in that, in step 3), the conditional transfer matrix is estimated as follows: the conditional transfer matrix estimation problem of label noise learning is converted into a mixture proportion estimation problem, and the mixing proportion coefficients are solved with the improved mixture proportion estimation method to obtain the conditional transfer matrix. The specific implementation details are as follows:

Step1: Mixture proportion estimation problem construction

Let the noisy label in the taxpayer registration information be $\tilde{Y}$ and the real label of the sample be $Y$. Assuming that the sample $X$ and the noisy label $\tilde{Y}$ are independent of each other, then for any class $c \in C$:

$$P(X \mid \tilde{Y}=c) = \sum_{i \in C} P(Y=i \mid \tilde{Y}=c)\, P(X \mid Y=i)$$

Denote $P_i = P(X \mid Y=i)$ and $\tilde{P}_i = P(X \mid \tilde{Y}=i)$, and let $Q$ represent the conditional transition probabilities from noisy labels to real labels. The equation above is expressed in matrix form as follows:

$$\begin{bmatrix}\tilde{P}_1 \\ \vdots \\ \tilde{P}_c\end{bmatrix} = Q^{T} \begin{bmatrix}P_1 \\ \vdots \\ P_c\end{bmatrix}$$

The matrix form is further decomposed as follows, where $H$ is a $c \times c$ matrix whose diagonal elements are 0 and $G$ is a $c \times c$ real diagonal matrix:

$$\begin{bmatrix}\tilde{P}_1 \\ \vdots \\ \tilde{P}_c\end{bmatrix} = H \begin{bmatrix}\tilde{P}_1 \\ \vdots \\ \tilde{P}_c\end{bmatrix} + G \begin{bmatrix}P_1 \\ \vdots \\ P_c\end{bmatrix}$$

According to the properties of matrix transformations, the matrices $H$, $G$ and $Q$ satisfy the following relationship:

$$(I-H)^{-1}G = Q^{T}$$

The matrix $Q^{T}$ is the conditional transfer matrix in label noise learning. The relation above shows that once the matrix $H$ is solved, the conditional transfer matrix can be solved as well. The decomposition is equivalent to the following c equations:

$$\tilde{P}_i = \sum_{j \neq i} H_{ij}\tilde{P}_j + G_{ii}P_i, \qquad i = 1, \dots, c$$

Each equation is further expressed as follows:

$$\tilde{P}_i = H_{ij}\tilde{P}_j + (1-H_{ij})\,G'_{ij}, \qquad j \neq i$$

where the following are satisfied:

$$G'_{ij} = \frac{\sum_{l \neq i,j} H_{il}\tilde{P}_l + G_{ii}P_i}{1-H_{ij}}, \qquad \sum_{j \neq i} H_{ij} + G_{ii} = 1$$

The standard mixture proportion estimation problem is expressed in the form $F = kH + (1-k)G$ ($k \geq 0$), where $F$, $H$ and $G$ are probability distributions, samples drawn from $F$ and $H$ are assumed to be known, $F$ is the mixture, and $H$ and $G$ are its components. Each equation obtained from the matrix decomposition above, $\tilde{P}_i = H_{ij}\tilde{P}_j + (1-H_{ij})G'_{ij}$, is exactly a standard mixture proportion estimation problem in which the mixing proportion coefficient to be estimated, $H_{ij}$, is an element of the matrix $H$. Therefore, by solving a series of mixture proportion estimation problems, the matrix $H$ can be solved, and the conditional transfer matrix $Q^{T}$ is then estimated from the matrix relation, so that a risk-consistent classifier is constructed from label-noise data and used for taxpayer industry classification;

Step2: Regeneration of the component

Solving the mixture proportion estimation problem depends on the labeling of anchor points; specifically, if anchor samples exist and are known, the maximal estimator $\hat{k}$ of the mixing proportion coefficient is an unbiased estimate of the true mixing proportion coefficient $k$;

Specifically, the samples of the mixture $F$ are first labeled as the positive class $Y=1$ and the samples of the component $H$ as the negative class $Y=-1$, and an MLP network is constructed for binary prediction. Let the output of the network be $f_\eta(X)$, where $X$ is the sample feature and $\eta$ is the network parameter. The MLP network is trained with supervision on these noisy positive and negative samples. After training, the network predicts the posterior probability of each positive-class sample, and a threshold $\tau$ is selected. Denote the positive-class sample set by $S_P$ and the negative-class sample set by $S_N$. The positive-class samples are fed into the network for prediction, and the set of samples whose predicted value is smaller than the selected threshold is denoted $S_\tau = \{x \in S_P : f_\eta(x) < \tau\}$, so that $S_\tau \subseteq S_P$. The samples whose posterior probability is smaller than the threshold are added to the negative sample set, which gives the reconstructed positive and negative sample sets $S_P'$ and $S_N'$, with $S_N' = S_N \cup S_\tau$. This completes the regeneration of the component samples and removes the dependence of the traditional mixture proportion estimation method on anchor points;
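The selection step described above can be sketched as follows. The trained MLP is replaced by a hypothetical scoring function `toy_posterior`, the samples are one-dimensional toy values, and the function deliberately returns only the selected samples and the enlarged negative set, because the text above does not make explicit how the positive set itself is updated.

```python
import math


def regroup(positive, negative, posterior, tau):
    """Append mixture samples scored below tau to the component set.

    Returns (selected, new_negative).
    """
    selected = [x for x in positive if posterior(x) < tau]
    return selected, list(negative) + selected


def toy_posterior(x):
    # Hypothetical stand-in for the trained binary MLP output f_eta(x).
    return 1.0 / (1.0 + math.exp(-x))


positive = [-2.0, 0.5, 3.0]   # samples drawn from the mixture F
negative = [-3.0, -2.5]       # samples drawn from the component H
selected, new_negative = regroup(positive, negative, toy_posterior, tau=0.5)
```

Only the sample at -2.0 scores below the threshold here, so it is the one appended to the negative set.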

Step3: Probability density estimation based on kernel density estimation

On the basis of the component regenerated in Step2, the probability density function of the sample distribution is estimated by kernel density estimation. Specifically, a kernel function is established to represent the contribution of an existing sample to the probability density estimate at any point in the feature space. Let $x$ be a point in the feature space of dimension $d$ and $x_i$ a known sample, let $\mu$ be the sample mean, and let $\Sigma = \rho^2 Q$ be the covariance matrix of the samples. With a Gaussian kernel, the contribution of sample $x_i$ to the probability density at $x$ is given by the kernel function:

$$K(x, x_i) = \frac{1}{\sqrt{(2\pi)^d \lvert\Sigma\rvert}} \exp\left(-\frac{1}{2}(x-x_i)^{T}\Sigma^{-1}(x-x_i)\right)$$

The probability density estimator over the whole sample set is $\hat{p}(x) = \frac{1}{\lvert S\rvert}\sum_{x_i \in S} K(x, x_i)$, where $S$ is the sample set. Based on the reconstructed positive and negative sample sets obtained in Step2, denoted $S_P'$ and $S_N'$, the probability density functions of the reconstructed positive and negative samples are estimated as follows:

$$\hat{f}(x) = \frac{1}{\lvert S_P'\rvert}\sum_{x_i \in S_P'} K(x, x_i), \qquad \hat{h}(x) = \frac{1}{\lvert S_N'\rvert}\sum_{x_i \in S_N'} K(x, x_i)$$
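A one-dimensional sketch of the Gaussian kernel density estimator, together with a plug-in value for the maximal mixing proportion, is given below. It rests on several assumptions: the feature is scalar, the bandwidth is fixed by hand, and the maximal proportion is approximated as the smallest density ratio on a finite grid, which is one common way to evaluate such an estimator and is not necessarily the procedure used in the patent. All sample values are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density (1-D Gaussian kernel)."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        total = sum(norm * math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                    for xi in samples)
        return total / len(samples)

    return density


def max_mixing_proportion(f_hat, h_hat, grid, eps=1e-12):
    """Largest k (capped at 1) with f_hat(x) - k * h_hat(x) >= 0 on the grid."""
    ratios = [f_hat(x) / h_hat(x) for x in grid if h_hat(x) > eps]
    return min(1.0, min(ratios))


# Hypothetical regrouped samples of the mixture and of the component.
mixture_samples = [-0.4, 0.1, 0.3, 5.6, 5.9, 6.2, 6.5]
component_samples = [-0.5, -0.1, 0.2, 0.4]
f_hat = gaussian_kde(mixture_samples, bandwidth=0.5)
h_hat = gaussian_kde(component_samples, bandwidth=0.5)
grid = [i * 0.1 for i in range(-30, 91)]
k_hat = max_mixing_proportion(f_hat, h_hat, grid)
```

Grid points where the component density is numerically zero are skipped, since the ratio is undefined there; the grid therefore has to cover the region where the component has mass.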

Step4: Conditional transition matrix estimation

The c mixture proportion estimation problems constructed in Step1 are solved in turn; each of them requires solving the corresponding c-1 mixing proportion coefficients. For one coefficient, let the noisy label of the mixture be $\tilde{Y}=i$ and the noisy label of the component be $\tilde{Y}=j$. The original sample sets with these noisy labels are taken as the positive and negative sample sets $S_P$ and $S_N$ of the mixture proportion estimation problem; the method of Step2 generates the new positive and negative sample sets $S_P'$ and $S_N'$; and the kernel density estimation of Step3 gives the density estimates $\hat{f}$ and $\hat{h}$. The maximal estimator of the mixing proportion coefficient is then computed with the maximal estimation method for the mixture proportion problem described in Step1:

$$\hat{k} = \max\left\{k \in [0,1] : \hat{f} = k\hat{h} + (1-k)G\right\}$$

where $G$ is a valid probability density function. The estimator $\hat{k}$ is the estimate of the element $H_{ij}$ ($i \neq j$). All elements of the matrix $H$ are solved by repeating Step2, Step3 and Step4 in a loop; the matrix $G$ is then obtained from the following property, which yields the conditional transfer matrix $Q^{T}$:

$$(I-H)^{-1}G = Q^{T}$$
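The final assembly of the matrix from an estimated H can be sketched in plain Python. Two points are assumptions of this sketch rather than statements of the patent: the diagonal of G is filled in as one minus the corresponding row sum of H, which is what makes every row of the result sum to one, and the H values are made up for illustration.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transition_from_h(h):
    """Compute (I - H)^-1 G from an estimated H with zero diagonal."""
    c = len(h)
    # Assumption: G_ii = 1 - sum_j H_ij, so each row of the result sums to 1.
    g = [[(1.0 - sum(h[i])) if i == j else 0.0 for j in range(c)] for i in range(c)]
    i_minus_h = [[(1.0 if i == j else 0.0) - h[i][j] for j in range(c)]
                 for i in range(c)]
    return matmul(invert(i_minus_h), g)


# Hypothetical estimated mixing coefficients for c = 2 classes.
h_est = [[0.0, 0.2], [0.1, 0.0]]
q_t = transition_from_h(h_est)
```

For larger c the same routine applies unchanged; only the c(c-1) off-diagonal estimates of H are needed as input.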

The invention is further improved in that, in step 4), the training device network parameter learning and taxpayer industry classification proceed as follows:

Step1: Training device learning based on label-noise data

Let the network parameter of the training device be $\eta$, the noisy label of a sample be $\tilde{Y}$, and the network parameter set be w. With the label-noise data as supervision, the network parameters of the training device are learned. The output of sample $X$ under the mapping of the training device is denoted $g_\eta(X)$. The cross-entropy loss between $g_\eta(X)$ and $\tilde{Y}$ is computed and a regularization term is added to prevent overfitting, where $\lambda$ is the regularization coefficient; the optimization objective is to minimize, with respect to w, the cross-entropy loss plus $\lambda$ times the regularization term.

Under this optimization objective, the training device network predicts the noisy label $\tilde{Y}$ of the input sample. The output $g_\eta(X)$ is passed through a softmax layer, which exponentiates and normalizes the raw output so that it represents predicted posterior probabilities. Specifically, if the raw network output is $z = (z_1, \dots, z_C)$, softmax exponentiates and normalizes the output vector and outputs it in the following form: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$, $i = 1, \dots, C$;
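The softmax output and the regularized loss for a single sample can be sketched as follows. The squared L2 norm used as the regularization term is an assumption of this sketch, since the text above only says that a regularization term is added; the logits and weights are hypothetical.

```python
import math


def softmax(logits):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_cross_entropy(logits, noisy_label, weights, lam):
    """Cross-entropy against the noisy label plus lam times a penalty."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[noisy_label])
    penalty = sum(w * w for w in weights)  # assumption: squared L2 norm
    return cross_entropy + lam * penalty


# Hypothetical raw outputs for C = 3 classes and a flat list of weights.
logits = [2.0, 0.5, -1.0]
loss = regularized_cross_entropy(logits, noisy_label=0, weights=[0.3, -0.2], lam=0.01)
```

Minimizing this quantity over the training set fits the network to the noisy labels, which is why a separate correction layer is needed afterwards.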

Step2: Construction of the conditional transfer matrix layer

After the network parameters of the training device have been learned, the network output $g_\eta(X)$ is passed through the softmax operation to give the posterior probability $P(\tilde{Y} \mid X)$ of the sample's noisy label. A conditional transfer layer is added after the softmax layer as a conversion layer, converting noisy-label prediction into real-label prediction;

Step3: Taxpayer industry classification

With the conditional transfer layer in place, for a newly input sample $X$ the output of the network is $q(X)$. Computing $r = \arg\max_i q^i(X)$ gives the index r of the largest component of $q(X)$, which is the industry class of the taxpayer.

The invention is further improved in that the specific method in Step2 of step 4) is as follows: let the noisy label be $\tilde{Y}$, the real sample label be $Y$, and the total number of classes be $C$. Assuming that the sample feature $X$ and the noisy label $\tilde{Y}$ are independent of each other, for any class $c \in \{1, \dots, C\}$:

$$P(Y=c \mid X) = \sum_{i=1}^{C} P(Y=c \mid \tilde{Y}=i)\, P(\tilde{Y}=i \mid X)$$

The original network output $g_\eta(X)$ is transformed by the conditional transfer matrix $Q^{T}$ into a new output $q(X)$ satisfying $q(X) = Q^{T} g_\eta(X)$, where the new output $q(X) = (q^1(X), \dots, q^C(X))$ is the posterior probability of the real label, and $q^i(X)$ ($i = 1, 2, \dots, C$) is the i-th component of $q(X)$, representing the predicted probability $P(Y=i \mid X)$ that the real label of $X$ is class i.
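The conversion from noisy-label posterior to real-label posterior, followed by the arg-max decision, can be sketched as below. To keep the indexing unambiguous, the sketch stores the matrix with rows indexed by the noisy label, so that `transition[j][i]` stands for P(Y=i | noisy label = j); this storage layout and all numbers are assumptions made for illustration.

```python
def true_label_posterior(transition, noisy_posterior):
    """q_i = sum_j P(Y=i | noisy=j) * P(noisy=j | X)."""
    c = len(noisy_posterior)
    return [sum(transition[j][i] * noisy_posterior[j] for j in range(c))
            for i in range(c)]


def classify(transition, noisy_posterior):
    """Index of the largest component of the real-label posterior."""
    q = true_label_posterior(transition, noisy_posterior)
    return max(range(len(q)), key=lambda i: q[i])


# Hypothetical 2-class example: each row sums to 1.
transition = [[0.9, 0.1],
              [0.2, 0.8]]
noisy_posterior = [0.3, 0.7]   # softmax output of the trained network
q = true_label_posterior(transition, noisy_posterior)
predicted = classify(transition, noisy_posterior)
```

Because the layer is a fixed linear map applied after softmax, it adds no trainable parameters and leaves the trained network untouched.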

The invention has at least the following beneficial technical effects:

The invention provides a label noise learning method for taxpayer industry classification. Compared with the prior art, it has the following advantages:

(1) The invention converts the conditional transfer matrix estimation problem in label noise learning into a mixture proportion estimation problem, and constructs a risk-consistent classifier from label-noise data by solving that problem. Unlike the prior-art scheme that relies on semantic clustering, the method does not depend on an additional clustering method, and so avoids the new errors caused by the limited performance of clustering.

(2) The invention extends the traditional mixture proportion estimation method from binary classification to multi-class scenarios. Whereas the traditional method is limited to two classes, the improved mixture proportion estimation method applies to the multi-class case and therefore has a wider range of applications.

(3) The invention removes the dependence of the traditional mixture proportion estimation method on anchor points. Whereas the traditional method requires anchor point labeling, the invention constructs a new mixture proportion estimation problem based on regenerating the component, and estimates the mixing proportion coefficients directly without relying on anchor point labeling.

Drawings

Fig. 1 is a flow chart of an overall framework.

Fig. 2 is a flow chart of tax payer industry information processing.

FIG. 3 is a flow chart of the tax payer industry classification network construction and training device initialization.

Fig. 4 is a conditional transition matrix estimation flow chart.

FIG. 5 is a flow chart for training the device to learn network parameters and classify tax payers.

Fig. 6 is a schematic diagram of a tag noise learning network.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.

As shown in Fig. 1, in a specific implementation, the label noise learning method for taxpayer industry classification of the invention includes the following steps:

Step 1, taxpayer industry information processing

As shown in Fig. 2, the text information and non-text information of the taxpayer are extracted separately and processed, specifically as follows:

S101, taxpayer industry text information preprocessing

Illegal characters such as special symbols, numbers and measure words are deleted from the taxpayer text information (S101 in Fig. 2). In this embodiment, 3 text features are extracted as the taxpayer's text information: {taxpayer name, registration address, business scope}. For a taxpayer name that contains a special symbol, the special symbol is deleted first (S101 in Fig. 2), and the name is then split character by character into a token sequence.
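A minimal sketch of this cleaning step using Python's standard `re` module is shown below. The sample name is hypothetical, and the sketch only removes symbols and digits; filtering measure words would additionally need a word list, which is not given in the text.

```python
import re


def preprocess(text):
    """Drop symbols, digits and underscores, then split into characters."""
    # For str patterns, \W and \d are Unicode-aware, so Chinese characters are kept.
    cleaned = re.sub(r"[\W\d_]+", "", text)
    return list(cleaned)


# Hypothetical taxpayer name containing a special symbol and digits.
tokens = preprocess("西安@鑫陶科技有限公司2022")
```

The resulting single-character tokens are what would then be passed to the embedding step.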

S102, text word embedding based on the XLNet pre-training network

Word embedding is performed on the text with the XLNet text pre-training network to form word vectors (S102 in Fig. 2). In this embodiment, with encoding length t, the XLNet text pre-training network embeds each original token into a word vector of length t. If the original text sequence has length 13, the XLNet pre-training network maps the text to a 13×t text feature; specifically, with t=528 in this embodiment, a 13×528 text feature is obtained (S102 in Fig. 2).

S103, taxpayer industry text feature generation

S102 is repeated with the XLNet text pre-training network to embed all text feature sequences, and the resulting word vectors are concatenated to form the taxpayer text features (S103 in Fig. 2).

Specifically, in this embodiment, the taxpayer industry text features comprise 3 items: {taxpayer name, registration address, business scope}. The 3 text features are mapped to text features of shape 13×528, 7×528 and 10×528, respectively, and concatenated to obtain the overall taxpayer text feature of shape 30×528 (S103 in Fig. 2).

S104, tax payer industry numerical value characteristic processing

And extracting the numerical characteristics of the tax payer industry, wherein the numerical characteristics comprise 4 numerical characteristics of { registered funds, investment sum, asset sum, interest liabilities }, and carrying out standardized operation.

Specifically, in this embodiment, first, the sample mean μ ₁,μ₂,...,μ₄ and the sample variance σ ₁,σ₂,...,σ₄ of the 4-column features are calculated, and then X _i is recorded as the value of the ith numerical feature of the sample X, and the z-score formula is passedThe normalization process is performed on the numerical features (s 104 of fig. 2).

S105, taxpayer industry category feature processing

The category information is encoded with the one-hot encoding technique. In this embodiment, 2 category features are selected for encoding: {unit property, accounting means}. The unit property takes five values: enterprise, non-governmental non-enterprise unit, public institution, social group, and other; the corresponding one-hot codes are {10000, 01000, 00100, 00010, 00001}. One-hot encoding is performed on all the category feature information (S105 of FIG. 2).
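The one-hot codes listed above can be reproduced with a short sketch. The helper `one_hot` and the constant `UNIT_PROPERTIES` are illustrative names, not part of the original method; the category order follows the order given in the text.

```python
# Category values of the "unit property" feature, in the order given in the text.
UNIT_PROPERTIES = [
    "enterprise",
    "non-governmental non-enterprise unit",
    "public institution",
    "social group",
    "other",
]


def one_hot(value: str, categories: list) -> list:
    """Return a list with a single 1 at the position of `value` in `categories`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


# Render each code as a bit string to compare with the codes in the text.
codes = {name: "".join(str(b) for b in one_hot(name, UNIT_PROPERTIES))
         for name in UNIT_PROPERTIES}
print(codes["enterprise"])  # 10000
```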

S106, non-text feature generation in taxpayer industry

A linear network mapping layer is constructed to map the obtained numerical features and category features to the same dimension as the text features, and the mapped features are then spliced to form the non-text features of the taxpayer industry.

Specifically, in an embodiment, linear network mapping layers with shapes 1×528 and 5×528 are established to map the numerical features and the category features, respectively, to the same dimension as the text features; the results are then spliced to form the non-text feature matrix (S106 of FIG. 2).
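The shape bookkeeping of S106 can be sketched as follows. The weights are random placeholders rather than trained parameters, bias terms are omitted, and padding every one-hot code to length 5 is an assumption based on the 5×528 layer shape.

```python
import numpy as np

T = 528      # text feature width t
N_MAX = 5    # longest one-hot code length (unit property has 5 values)

rng = np.random.default_rng(0)
W_num = rng.standard_normal((1, T)) * 0.01      # 1 x 528 layer for numerical features
W_cat = rng.standard_normal((N_MAX, T)) * 0.01  # 5 x 528 layer for category features

# 4 standardized numerical features: each scalar becomes one row of width 528.
numeric = np.array([0.3, -1.2, 0.8, 0.1])
numeric_mapped = numeric.reshape(-1, 1) @ W_num          # (4, 1) @ (1, 528) -> (4, 528)

# 2 category features as one-hot rows, padded to length N_MAX (assumption).
categories = np.array([[1, 0, 0, 0, 0],
                       [0, 1, 0, 0, 0]], dtype=float)
category_mapped = categories @ W_cat                     # (2, 5) @ (5, 528) -> (2, 528)

# Splice to obtain the non-text feature matrix.
non_text = np.concatenate([numeric_mapped, category_mapped], axis=0)
print(non_text.shape)  # (6, 528)
```

With 4 numerical rows and 2 category rows, the non-text feature matrix has 6 rows, which matches the 36×528 total when joined with the 30×528 text feature.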

S107, generating taxpayer characteristic information

The taxpayer text features obtained in S103 and the taxpayer non-text features obtained in S106 are spliced to form the final taxpayer industry feature information.

In an embodiment, the text feature with a shape of 30×528 and the non-text feature with a shape of 6×528 are spliced to form the final taxpayer industry feature information, which has a shape of 36×528 (S107 of FIG. 2).

Step 2, constructing the taxpayer industry classification network and initializing the training device

As shown in FIG. 3, a TextCNN network is established for taxpayer industry classification, and the shapes of the TextCNN convolution kernels and the input and output dimensions are determined in turn according to the generated taxpayer industry features and the total number of target classes. The XLNet text pre-training network and the TextCNN network are connected in series to form a training device, and the training device is trained end to end on the label noise data to initialize its network parameters.

S201, construction of taxpayer industry classification network

A TextCNN network is constructed for taxpayer industry classification. The TextCNN network comprises three layers: a convolution layer, a pooling layer and a fully connected layer.

Specifically, in this embodiment, convolution kernels are established according to the taxpayer text features to extract the row features of the feature map; kernels of shape n×528 are used, where n ∈ {2, 3, 4, 5, 6}. A max pooling layer is then established to further compress and extract the convolved features. Finally, a fully connected layer is established: assuming that the feature map output by the pooling layer has n₁ features in total and that the total number of classes is c, the fully connected layer has shape n₁×c, with c=97 in this embodiment (S201 of FIG. 3).
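The three-layer structure of S201 can be traced with a small NumPy forward pass. This is a sketch under stated assumptions: the number of filters per kernel size (`K = 8`), the ReLU activation, the softmax output and the random weights are all illustrative choices that the text does not specify; only the kernel shapes n×528, the max pooling and the n₁×97 fully connected layer come from the embodiment.

```python
import numpy as np


def textcnn_forward(x: np.ndarray, conv_weights: dict, fc_weight: np.ndarray) -> np.ndarray:
    """x: (rows, t) feature matrix; conv_weights[n]: (k, n, t); fc_weight: (n1, c)."""
    pooled = []
    for n, w in conv_weights.items():
        # All windows of n consecutive rows: (rows - n + 1, n, t).
        windows = np.stack([x[i:i + n] for i in range(x.shape[0] - n + 1)])
        # One feature map per filter, with ReLU (assumed): (rows - n + 1, k).
        maps = np.maximum(np.einsum("lnt,knt->lk", windows, w), 0.0)
        # Max pooling over window positions: (k,).
        pooled.append(maps.max(axis=0))
    features = np.concatenate(pooled)        # n1 = k * number of kernel sizes
    logits = features @ fc_weight            # fully connected layer, n1 x c
    exp = np.exp(logits - logits.max())      # softmax (assumed output activation)
    return exp / exp.sum()


rng = np.random.default_rng(0)
K, T, C = 8, 528, 97                          # K is an illustrative filter count
x = rng.standard_normal((36, T))              # 36 x 528 taxpayer feature matrix
conv_weights = {n: rng.standard_normal((K, n, T)) * 0.01 for n in (2, 3, 4, 5, 6)}
n1 = K * len(conv_weights)
fc_weight = rng.standard_normal((n1, C)) * 0.01
probs = textcnn_forward(x, conv_weights, fc_weight)
print(probs.shape)  # (97,)
```

Because each kernel spans the full width of 528, the convolution slides only along the row axis, which is why the text describes it as extracting row features.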

S202, initializing the training device

The XLNet text pre-training network of step 1 and the constructed TextCNN network are connected in series to form a training device. End-to-end training is performed on the label noise data to initialize the network parameters of the training device.

In an embodiment, the taxpayer industry label noise data are used as input and the noisy labels are predicted; end-to-end training is performed to initialize the network parameters (S202 of FIG. 3). Let the network parameter be α, the noisy sample be (X, Ỹ), and the network parameter set be w. The cross-entropy loss is computed between the output of sample X under the mapping of the training device and the noisy label Ỹ, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient. The loss function is minimized, and the optimization objective is as follows:
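The structure of such an objective, cross-entropy against the noisy labels plus a weighted regularization term, can be sketched as below. The function name `regularized_cross_entropy` is hypothetical, and the squared L2 penalty is an assumption: the text states only that a regularization term with coefficient λ is added, not its form.

```python
import numpy as np


def regularized_cross_entropy(probs, noisy_labels, weights, lam):
    """Mean cross-entropy against noisy labels plus lam * squared L2 norm of the weights.

    probs: (batch, classes) predicted probabilities.
    noisy_labels: (batch,) integer noisy labels.
    weights: list of parameter arrays (the parameter set w).
    """
    batch = probs.shape[0]
    picked = probs[np.arange(batch), noisy_labels]       # probability of each noisy label
    cross_entropy = -np.mean(np.log(picked + 1e-12))      # small constant avoids log(0)
    penalty = sum(float(np.sum(w ** 2)) for w in weights)  # L2 penalty (assumed form)
    return cross_entropy + lam * penalty


probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
noisy_labels = np.array([0, 1])
weights = [np.ones((2, 2))]
loss = regularized_cross_entropy(probs, noisy_labels, weights, lam=0.01)
```

In an actual training loop this scalar would be minimized with a gradient-based optimizer; the sketch only evaluates it.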

Step 3, solving the conditional transfer matrix

As shown in FIG. 4, a mixing proportion estimation problem is first constructed, so that the original conditional transfer matrix estimation problem is converted into a mixing proportion estimation problem. Next, a new mixing proportion estimation problem is constructed by regenerating the sample sets, the probability densities are estimated by kernel density estimation, and the mixing proportion coefficients are then solved to estimate the conditional transfer matrix. The specific steps are as follows:

S301, construction of the mixing proportion estimation problem

In this embodiment, let the noisy label in the taxpayer registration information be Ỹ, the sample be X, and the true label of the sample be Y. If the sample X and the noisy label Ỹ are independent of each other, the following relationship holds:

Meanwhile, the above relationship may be converted into the following form:

It follows that the above c equations are equivalent to c standard mixing proportion estimation problems. In this embodiment, the total number of classes is c=97. If the matrix H and the matrix G can be found, the original equations can be solved and the overall conditional transfer matrix obtained, so that the original conditional transfer matrix estimation problem is converted into a mixing proportion estimation problem (S301 of FIG. 4).

S302, sample set regeneration

In an embodiment, for a pair of noisy label classes i and j, the samples of class i and class j are taken as the positive and negative sample sets, respectively. A binary classification network is designed for prediction; its output is denoted f_η(X), where X is the dimension-reduced input sample feature and η is the parameter of the network. The network is trained in a supervised manner on the positive and negative samples. After training is completed, the network is used to predict the posterior probability of each sample of the positive class. A threshold τ is selected, and the samples of the positive class whose predicted output is smaller than τ are collected into a subset. This subset is copied into the negative sample set, giving a reconstructed positive sample set and negative sample set and thereby completing the regeneration of the samples (S302 of FIG. 4).
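The thresholding and copying step of S302 can be sketched as follows, taking the posterior scores of the positive samples as given. The function name `regenerate_sets` and the toy data are hypothetical, and leaving the positive set unchanged after the copy is an assumption; the text says only that the low-score subset is copied into the negative set.

```python
import numpy as np


def regenerate_sets(positive, negative, positive_scores, tau):
    """Copy positive samples whose posterior score is below tau into the negative set.

    Returns (positive, new_negative, copied). The positive set is returned
    unchanged, which is an assumption of this sketch.
    """
    low = positive_scores < tau                     # positive samples scored below the threshold
    copied = positive[low]
    new_negative = np.concatenate([negative, copied], axis=0)
    return positive, new_negative, copied


# Toy data: 4 positive and 3 negative samples with 2 features each.
positive = np.array([[1.0, 1.0], [0.2, 0.1], [0.9, 1.1], [0.0, 0.3]])
negative = np.array([[0.1, 0.0], [0.0, 0.2], [0.3, 0.1]])
positive_scores = np.array([0.9, 0.3, 0.8, 0.1])    # stand-in for f_eta(X) on positives

pos_new, neg_new, copied = regenerate_sets(positive, negative, positive_scores, tau=0.5)
print(neg_new.shape)  # (5, 2)
```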

S303, probability density function estimation

For the new positive and negative sample sets obtained in S302, the probability density functions are estimated by kernel density estimation; the estimated functions (S303 of FIG. 4) are as follows:
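Kernel density estimation of the kind named in S303 can be illustrated with a one-dimensional sketch. The Gaussian kernel, the fixed bandwidth and the one-dimensional samples are assumptions made for illustration; the text does not state the kernel or bandwidth used.

```python
import numpy as np


def gaussian_kde_1d(samples, bandwidth):
    """Return a density estimate p(x) = mean_i N(x; sample_i, bandwidth^2)."""
    samples = np.asarray(samples, dtype=float)

    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        u = (x[:, None] - samples[None, :]) / bandwidth       # (points, samples)
        kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # standard normal kernel
        return kernel.sum(axis=1) / (samples.size * bandwidth)

    return density


# Toy one-dimensional features standing in for a regenerated sample set.
density = gaussian_kde_1d([-1.0, 0.0, 0.5, 2.0], bandwidth=0.5)

# Numerical check that the estimate integrates to about 1.
grid = np.linspace(-10.0, 10.0, 4001)
area = float(density(grid).sum() * (grid[1] - grid[0]))
```

The same estimator would be applied separately to the positive and the negative set, giving the two densities whose ratio the mixing proportion estimate relies on.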

S304, solving a conditional transfer matrix

A double loop structure is established in which the outer loop and the inner loop traverse the class indices i and j in turn, subject to i ≠ j. The processes of S302 and S303 are executed cyclically to determine the mixing proportion coefficient for each pair, which is recorded as H_ij; the matrix G is then obtained according to the following relation:

Based on the matrix H and the matrix G, the following relationship holds: (I − H)^(-1) G = Q^T, from which the conditional transfer matrix Q^T is obtained (S304 of FIG. 4).
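The final solve in S304 can be sketched numerically. The matrices below are made-up 3×3 values chosen only so that I − H is invertible; in particular, the diagonal G is an arbitrary illustration and not derived from the relation for G referred to in the text. The function name `solve_transfer_matrix` is hypothetical.

```python
import numpy as np


def solve_transfer_matrix(H: np.ndarray, G: np.ndarray) -> np.ndarray:
    """Return Q^T satisfying (I - H) Q^T = G, i.e. Q^T = (I - H)^(-1) G."""
    identity = np.eye(H.shape[0])
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(identity - H, G)


# Toy mixing proportion coefficients H_ij (zero diagonal, since i != j).
H = np.array([[0.00, 0.10, 0.05],
              [0.20, 0.00, 0.10],
              [0.05, 0.15, 0.00]])
G = np.diag([0.85, 0.70, 0.80])   # arbitrary illustrative values

Q_T = solve_transfer_matrix(H, G)
```

For the 97-class embodiment the same call applies to 97×97 matrices; the loop over all ordered pairs (i, j) with i ≠ j fills H one entry at a time before this solve.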

Step 4, learning the network parameters of the training device and classifying taxpayer industries

As shown in FIG. 5, the training device is trained on the label noise data to learn its network parameters, and a conditional transfer layer is added after the training device to complete taxpayer industry classification. The specific steps are as follows:

S401, training device learning based on label noise data

In this embodiment, the input of the training device is a noisy data sample (X, Ỹ), where X is the 36×528 input feature matrix, which the network maps to the 97-dimensional output vector g_η(X). The cross-entropy loss is computed between the noisy label Ỹ and the network output g_η(X), and the network parameters are trained according to this loss function; the trained network parameters are denoted η (S401 of FIG. 5).

S402, constructing a conditional transfer matrix layer

A conditional transfer matrix layer is added after the training device to predict new samples.

Specifically, in this embodiment, the calculated 97×97 conditional transfer matrix Q^T is used as the conditional transfer layer. The original output g_η(X) is converted to q(X), i.e., q(X) = Q^T g_η(X), where q(X) represents the prediction of the true label of sample X, and qⁱ(X), the i-th component of q(X), represents the probability that sample X belongs to class i (S402 of FIG. 5).
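The effect of the conditional transfer layer can be seen on a toy 3-class example. The matrix Q below is made up, and its rows are chosen to sum to 1 so that the corrected output remains a probability vector; that property is an assumption of this sketch, not a statement from the text.

```python
import numpy as np


def apply_transfer_layer(Q: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Convert the noisy-label prediction g into q = Q^T g."""
    return Q.T @ g


# Toy transfer matrix (rows sum to 1 by assumption) and noisy-label prediction.
Q = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
g = np.array([0.5, 0.4, 0.1])

q = apply_transfer_layer(Q, g)
print(q)  # [0.53 0.36 0.11]
```

Because the layer is a fixed linear map applied after the trained network, it adds no trainable parameters at this stage.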

S403, taxpayer industry classification

As shown in FIG. 6, the text information and non-text feature information of the taxpayer are extracted respectively, the taxpayer industry features are extracted by the feature extraction module, the conditional transfer matrix is estimated based on the extracted features and used as the final conditional transfer layer of the training device, and taxpayer industry classification is performed by the training device. Specifically, in this embodiment, let the taxpayer feature information be X; the output of the training device is q(X), the prediction of the true label of sample X, and qⁱ(X) (i = 1, 2, ..., 97) is the i-th component of q(X). The index of the largest component is selected as the taxpayer industry class (S403 of FIG. 5).
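Selecting the class in S403 amounts to an argmax over the components of q(X). The helper `predict_industry` is a hypothetical name, and returning a 1-based index follows the numbering i = 1, 2, ..., 97 used in the text.

```python
def predict_industry(q) -> int:
    """Return the 1-based index of the largest component of q."""
    best = max(range(len(q)), key=lambda i: q[i])
    return best + 1  # classes are numbered from 1 in the text


# Toy corrected prediction over 3 classes.
print(predict_industry([0.1, 0.7, 0.2]))  # 2
```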

Those skilled in the art will readily appreciate that the foregoing is merely illustrative of the present invention and is not intended to limit it; any modifications, equivalents, improvements or the like that fall within the spirit and principles of the present invention are intended to be included within its scope.

Claims

1. A label noise learning method for taxpayer industry classification is characterized by comprising the following steps:

Firstly, extracting text information and non-text information in taxpayer industry information, and respectively performing text embedding and non-text coding processing based on an XLNet text pre-training network and coding technology to obtain characteristic information; secondly, constructing a TextCNN network for taxpayer industry classification, determining the number of layers, convolution kernel shape and input and output dimensions of each layer of the network according to the characteristic information and the target classification number, connecting the XLNet text pre-training network and the TextCNN network in series, and constructing an end-to-end training device with the noisy taxpayer industry label data as supervision; thirdly, estimating a conditional transfer matrix based on an improved mixing proportion estimation method; finally, learning the network parameters in the training device and using the conditional transfer matrix as a linear layer after the TextCNN network, so that conversion from noisy label prediction to real taxpayer industry label prediction is realized and taxpayer industry classification is carried out;

the method specifically comprises the following steps:

1) Taxpayer industry information processing

3) Conditional transfer matrix estimation

Based on a kernel density estimation method, estimating a probability density function according to the noisy taxpayer industry information data, converting the conditional transfer matrix estimation problem into a mixing proportion estimation problem, solving the corresponding mixing proportion coefficients based on an improved mixing proportion estimation method, and further obtaining the conditional transfer matrix; conditional transfer matrix estimation: converting the conditional transfer matrix estimation problem in the label noise learning problem into a mixing proportion estimation problem, and solving the mixing proportion coefficients based on the improved mixing proportion estimation method to obtain the conditional transfer matrix; specific implementation details are as follows:

Step1: mixing ratio estimation problem construction

(I − H)^(-1) G = Q^T

the equation is further expressed as follows:

Wherein the following are satisfied:

Step2: Sample set regeneration

Step3: Probability density estimation based on kernel density estimation

Step4: Conditional transfer matrix estimation

(I − H)^(-1) G = Q^T

2. The method for learning label noise for taxpayer industry classification according to claim 1, wherein in step 1), taxpayer industry information processing specifically comprises the following steps:

Step1: taxpayer industry text information preprocessing

Step2: text word embedding based on XLNet pre-training network

Step3: Taxpayer industry text feature generation

Step4: Taxpayer industry numerical feature processing

Step5: Taxpayer industry category feature processing

Step6: Non-text feature generation for taxpayer industry

After Step4 and Step5, m standardized numerical features and a category feature matrix of shape v×N_max are obtained, where N_max denotes the longest category code length; two linear network layers are then established for feature mapping, wherein the first linear network layer has a network shape of 1×t and converts the standardized numerical features into an m×t numerical feature matrix, and the second linear network layer has a network shape of N_max×t and maps the category features into a v×t category feature matrix; the two mapped feature matrices are spliced to obtain the final non-text feature matrix of shape (v+m)×t;

Step7: taxpayer characteristic information generation

3. The method for learning label noise for taxpayer industry classification according to claim 2, wherein in step 2), the taxpayer industry classification network is constructed and the training device is initialized: a TextCNN network is built for text classification, the TextCNN network comprising three layers: (1) a convolution layer, (2) a max pooling layer and (3) a fully connected layer; the XLNet pre-training network of step 1) is connected in series with the TextCNN network to construct a training device, which is trained end to end with the taxpayer label noise data as supervision; specific implementation details are as follows:

Step1: Taxpayer industry classification network construction

Step2: training device initialization

4. The method for learning label noise for taxpayer industry classification according to claim 3, wherein in Step2 of step 2), the network parameter is set to α, the sample to X, the noisy label to Ỹ, and the network parameter set to w; the cross-entropy loss is computed between the output of sample X under the mapping of the training device and the noisy label Ỹ, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized, and the optimization objective is as follows:

5. The method for learning label noise for taxpayer industry classification according to claim 4, wherein in step 4), the training device network parameter learning and taxpayer industry classification are as follows:

Step1: Training device learning based on label noise data

Step2: Construction of the conditional transfer matrix layer

Step3: Taxpayer industry classification

6. The method for learning label noise for taxpayer industry classification according to claim 5, wherein in Step2 of step 4), the specific method is as follows: let the noisy label be Ỹ, the true sample label be Y, and the total number of classes be C; assuming that the sample feature X and the noisy label Ỹ are independent of each other, for any category the following holds:

The original network output g_η(X) is converted by the conditional transfer matrix Q^T into a new output q(X) satisfying q(X) = Q^T g_η(X), where the new output q(X) is the posterior probability of the true label, and qⁱ(X) (i = 1, 2, ..., C) is the i-th component of q(X), representing the predicted probability P(Y=i|X) that the true label of X is class i.