CN114817546A - Label noise learning method for taxpayer industry classification - Google Patents
Label noise learning method for taxpayer industry classification
- Publication number
- CN114817546A (application CN202210498954.4A)
- Authority
- CN
- China
- Prior art keywords
- taxpayer
- network
- matrix
- text
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/10—Tax strategies
Abstract
The invention discloses a label noise learning method for taxpayer industry classification, which comprises the following steps: firstly, extracting text information and non-text information from the taxpayer business information, and performing text embedding and non-text encoding respectively based on an XLNet text pre-training network and encoding techniques to obtain feature information; secondly, constructing a TextCNN network for taxpayer industry classification, determining the number of network layers, the convolution kernel shapes and the input and output dimensions of each layer according to the feature information and the target number of classes, connecting the XLNet text pre-training network and the TextCNN network in series, and constructing an end-to-end training device with the noisy taxpayer industry label data as supervision; thirdly, estimating a conditional transition matrix based on an improved mixture proportion estimation method; and finally, learning the network parameters of the training device and using the conditional transition matrix as a linear layer behind the TextCNN network, so as to realize the conversion from noisy-label prediction to true taxpayer industry label prediction and carry out taxpayer industry classification.
Description
Technical Field
The invention belongs to the technical field of text classification with label noise, and particularly relates to a label noise learning method for taxpayer industry classification.
Background
In recent years, the market economy has continued to prosper, the number of enterprises keeps increasing, and the division of labor among enterprises has become increasingly refined. Along with this, upgrading and further building the tax system has become an urgent need.
Taxpayer industry classification is a precondition for determining a taxpayer's applicable policies and preferential treatment, and is an important link in tax collection and administration. At present, China divides taxpayer industries into 20 major categories and 97 sub-categories. Because of the large number of categories, the traditional manual classification method consumes considerable human resources and is limited by the professional knowledge and experience of the classifiers, so classification errors, that is, label noise in taxpayer industry classification, are inevitably introduced, causing a series of adverse effects on national statistics, tax collection and industrial and commercial administration.
In recent years, with the acceleration of the "intelligence +" era, the artificial intelligence industry has developed rapidly and been applied in many fields, providing possibilities for the exploration and development of smart taxation. Research on enterprise taxpayer industry classification is fundamental to classified tax-source administration and a key premise of intelligent tax informatization. Therefore, how to train a classifier on the existing label noise data by machine learning means so as to classify taxpayer industries correctly has become an urgent problem to be solved.
The invention concerns the taxpayer industry classification problem; related prior invention patents include the following:
document 1: tax payer industry two-level classification method (201910024324.1) based on MIMO recurrent neural network
Document 2: taxpayer industry classification method based on noise label learning (202110201214.5)
Document 1 designs a GRU-based multi-input multi-output neural network structure, establishes a mapping from industry major categories to industry sub-categories, and constructs a two-level classification structure to realize taxpayer industry classification. However, this method relies on strictly labeled data and lacks practical value in the presence of label noise.
Aiming at the shortcomings of the above technical solutions, the invention seeks to construct a risk-consistent classifier based on label noise data without relying on additional manual labeling, and to overcome the classification bias introduced by the semantic clustering methods adopted in the prior art, so that the classifier built from label noise data has, in a statistical sense, the same classification risk as a classifier built from truly labeled data.
The core of constructing a risk-consistent classifier from label noise data is as follows: a statistically consistent classifier is constructed by estimating a conditional transition matrix (a matrix composed of the conditional probabilities of the true labels given the noisy labels). The invention converts the conditional transition matrix estimation problem into a mixture proportion estimation problem and obtains an approximate conditional transition matrix by estimating the mixing proportion coefficients. However, the conventional mixture proportion estimation method is only suited to binary classification and depends on anchor points (samples that definitely belong to a certain class), whereas the taxpayer industry classification problem involves many industry categories and is a multi-class problem in which anchor points are difficult to label and obtain. Therefore, extending the mixture proportion estimation problem from the binary case to the multi-class case and overcoming the anchor point dependency are the main challenges addressed by the invention.
Disclosure of Invention
The invention aims to provide a label noise learning method for taxpayer industry classification, which is used for constructing a risk consistency classifier by estimating a conditional transition matrix (a matrix formed by the conditional probability of a real label under the condition of giving a noise label) based on label noise data.
The invention is realized by adopting the following technical scheme:
a label noise learning method for taxpayer industry classification comprises the following steps:
Firstly, extracting text information and non-text information from the taxpayer business information, and performing text embedding and non-text encoding respectively based on an XLNet text pre-training network and encoding techniques to obtain feature information; secondly, constructing a TextCNN network for taxpayer industry classification, determining the number of network layers, the convolution kernel shapes and the input and output dimensions of each layer according to the feature information and the target number of classes, connecting the XLNet text pre-training network and the TextCNN network in series, and constructing an end-to-end training device with the noisy taxpayer industry label data as supervision; thirdly, estimating a conditional transition matrix based on an improved mixture proportion estimation method; and finally, learning the network parameters of the training device and using the conditional transition matrix as a linear layer behind the TextCNN network, so as to realize the conversion from noisy-label prediction to true taxpayer industry label prediction and carry out taxpayer industry classification.
A further improvement of the invention is that the method comprises in particular the steps of:
1) taxpayer industry information processing
The taxpayer industry information processing comprises text information processing and non-text information processing, firstly, word segmentation and word embedding are carried out on taxpayer text information based on an XLNET text pre-training network to form corresponding word vectors, then text features are generated by splicing, secondly, numerical value features and category features in the taxpayer non-text information are preprocessed by respectively using a standardization process and a one-hot coding technology, then a linear network layer is established for feature mapping to generate non-text features consistent with text feature dimensions, and finally, the text features and the non-text features are spliced to form feature information;
2) taxpayer industry classification network construction and training device initialization
Constructing a TextCNN network for taxpayer industry classification, wherein the network comprises three layers, namely a convolution layer, a pooling layer and a fully connected layer; sequentially determining the number of layers of the TextCNN network, the convolution kernel shapes and the input and output dimensions of each layer based on the feature information and the target number of classes obtained in step 1); then connecting the XLNet pre-training network with the TextCNN network in series, and constructing an end-to-end training device with the noisy taxpayer industry labels as supervision;
3) conditional transition matrix estimation
Estimating a probability density function from the noisy taxpayer industry information data based on a kernel density estimation method, converting the conditional transition matrix estimation problem into a mixture proportion estimation problem, and solving the corresponding mixing proportion coefficients based on an improved mixture proportion estimation method to obtain the conditional transition matrix;
4) training device network parameter learning and taxpayer industry classification
And learning network parameters of the training device based on the label noise data, and after the training is finished, adding the estimated conditional transition matrix as a linear conversion layer to the training device to finish the conversion from the noise label prediction to the real label prediction, thereby realizing the tax payer industry classification.
The further improvement of the invention is that in the step 1), the taxpayer industry information processing specifically comprises the following steps:
step 1: taxpayer industry text information preprocessing
Extracting the taxpayer industry text information, deleting special symbols, numbers, quantifiers and other meaningless tokens from the text information, and completing the preprocessing of the taxpayer text information;
step 2: text word embedding based on XLNET pre-training network
The text is encoded based on the XLNet pre-training network to generate word vectors. The XLNet pre-training model is designed on the basis of the Transformer-XL architecture and captures bidirectional context, which avoids the inconsistency between the pre-training stage and the fine-tuning stage caused by the masking mechanism of the BERT model, and its two-stream self-attention mechanism makes pre-training more effective. The XLNet model applied to Chinese uses a 24-layer network structure and adopts SentencePiece for word segmentation; the text features obtained in Step 1 are encoded with the Chinese version of XLNet, thereby obtaining the word vectors;
step 3: taxpayer industry text feature generation
Assuming that the taxpayer has k text features in total and that the XLNet pre-training network maps each token into a t-dimensional word vector, let the i-th text feature contain $h_i$ tokens; the i-th text feature is then mapped to an $h_i \times t$ matrix. The feature matrices of all text feature mappings are concatenated, so that the text features of a sample are mapped into a $\left(\sum_{i=1}^{k} h_i\right) \times t$ matrix, generating the taxpayer text feature matrix;
step 4: tax payer industry value feature processing
Standardizing the numerical features among the taxpayer non-text features: assume there are n training samples and m numerical features in total, and denote the value of the j-th numerical feature of the i-th sample as $X_{ij}$. The mean of the j-th numerical feature is $\mu_j$, satisfying $\mu_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$, and the standard deviation of the j-th numerical feature is $\sigma_j$, satisfying $\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{ij}-\mu_j\right)^{2}}$; the normalized numerical feature is $\tilde{X}_{ij} = \frac{X_{ij}-\mu_j}{\sigma_j}$;
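A minimal sketch of this z-score standardization follows (NumPy is assumed; the sample values are hypothetical illustrations, not data from the invention):

```python
import numpy as np

def zscore_normalize(X):
    """Column-wise z-score normalization of an (n samples x m features) matrix."""
    mu = X.mean(axis=0)           # mean of each numerical feature
    sigma = X.std(axis=0)         # standard deviation of each numerical feature
    sigma[sigma == 0] = 1.0       # guard against constant columns
    return (X - mu) / sigma

# hypothetical example: 5 samples, 4 numerical features
X_num = np.array([[100.,  80., 300., 20.],
                  [200., 160., 500., 10.],
                  [150., 120., 400., 30.],
                  [120.,  90., 350., 25.],
                  [180., 140., 450., 15.]])
print(zscore_normalize(X_num))
```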
Step 5: tax payer industry category feature processing
Coding the category features among the taxpayer non-text features: if a category feature has N possible values, it is represented by an N-dimensional vector in which the position corresponding to the feature value is set to 1 and the remaining positions are set to 0, i.e. one-hot encoding is adopted. After all category features are encoded, they are padded to the longest code length among the category features, and all padded vectors are concatenated to form the category feature matrix;
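A small sketch of the one-hot encoding with padding to the longest code length follows (plain Python; the feature names and value sets are hypothetical):

```python
def one_hot_encode(features):
    """features: list of (value, possible_values) pairs for one sample.
    Each categorical value becomes a one-hot vector, padded with zeros to the
    longest code length; the padded vectors are stacked into a matrix."""
    codes = []
    for value, possible_values in features:
        vec = [0] * len(possible_values)
        vec[possible_values.index(value)] = 1
        codes.append(vec)
    n_max = max(len(c) for c in codes)                      # longest code length
    return [c + [0] * (n_max - len(c)) for c in codes]

# hypothetical categorical features: unit property (5 values), accounting method (3 values)
sample = [("enterprise", ["enterprise", "civil non-enterprise unit",
                          "public institution", "social group", "other"]),
          ("accrual", ["accrual", "cash", "other"])]
print(one_hot_encode(sample))    # 2 x 5 category feature matrix
```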
step 6: taxpayer industry non-text feature generation
After Step 4 and Step 5, m normalized numerical features and a category feature matrix of shape $v \times N_{max}$ are obtained respectively, where $N_{max}$ denotes the longest category code length. Two linear network layers are then established for feature mapping: the first linear network layer has shape $1 \times t$ and is used to convert the normalized numerical features into an $m \times t$ numerical feature matrix, and the second linear network layer has shape $N_{max} \times t$ and is used to map the category features into a $v \times t$ category feature matrix; the two mapped feature matrices are concatenated to obtain the final non-text feature matrix of shape $(v+m) \times t$;
step 7: taxpayer characteristic information generation
Splicing the text feature matrix generated in Step 3 and the non-text feature matrix generated in Step 6 to generate a matrix of shape $\left(\sum_{i=1}^{k} h_i + v + m\right) \times t$ as the final feature information.
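The linear feature mapping of Step 6 and the concatenation of Step 7 can be sketched as follows (PyTorch is assumed; the dimensions t = 528, m = 4, v = 2, N_max = 5 and the 30-row text matrix are illustrative values, not prescribed by the invention):

```python
import torch
import torch.nn as nn

t, m, v, n_max = 528, 4, 2, 5

num_proj = nn.Linear(1, t)        # 1 x t layer: each scalar feature -> t-dim vector
cat_proj = nn.Linear(n_max, t)    # N_max x t layer: each padded one-hot -> t-dim vector

x_num = torch.randn(m, 1)         # m normalized numerical features
x_cat = torch.randn(v, n_max)     # v x N_max category feature matrix
x_text = torch.randn(30, t)       # text feature matrix from the XLNet embeddings

x_nontext = torch.cat([num_proj(x_num), cat_proj(x_cat)], dim=0)   # (v + m) x t
features = torch.cat([x_text, x_nontext], dim=0)                   # (sum h_i + v + m) x t
print(features.shape)             # torch.Size([36, 528])
```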
The further improvement of the invention is that in the step 2), the taxpayer industry classification network construction and training device is initialized: establishing a TextCNN network for text classification, wherein the TextCNN network comprises three layers: (1) a convolutional layer, (2) a maximum pooling layer and (3) a full connection layer, wherein an XLNET pre-training network in the step 1) is connected with a TextCNN network in series to construct a training device, and end-to-end training is carried out by taking taxpayer label noise data as supervision; the specific implementation details are as follows:
step 1: taxpayer industry classification network construction
Constructing a TextCNN network for taxpayer industry classification, wherein the TextCNN network comprises three layers, namely a convolution layer, a pooling layer and a fully connected layer. Specifically, the convolution layer of the TextCNN uses convolution kernels of shape n × t to perform convolution operations that extract row-wise (n-gram) features, where n takes the values {2, 3, 4, 5, 6}; the TextCNN adopts a max pooling layer as its pooling layer to take the maximum of each convolved feature map and further compress and extract features; a fully connected layer is then established: assuming the total number of categories for taxpayer industry classification is c and the number of features after the max pooling layer is s, a fully connected layer of shape s × c is established to map the feature information into a c-dimensional vector, after which taxpayer industry classification is carried out;
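A minimal PyTorch sketch of such a TextCNN, with convolution kernels of heights {2, 3, 4, 5, 6} over an L × t feature matrix, max pooling and a fully connected layer to c classes, is given below; the channel count and batch size are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, t=528, num_classes=97, kernel_heights=(2, 3, 4, 5, 6), channels=64):
        super().__init__()
        # one n x t convolution per kernel height, extracting row-wise (n-gram) features
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, channels, kernel_size=(n, t)) for n in kernel_heights])
        self.fc = nn.Linear(channels * len(kernel_heights), num_classes)   # s x c layer

    def forward(self, x):                     # x: (batch, L, t) feature matrix
        x = x.unsqueeze(1)                    # (batch, 1, L, t)
        pooled = []
        for conv in self.convs:
            fmap = F.relu(conv(x)).squeeze(3)                    # (batch, channels, L-n+1)
            pooled.append(F.max_pool1d(fmap, fmap.size(2)).squeeze(2))   # max over positions
        return self.fc(torch.cat(pooled, dim=1))                 # c-dimensional output

logits = TextCNN()(torch.randn(8, 36, 528))   # 8 samples of 36 x 528 features
print(logits.shape)                            # torch.Size([8, 97])
```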
step 2: training device initialization
Connecting the XLNET text pre-training network in the step 1) with the constructed TextCNN network in series to form a training device; and (3) taking the label noise data of the taxpayer industry as input, predicting the noise label, forming an end-to-end device for training, and initializing the network parameters of the training device.
A further improvement of the invention is that, in Step 2 of step 2), the network parameters are denoted α, a sample is X and its noisy label is $\bar{Y}$; the set of network parameters is w, and the output of sample X is denoted $f_\alpha(X)$. A cross-entropy loss is computed between $f_\alpha(X)$ and $\bar{Y}$, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized with the following optimization objective: $\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\ell_{CE}\big(f_\alpha(X_i),\bar{Y}_i\big)+\lambda\lVert w\rVert_2^2$
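As an illustration of this objective, a minimal sketch follows (PyTorch is assumed; using an explicit L2 penalty over all parameters rather than optimizer weight decay is an implementation choice, not prescribed by the invention):

```python
import torch.nn.functional as F

def noisy_label_loss(model, logits, noisy_labels, lam=1e-4):
    """Cross-entropy against the noisy labels plus lambda * ||w||^2 over all parameters."""
    ce = F.cross_entropy(logits, noisy_labels)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return ce + lam * l2

# usage sketch (model is e.g. the XLNet + TextCNN training device):
# loss = noisy_label_loss(model, model(batch_features), batch_noisy_labels)
# loss.backward(); optimizer.step()
```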
a further improvement of the invention is that, in step 3), the conditional transition matrix estimation: converting a conditional transfer matrix estimation problem in the label noise learning problem into a mixing proportion estimation problem, and solving a mixing proportion coefficient based on an improved mixing proportion estimation method to further obtain a conditional transfer matrix; the specific implementation details are as follows:
step 1: hybrid ratio estimation problem construction
Assume that the noise label in the taxpayer registration information isThe true label of the sample is Y, assuming sample X and noise labelIndependently of each other, for any class C ∈ C there is:
note the bookP i =P(X|Y=i)、Where Q represents the conditional transition probability of a noisy tag to a true tag, the above equation is expressed in matrix form as follows:
further decomposing the matrix to obtain the following form; where H is a c x c matrix and satisfies that the diagonal element is 0, and G is a real diagonal matrix shaped as c x c;
according to the nature of matrix transformation, it can be seen that matrix H, matrix G, and matrix Q satisfy the following relationships:
(i-H) -1 G=Q T
here Q is T The matrix is a conditional transition matrix in label noise learning, and the above relationship indicates that if the matrix H is solved, the conditional transition matrix is further solved, and the decomposition of the matrix is equivalent to the following c equations:
the equation is further expressed in the form:
wherein the following are satisfied:
the standard mixing ratio estimation problem is expressed in the form: f ═ kH + (1-k) G (k ≧ 0), where fh G is the probability distribution function, and samples sampled at distribution F, H are assumed to be known, where F is the mixture and H, G is the composition; the equation obtained by the above matrix decomposition:it is the standard mixing ratio estimation problem, the mixing ratio coefficient H estimated by which is the mixing ratio estimation problem ij It is the elements of matrix H; therefore, by solving a series of mixed proportion estimation problems, the H matrix can be solved, and the conditional transition matrix Q is estimated according to the matrix relation T Therefore, a classifier with consistent risk is constructed based on the tag noise data, and taxpayer industry classification is carried out;
step 2: regeneration of compositions
The solution of the mixture proportion problem relies on labeled anchor points. Specifically, if anchor point samples exist and are known, the maximal estimator of the mixing proportion coefficient (the largest k for which F − kH can still be written as (1 − k) times a valid distribution G) is an unbiased estimate of the true mixing proportion coefficient k;
Specifically, first, the mixture F samples are labeled as the positive class (Y = 1) and the known component H samples are labeled as the negative class (Y = −1); an MLP network is constructed for binary prediction, and its output is denoted $f_\eta(X)$, where X is the sample feature and η are the network parameters. The MLP network is supervised-trained with the noisy positive and negative samples; after training, the network is used to predict the posterior probabilities of the positive-class samples. A threshold τ is selected; denote the positive-class sample set as $S_P$, the negative-class sample set as $S_N$, and the set of positive-class samples whose prediction is smaller than the selected threshold as $S_\tau = \{x \in S_P : f_\eta(x) < \tau\}$. The samples whose posterior probability is smaller than the threshold are merged into the negative sample set, yielding the reconstructed positive and negative sample sets $S'_P$ and $S'_N$, which satisfy $S'_P = S_P \setminus S_\tau$ and $S'_N = S_N \cup S_\tau$. The regeneration of the composition samples is thus completed, overcoming the dependence of the traditional mixture proportion estimation method on anchor points;
step 3: probability density estimation based on kernel density estimation
On the basis of the compositions reconstructed in Step 2, the probability density functions of the sample distributions are estimated with a kernel density estimation method. Specifically, a kernel function is established to represent the probability density contribution of the existing samples at any point in the feature space, where x is a point in the feature space and $x_i$ is a known sample; μ is the sample mean, and $\Sigma = \rho^{2} Q$, where Q is the sample covariance matrix and ρ is the bandwidth parameter. Using a Gaussian kernel, the contribution of sample $x_i$ to the probability density at x is represented by the kernel function of the following form: $K(x, x_i) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}}\exp\!\left(-\tfrac{1}{2}(x-x_i)^{T}\Sigma^{-1}(x-x_i)\right)$, where d is the feature dimension.
Then, over the entire sample set S, the probability density function estimator is $\hat{f}(x) = \frac{1}{\lvert S\rvert}\sum_{x_i \in S} K(x, x_i)$; from the positive and negative sample sets already obtained, the probability density functions of the reconstructed positive and negative samples are estimated as $\hat{f}_P(x) = \frac{1}{\lvert S'_P\rvert}\sum_{x_i \in S'_P} K(x, x_i)$ and $\hat{f}_N(x) = \frac{1}{\lvert S'_N\rvert}\sum_{x_i \in S'_N} K(x, x_i)$.
step 4: conditional branch matrix estimation
The c mixture proportion estimation problems constructed in Step 1 are solved in turn, and for any one of them the corresponding c − 1 mixing proportion coefficients are solved. Let the noisy label of the mixture be i and the noisy label of the component be j, and take the original sample sets of noisy classes i and j as the positive and negative sample sets of the mixture proportion estimation problem. New positive and negative sample sets $S'_P$ and $S'_N$ are generated with the regeneration method of Step 2, and probability density estimation is carried out according to the kernel density estimation method of Step 3, yielding $\hat{f}_P$ and $\hat{f}_N$ respectively. The maximal estimator of the mixing proportion coefficient is then computed as in the mixture proportion problem of Step 1, $\hat{H}_{ij} = \max\{k \in [0,1] : \hat{f}_P = k\,\hat{f}_N + (1-k)\,G \text{ for some valid probability density } G\}$, and the estimator $\hat{H}_{ij}$ is exactly the element $H_{ij}$ $(i \neq j)$. By cycling through Steps 2, 3 and 4, all elements of the matrix H can be solved, after which the matrix G and the conditional transition matrix $Q^{T}$ are solved according to the following property:
$(I-H)^{-1} G = Q^{T}$.
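Once all off-diagonal elements $H_{ij}$ have been estimated, the conditional transition matrix can be recovered from the relation above; a NumPy sketch follows, assuming (as integrating the decomposition over X implies) that each diagonal element of G equals one minus the corresponding row sum of H:

```python
import numpy as np

def transition_from_H(H):
    """Given the estimated c x c matrix H (zero diagonal), build the diagonal
    matrix G and return Q^T = (I - H)^{-1} G."""
    c = H.shape[0]
    G = np.diag(1.0 - H.sum(axis=1))          # G_cc = 1 - sum_{j != c} H_cj
    Qt = np.linalg.solve(np.eye(c) - H, G)    # (I - H)^{-1} G without an explicit inverse
    return Qt

H = np.array([[0.0, 0.1, 0.05],
              [0.2, 0.0, 0.10],
              [0.1, 0.3, 0.00]])
print(transition_from_H(H))
```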
The further improvement of the invention is that in the step 4), the training device network parameter learning and taxpayer industry classification comprises the following specific steps:
step 1: training device learning based on tag noise data
Assume the network parameters of the training device are η and a noisy sample is $(X, \bar{Y})$; the set of network parameters is w. The network parameters of the training device are learned with the label noise data as supervision. Denote the output of sample X under the mapping of the training device as $g_\eta(X)$. A cross-entropy loss is computed between $g_\eta(X)$ and $\bar{Y}$, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized with the following optimization objective: $\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\ell_{CE}\big(g_\eta(X_i),\bar{Y}_i\big)+\lambda\lVert w\rVert_2^2$
Under this optimization objective, the network of the training device is used to predict the noisy label of an input sample. The output $g_\eta(X)$ is passed through a softmax layer; the softmax operation performs exponential normalization on the raw output so that it is expressed as a predicted posterior probability. Specifically, assume the raw network output is $g_\eta(X) = (z_1, z_2, \dots, z_C)$; softmax exponentiates and normalizes the output vector, giving outputs of the form $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$, $i = 1,\dots,C$.
step 2: constructing a conditional transition matrix layer
After the network parameters of the training device have been learned, the network output $g_\eta(X)$ is passed through the softmax operation to output the posterior probability of the sample with respect to the noisy label. A conditional transition matrix layer is then added after the softmax layer as a transition layer, realizing the conversion from noisy-label prediction to true-label prediction;
step 3: taxpayer industry classification
On the basis of the constructed conditional transition layer, for a newly input sample X the output of the network is q(X); the index $r = \arg\max_i q_i(X)$ corresponding to the largest component of q(X) is obtained, which is the industry classification corresponding to the taxpayer.
A further improvement of the invention is that, in Step 2 of step 4), the specific method is as follows: let the noisy label be $\bar{Y}$, the true sample label be Y and the total number of classes be C, and assume the sample features X and the noisy label $\bar{Y}$ are conditionally independent given the true label; then for any class $i \in \{1,\dots,C\}$ there is: $P(Y=i\mid X)=\sum_{c=1}^{C} P(Y=i\mid\bar{Y}=c)\,P(\bar{Y}=c\mid X)$.
The original network output $g_\eta(X)$ is then converted by the conditional transition matrix $Q^{T}$ into a new output q(X) satisfying $q(X) = Q^{T} g_\eta(X)$, where the new output q(X) is the posterior probability of the true label; $q_i(X)$ $(i = 1, 2, \dots, C)$ is the i-th component of q(X) and represents the predicted probability $P(Y=i\mid X)$ that X belongs to the i-th true class.
The invention has at least the following beneficial technical effects:
the invention provides a label noise learning method facing taxpayer industry classification, compared with the prior art, the invention has the advantages that:
(1) the invention creatively converts the conditional transition matrix estimation problem in label noise learning into a mixed proportion estimation problem, and constructs classifiers with consistent risks based on label noise data by solving the mixed proportion estimation problem. The method is different from the method that the prior art scheme depends on semantic clustering, and the method does not depend on an additional clustering method, so that a new error caused by the performance limitation of the clustering method is avoided.
(2) The invention extends the traditional mixture proportion estimation method from binary classification to multi-class scenarios; unlike the traditional method, which is limited to the binary case, the improved mixture proportion estimation method can be applied to multi-class situations and therefore has a wider range of application scenarios.
(3) The invention overcomes the problem of dependence of the traditional mixed proportion estimation method on the anchor point, is different from the requirement of the traditional method on the anchor point marking, constructs a completely new mixed proportion estimation problem based on the method for regenerating the composition, and realizes the direct estimation of the mixed proportion coefficient under the condition of not depending on the anchor point marking.
Drawings
FIG. 1 is an overall framework flow diagram.
Fig. 2 is a flow chart of taxpayer business information processing.
FIG. 3 is a flowchart of the taxpayer industry classification network construction and training device initialization.
Fig. 4 is a flow chart of conditional transition matrix estimation.
FIG. 5 is a flow chart of network parameter learning and taxpayer industry classification for the training device.
FIG. 6 is a schematic diagram of a tag noise learning network.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, in an embodiment of the present invention, the label noise learning method for taxpayer industry classification according to the present invention includes the following steps:
step 1. taxpayer industry information processing
As shown in fig. 2, the method for extracting the text information and the non-text information of the taxpayer and performing information processing includes the following steps:
s101, taxpayer industry text information preprocessing
Illegal characters such as special symbols, numbers and quantifiers in the taxpayer text information are deleted (FIG. 2 S101). In an embodiment, a total of 3 text features are extracted as the taxpayer text information features, including {taxpayer name, registered address, business scope}. For example, for the taxpayer name "Xi'an Xinyao Ceramics SI Technology Co., Ltd.", the special symbol "SI" is deleted first (FIG. 2 S101) and the remaining text is segmented character by character into a token sequence.
S102, embedding text words based on XLNET pre-training network
Word embedding is performed on the text based on a text pre-training network XLNET (FIG. 2S102), forming a word vector. In this embodiment, assuming the encoding length is t, the XLNet text pre-training network embeds the original lemmas into word vectors with length t. If the original text sequence length is 13, the XLNet pre-training network may map the text to a 13 × t text feature, specifically, in the embodiment, if t is 528, a 13 × 528 text feature may be obtained (fig. 2S 102).
S103, taxpayer industry text feature generation
Based on an XLNET text pre-training network, repeating the process of S102, performing word embedding on all text characteristic sequences, and further splicing the embedded word vectors to form taxpayer text characteristics (FIG. 2S 103).
In particular, in the embodiment, it is assumed that there are 3 items in total for the taxpayer industry text feature, including: { taxpayer name, registration address, business range }, and 3 text features are respectively mapped to 13 × 528, 7 × 528, and 10 × 528 text features, and the text features are spliced to obtain an overall taxpayer text feature (as shown in fig. 2S103) with a shape of 30 × 528.
S104, tax payer industry value feature processing
And extracting numerical characteristics of the taxpayer industry, including 4 numerical characteristics in total, namely { registered fund, total investment, total asset amount and interest liability } and carrying out standardized operation.
Specifically, in the present embodiment, the sample means $\mu_1, \mu_2, \dots, \mu_4$ and the sample standard deviations $\sigma_1, \sigma_2, \dots, \sigma_4$ of the 4 columns of features are first calculated. Denoting the value of the i-th numerical feature of sample X as $X_i$, the numerical feature is normalized through the z-score formula $\tilde{X}_i = \frac{X_i - \mu_i}{\sigma_i}$ (FIG. 2 S104).
S105, taxpayer industry category feature processing
The category information is encoded based on the one-hot encoding technique. In this embodiment, 2 category features are selected for encoding, specifically {unit property, accounting method}, where the unit property feature has five possible values such as enterprise, civil non-enterprise unit, public institution and social group. The corresponding one-hot codes are {10000, 01000, 00100, 00010, 00001}, and one-hot encoding is performed on all the category feature information (FIG. 2 S105).
S106, non-text characteristic generation of taxpayer industry
And constructing a linear network mapping layer, mapping the obtained numerical characteristics and the category characteristics into the dimension which is the same as the dimension of the text characteristics, and then splicing the numerical characteristics and the category characteristics to form the nontext characteristics of the taxpayer industry.
Specifically, in an embodiment, linear network mapping layers with shapes of 1 × 528 and 5 × 528 are established, respectively. The method is used for mapping the numerical features and the category features to the same dimension of the text features, and then splicing is carried out to form a non-text feature matrix (fig. 2S 106).
S107, taxpayer characteristic information generation
And splicing the taxpayer text characteristics obtained in the step S103 and the taxpayer non-text characteristics obtained in the step S106 to finally form taxpayer industry characteristic information.
In an embodiment, the text feature of shape 30 × 528 and the non-text feature of shape 6 × 528 are spliced to form the final taxpayer industry feature information, which has the shape 36 × 528 (FIG. 2 S107).
Step2, taxpayer industry classification network construction and training device initialization
As shown in fig. 3, the TextCNN network is established for taxpayer industry classification, and the shape of the TextCNN convolution kernel and the input and output dimensions are sequentially determined according to the generated taxpayer industry characteristics and the target total number to be classified. And connecting the XLNET text pre-training network and the TextCNN network in series to form a training device, and performing end-to-end training on the training device based on the label noise data for initializing the network parameters of the training device.
S201. taxpayer industry classification network construction
Constructing a TextCNN network for taxpayer industry classification, wherein the TextCNN network comprises three layers: a convolutional layer, a pooling layer, and a fully-connected layer.
Specifically, in the embodiment, convolution kernels are established according to the taxpayer text features to extract the row features of the feature map; convolution kernels of shape n × 528 are used, where n takes the values {2, 3, 4, 5, 6}. A max pooling layer is established to further compress and extract the convolved features, and finally a fully connected layer is established: assuming the total number of features of the feature map output after the pooling layer is $n_1$ and the total number of categories is c, a fully connected layer of shape $n_1 \times c$ is established; in this embodiment, c = 97 (FIG. 3 S201).
S202. Training device initialization
Connecting the XLNET text pre-training network in the step 1) and the constructed TextCNN network in series to form a training device. And performing end-to-end training based on the label noise data, and initializing network parameters of the training device.
In an embodiment, taxpayer industry label noise data is used as input, the noisy label is predicted, and end-to-end training is performed to initialize the network parameters (FIG. 3 S202). Assume the network parameters are α and a noisy sample is $(X, \bar{Y})$; the set of network parameters is w, and the output of sample X is denoted $f_\alpha(X)$. A cross-entropy loss is computed between $f_\alpha(X)$ and $\bar{Y}$, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized with the following optimization objective: $\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\ell_{CE}\big(f_\alpha(X_i),\bar{Y}_i\big)+\lambda\lVert w\rVert_2^2$
step3, solving conditional transition matrix
As shown in fig. 4, firstly, a mixture ratio estimation problem is constructed, so that the original conditional transition matrix estimation problem is converted into a mixture ratio estimation problem, secondly, a brand-new mixture ratio estimation problem is constructed based on a composition regeneration method, probability density is estimated according to a kernel density estimation method, and then a mixture ratio coefficient is solved, and a conditional transition matrix is estimated. The specific steps are as follows:
s301, construction of mixed proportion estimation problem
In the present embodiment, it is assumed that the noisy label in the taxpayer registration information is $\bar{Y}$, the sample is X and the true label of the sample is Y. If the sample X and the noisy label $\bar{Y}$ are conditionally independent given the true label, the following relationship holds for every class c: $P(X\mid\bar{Y}=c)=\sum_{i=1}^{C} P(X\mid Y=i)\,P(Y=i\mid\bar{Y}=c)$
Meanwhile, the above relationship may be converted into the following form: $\bar{P}_c = \sum_{j\neq c} H_{cj}\,\bar{P}_j + G_{cc}\,P_c$, $c = 1,\dots,C$
It follows that the above c equations are equivalent to c standard mixture proportion problems. In the embodiment, the total number of classes to be distinguished is c = 97. If the matrices H and G can be obtained, the relation $(I-H)^{-1}G = Q^{T}$ yields the overall conditional transition matrix, so the original conditional transition matrix estimation problem is converted into a mixture proportion estimation problem (FIG. 4 S301).
S302, regeneration of composition
In an embodiment, suppose the current mixture proportion problem corresponds to the noisy label classes i and j, whose sample sets are used as the positive and negative sample sets $S_P$ and $S_N$ respectively. A binary classification network is designed for prediction; assume the output of the network is $f_\eta(X)$, where X is the (dimension-reduced) input sample feature and η are the network parameters. The network is trained under supervision with the positive and negative samples; after training, the network is used to carry out posterior probability prediction on the positive-class samples. A threshold τ is selected, and the set of positive-class samples whose network prediction is smaller than the selected threshold is recorded as $S_\tau = \{x \in S_P : f_\eta(x) < \tau\}$. The samples whose posterior probability is smaller than the threshold are merged into the negative sample set, obtaining the reconstructed positive and negative sample sets $S'_P = S_P \setminus S_\tau$ and $S'_N = S_N \cup S_\tau$, thereby completing the regeneration of the samples (FIG. 4 S302).
S303. probability density function estimation
For the new sample sets $S'_P$ and $S'_N$ obtained in S302, probability density function estimation is performed on each sample set; the estimated functions obtained with the kernel density estimation method (FIG. 4 S303) are respectively $\hat{f}_P(x) = \frac{1}{\lvert S'_P\rvert}\sum_{x_i \in S'_P} K(x, x_i)$ and $\hat{f}_N(x) = \frac{1}{\lvert S'_N\rvert}\sum_{x_i \in S'_N} K(x, x_i)$.
s304. solving conditional transition matrix
A double-loop structure is established; the outer loop and the inner loop traverse the noisy classes i and j in turn, and whenever i ≠ j the processes of S302 and S303 are executed and the mixing proportion coefficient is calculated; this coefficient is $H_{ij}$. The G matrix is then obtained from the relation implied by integrating the decomposition, $G_{cc} = 1 - \sum_{j\neq c} H_{cj}$;
Based on the obtained H matrix and G matrix, the conditional transition matrix $Q^{T}$ may be obtained from the following relationship (FIG. 4 S304): $(I-H)^{-1}G = Q^{T}$.
Step4, training device network parameter learning and taxpayer industry classification
As shown in fig. 5, training the training device based on the tag noise data for learning the network parameters of the training device, and adding a conditional transition layer after the training device to complete taxpayer industry classification, which includes the following steps:
s401, learning of a training device based on tag noise data
In this embodiment, it is assumed that the input to the training device is a noisy data sample $(X, \bar{Y})$, where X is the 36 × 528 input feature matrix, mapped by the network to the 97-dimensional output vector $g_\eta(X)$. A cross-entropy loss is computed between the noisy label $\bar{Y}$ and the network output $g_\eta(X)$, the network parameters are trained according to this loss function, and the trained network parameters are denoted η (FIG. 5 S401).
S402, constructing a conditional transfer matrix layer
The conditional transition matrix layer is added after the training device, and prediction is performed for new samples.
Specifically, in the present embodiment, the estimated 97 × 97 conditional transition matrix $Q^{T}$ is used as the conditional transition layer. The original output $g_\eta(X)$ is converted to q(X), i.e. $q(X) = Q^{T} g_\eta(X)$; here q(X) denotes the true-label prediction for sample X, and $q_i(X)$ is the i-th component of q(X), representing the probability that sample X belongs to class i (FIG. 5 S402).
S403. taxpayer industry classification
As shown in fig. 6, the text information and non-text feature information of the taxpayer are respectively extracted, the taxpayer industry features are extracted through the feature extraction module, the conditional transition matrix is estimated based on the extracted features and serves as the final conditional transition layer of the training device, and taxpayer industry classification is performed based on the training device. Specifically, in the embodiment, assuming the taxpayer feature information is X, the output of the training device is q(X), the true-label prediction for sample X, with $q_i(X)$ $(i = 1, 2, \dots, 97)$ its i-th component; the index $r = \arg\max_i q_i(X)$ corresponding to the largest component is chosen as the taxpayer's industry classification (FIG. 5 S403).
It will be understood by those skilled in the art that the foregoing is only exemplary of the method of the present invention and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (8)
1. A label noise learning method for taxpayer industry classification is characterized by comprising the following steps:
Firstly, extracting text information and non-text information from the taxpayer business information, and performing text embedding and non-text encoding respectively based on an XLNet text pre-training network and encoding techniques to obtain feature information; secondly, constructing a TextCNN network for taxpayer industry classification, determining the number of network layers, the convolution kernel shapes and the input and output dimensions of each layer according to the feature information and the target number of classes, connecting the XLNet text pre-training network and the TextCNN network in series, and constructing an end-to-end training device with the noisy taxpayer industry label data as supervision; thirdly, estimating a conditional transition matrix based on an improved mixture proportion estimation method; and finally, learning the network parameters of the training device and using the conditional transition matrix as a linear layer behind the TextCNN network, so as to realize the conversion from noisy-label prediction to true taxpayer industry label prediction and carry out taxpayer industry classification.
2. The label noise learning method for taxpayer industry classification as claimed in claim 1, wherein the method specifically comprises the following steps:
1) taxpayer industry information processing
The taxpayer industry information processing comprises text information processing and non-text information processing, firstly, word segmentation and word embedding are carried out on taxpayer text information based on an XLNET text pre-training network to form corresponding word vectors, then text features are generated by splicing, secondly, numerical value features and category features in the taxpayer non-text information are preprocessed by respectively using a standardization process and a one-hot coding technology, then a linear network layer is established for feature mapping to generate non-text features consistent with text feature dimensions, and finally, the text features and the non-text features are spliced to form feature information;
2) taxpayer industry classification network construction and training device initialization
Constructing a TextCNN network for taxpayer industry classification, wherein the network comprises three layers, namely a convolution layer, a pooling layer and a fully connected layer; sequentially determining the number of layers of the TextCNN network, the convolution kernel shapes and the input and output dimensions of each layer based on the feature information and the target number of classes obtained in step 1); then connecting the XLNet pre-training network with the TextCNN network in series, and constructing an end-to-end training device with the noisy taxpayer industry labels as supervision;
3) conditional transition matrix estimation
Estimating a probability density function from the noisy taxpayer industry information data based on a kernel density estimation method, converting the conditional transition matrix estimation problem into a mixture proportion estimation problem, and solving the corresponding mixing proportion coefficients based on an improved mixture proportion estimation method to obtain the conditional transition matrix;
4) training device network parameter learning and taxpayer industry classification
And learning network parameters of the training device based on the label noise data, and after the training is finished, adding the estimated conditional transition matrix as a linear conversion layer to the training device to finish the conversion from the noise label prediction to the real label prediction, thereby realizing the tax payer industry classification.
3. The label noise learning method for taxpayer industry classification as claimed in claim 2, wherein in the step 1), the taxpayer industry information processing specifically comprises the following steps:
step 1: taxpayer industry text information preprocessing
Extracting the taxpayer industry text information, deleting special symbols, numbers, quantifiers and other meaningless tokens from the text information, and completing the preprocessing of the taxpayer text information;
step 2: text word embedding based on XLNET pre-training network
The text is encoded based on the XLNet pre-training network to generate word vectors. The XLNet pre-training model is designed on the basis of the Transformer-XL architecture and captures bidirectional context, which avoids the inconsistency between the pre-training stage and the fine-tuning stage caused by the masking mechanism of the BERT model, and its two-stream self-attention mechanism makes pre-training more effective. The XLNet model applied to Chinese uses a 24-layer network structure and adopts SentencePiece for word segmentation; the text features obtained in Step 1 are encoded with the Chinese version of XLNet, thereby obtaining the word vectors;
step 3: taxpayer industry text feature generation
Assuming that the taxpayer has k text features in total and that the XLNet pre-training network maps each token into a t-dimensional word vector, let the i-th text feature contain $h_i$ tokens; the i-th text feature is then mapped to an $h_i \times t$ matrix. The feature matrices of all text feature mappings are concatenated, so that the text features of a sample are mapped into a $\left(\sum_{i=1}^{k} h_i\right) \times t$ matrix, generating the taxpayer text feature matrix;
step 4: tax payer industry value feature processing
Standardizing the numerical features among the taxpayer non-text features: assume there are n training samples and m numerical features in total, and denote the value of the j-th numerical feature of the i-th sample as $X_{ij}$. The mean of the j-th numerical feature is $\mu_j$, satisfying $\mu_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$, and the standard deviation of the j-th numerical feature is $\sigma_j$, satisfying $\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{ij}-\mu_j\right)^{2}}$; the normalized numerical feature is $\tilde{X}_{ij} = \frac{X_{ij}-\mu_j}{\sigma_j}$;
Step 5: tax payer industry category feature processing
Coding the category features among the taxpayer non-text features: if a category feature has N possible values, it is represented by an N-dimensional vector in which the position corresponding to the feature value is set to 1 and the remaining positions are set to 0, i.e. one-hot encoding is adopted. After all category features are encoded, they are padded to the longest code length among the category features, and all padded vectors are concatenated to form the category feature matrix;
step 6: taxpayer industry non-text feature generation
After Step 4 and Step 5, m normalized numerical features and a category feature matrix of shape $v \times N_{max}$ are obtained respectively, where $N_{max}$ denotes the longest category code length. Two linear network layers are then established for feature mapping: the first linear network layer has shape $1 \times t$ and is used to convert the normalized numerical features into an $m \times t$ numerical feature matrix, and the second linear network layer has shape $N_{max} \times t$ and is used to map the category features into a $v \times t$ category feature matrix; the two mapped feature matrices are concatenated to obtain the final non-text feature matrix of shape $(v+m) \times t$;
step 7: taxpayer characteristic information generation
4. The label noise learning method for taxpayer industry classification as claimed in claim 3, wherein in the step 2), the taxpayer industry classification network construction and training device initialization are as follows: a TextCNN network for text classification is established, comprising three layers: (1) a convolution layer, (2) a max pooling layer and (3) a fully connected layer; the XLNet pre-training network in step 1) is connected in series with the TextCNN network to construct the training device, and end-to-end training is carried out with the taxpayer label noise data as supervision; the specific implementation details are as follows:
step 1: taxpayer industry classification network construction
A TextCNN network for taxpayer industry classification is constructed, comprising three layers: a convolution layer, a pooling layer and a fully connected layer; specifically, the convolution layer of the TextCNN uses convolution kernels of shape $n \times t$ to perform convolution for extracting row features, where n takes the values {2, 3, 4, 5, 6}; the TextCNN adopts a max pooling layer as its pooling layer to take the maximum value of each feature map after convolution, further compressing and extracting features; a fully connected layer is then established: assuming that taxpayer industry classification has c categories in total and the number of features after max pooling is s, a fully connected layer of shape $s \times c$ is established to map the feature information into a c-dimensional vector for taxpayer industry classification;
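A sketch of the TextCNN described above; the number of channels per kernel size is an assumption not fixed by the claim:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, t, num_classes, kernel_sizes=(2, 3, 4, 5, 6), channels=64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(1, channels, kernel_size=(n, t)) for n in kernel_sizes
        )
        s = channels * len(kernel_sizes)        # feature count after max pooling
        self.fc = nn.Linear(s, num_classes)     # s x c fully connected layer

    def forward(self, x):                       # x: (batch, L, t) feature matrices
        x = x.unsqueeze(1)                      # (batch, 1, L, t)
        pooled = [conv(x).squeeze(3).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # (batch, c) class scores
```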
step 2: training device initialization
The XLNet text pre-training network in step 1) is connected in series with the constructed TextCNN network to form the training device; the taxpayer industry label noise data are taken as input to predict noise labels, forming an end-to-end device for training, and the network parameters of the training device are initialized.
5. The method as claimed in claim 4, wherein in Step 2 of the step 2), let the network parameters be α, a sample be X, its noise label be $\tilde{Y}$, and the set of network parameters be W; the output of sample X under the training device is denoted $g_{\alpha}(X)$; a cross-entropy loss is made between $g_{\alpha}(X)$ and $\tilde{Y}$, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized with the optimization objective $\min_{\alpha \in W}\; \mathrm{CE}\!\left(g_{\alpha}(X), \tilde{Y}\right) + \lambda R(\alpha)$, where $R(\alpha)$ denotes the regularization term.
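A sketch of one training step under this objective, with the regularization term realized as weight decay (an assumption about its concrete form); model and optimizer names are illustrative:

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, x, noisy_y):
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    logits = model(x)                    # g_alpha(X), raw class scores
    loss = criterion(logits, noisy_y)    # cross entropy against noisy labels
    loss.backward()
    optimizer.step()
    return loss.item()

# The lambda-weighted regularization can be supplied as weight decay, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=lam)
```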
6. The label noise learning method for taxpayer industry classification as claimed in claim 5, wherein in the step 3), the conditional transition matrix estimation is as follows: the conditional transition matrix estimation problem in label noise learning is converted into a mixture proportion estimation problem, and the mixture proportion coefficients are solved based on an improved mixture proportion estimation method, thereby obtaining the conditional transition matrix; the specific implementation details are as follows:
step 1: hybrid ratio estimation problem construction
Assume that the noise label in the taxpayer registration information is $\tilde{Y}$ and the true label of a sample is Y; assuming that, given the true label, the sample X and the noise label $\tilde{Y}$ are independent of each other, for any class $c \in C$ there is: $P(X \mid \tilde{Y}=c) = \sum_{j=1}^{C} P(Y=j \mid \tilde{Y}=c)\, P(X \mid Y=j)$;
Denote $Q_{jc} = P(Y=j \mid \tilde{Y}=c)$, where Q represents the conditional transition probability from a noisy label to a true label; writing $\tilde{F}_c = P(X \mid \tilde{Y}=c)$ and $F_j = P(X \mid Y=j)$, the above equation is expressed in matrix form as $\tilde{F} = Q^{T} F$;
The matrix equation is further decomposed into the following form, where H is a $c \times c$ matrix whose diagonal elements are 0, and G is a real diagonal matrix of shape $c \times c$: $\tilde{F} = H\tilde{F} + GF$;
according to the nature of matrix transformation, it can be seen that matrix H, matrix G, and matrix Q satisfy the following relationships:
$(I - H)^{-1} G = Q^{T}$
Here $Q^{T}$ is the conditional transition matrix in label noise learning; the above relationship indicates that once the matrix H is solved, the conditional transition matrix can be further obtained; the decomposition of the matrix is equivalent to the following c equations: $\tilde{F}_i = \sum_{j\ne i} H_{ij}\,\tilde{F}_j + G_{ii}\,F_i,\quad i = 1,\dots,c$;
The i-th equation is further expressed in the form $\tilde{F}_i = H_{ij}\,\tilde{F}_j + (1 - H_{ij})\,G_{ij}',\quad j \ne i$;
where the remaining terms are collected into $G_{ij}' = \frac{1}{1 - H_{ij}}\Big(\sum_{k \ne i, j} H_{ik}\,\tilde{F}_k + G_{ii}\,F_i\Big)$, which is itself a valid probability distribution;
the standard mixing ratio estimation problem is expressed in the form: f ═ κ H + (1-k) G (k ≧ 0), where FHG is the probability distribution function, and samples sampled at distribution F, H are assumed to be known, where F is the mixture and H, G is the composition; the equation obtained by the above matrix decomposition:it is the standard mixing ratio estimation problem, the mixing ratio coefficient H estimated by which is the mixing ratio estimation problem ij It is the elements of matrix H; therefore, by solving a series of mixed proportion estimation problems, the H matrix can be solved, and the conditional transition matrix Q is estimated according to the matrix relation T Therefore, a classifier with consistent risk is constructed based on the tag noise data, and taxpayer industry classification is carried out;
step 2: regeneration of compositions
The solution of the mixture proportion estimation problem relies on anchor point labeling; specifically, if anchor point samples exist and are known, the maximal estimator of the mixture proportion coefficient $\hat{\kappa}$ is an unbiased estimate of the true mixture proportion coefficient $\kappa$;
specifically, first, the mixture F sample is labeled as a positive sample class Y ═ 1, the labeled composition component H sample is labeled as a negative sample class Y ═ 1, an MLP network is constructed to perform binary prediction, and the output of the network is assumed to be F η (X), wherein X is sample characteristics, eta is network parameters, the MLP network is supervised trained by using noisy positive and negative samples, after training, the posterior probability prediction is carried out on the samples of the positive sample class by using the network, a threshold value tau is selected, and the sample set of the positive sample class is recorded asSet of negative sample class samples asInputting samples of the positive sample class into the network for prediction, wherein the sample set with a prediction value smaller than a selected threshold value is recorded asThen there isBringing the samples with the posterior probability ratio smaller than the threshold value into a negative sample set, and respectively obtaining a positive sample set and a negative sample set after reconstruction:andsatisfy the requirements ofAndtherefore, the regeneration of the composition sample is completed, and the problem of dependence of the traditional mixed proportion estimation method on the anchor point is solved;
step 3: probability density estimation based on kernel density estimation
On the basis of the compositions reconstructed in Step 2, the probability density function of the sample distribution is estimated by kernel density estimation; specifically, a kernel function is established to represent the probability density contribution of an existing sample at any point of the feature space, where x is a point in the feature space, $x_i$ is a known sample, d is the dimension of the feature space, μ is the sample mean, and the kernel covariance is $\Sigma = \rho^{2}\hat{Q}$ with $\hat{Q}$ the sample covariance matrix (not the transition matrix) and ρ a bandwidth coefficient; using a Gaussian kernel, the probability density contributed by sample $x_i$ at x is: $K(x, x_i) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - x_i)^{T}\Sigma^{-1}(x - x_i)\right)$;
Then, over the entire sample set, the probability density function estimator is $\hat{f}(x) = \frac{1}{\lvert S\rvert}\sum_{x_i \in S} K(x, x_i)$, where S is the set of samples; applying this estimator to the reconstructed positive and negative sample sets $\hat{S}_{+}$ and $\hat{S}_{-}$ already obtained, the probability density functions of the reconstructed positive and negative samples are estimated as $\hat{f}_{+}(x)$ and $\hat{f}_{-}(x)$ respectively;
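A NumPy sketch of the Gaussian kernel density estimate in Step 3; the bandwidth ρ and the small diagonal jitter are assumptions:

```python
import numpy as np

def kde(samples, rho=1.0):
    """Returns a density estimator built from `samples`, shape (n, d)."""
    d = samples.shape[1]
    cov = rho ** 2 * np.cov(samples, rowvar=False) + 1e-6 * np.eye(d)
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * det)

    def density(x):
        diff = samples - x                                   # (n, d)
        quad = np.einsum("nd,de,ne->n", diff, inv, diff)     # Mahalanobis terms
        return norm * np.mean(np.exp(-0.5 * quad))           # average over kernels
    return density
```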
step 4: conditional branch matrix estimation
The c mixture proportion estimation problems constructed in Step 1 are solved in turn, and for each problem the corresponding c−1 mixture proportion coefficients are solved; let the noise label of the mixture be $\tilde{Y} = i$ and the noise label of a composition be $\tilde{Y} = j$ ($j \ne i$); the original sample sets carrying these noise labels are taken respectively as the positive and negative sample sets of one mixture proportion estimation problem; new positive and negative sample sets $\hat{S}_{+}$ and $\hat{S}_{-}$ are generated by the method of Step 2, and probability density estimation is carried out according to the kernel density estimation method of Step 3 to obtain $\hat{f}_{+}(x)$ and $\hat{f}_{-}(x)$ respectively; the maximal estimator of the mixture proportion coefficient is then computed by the maximal-estimation method of the mixture proportion problem in Step 1, $\hat{\kappa}_{ij} = \inf_{x}\,\hat{f}_{+}(x)/\hat{f}_{-}(x)$, under the constraint that G is a valid probability density function; this estimator $\hat{\kappa}_{ij}$ is the estimated value of the element $H_{ij}$ ($i \ne j$); the processes of Steps 2, 3 and 4 are repeated in a loop to solve all elements of the H matrix, then the G matrix is solved according to the following property, and the conditional transition matrix $Q^{T}$ is further obtained;
$(I - H)^{-1} G = Q^{T}$.
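A small sketch of this final matrix step, assuming the diagonal of G is fixed by requiring each decomposed noisy class-conditional to integrate to one (i.e. $G_{ii} = 1 - \sum_{j\ne i} H_{ij}$, an assumption consistent with the decomposition above):

```python
import numpy as np

def transition_from_H(H):
    """H: (c, c) matrix of estimated mixture proportions, zero diagonal."""
    c = H.shape[0]
    G = np.diag(1.0 - H.sum(axis=1))          # assumed normalization of G (see lead-in)
    QT = np.linalg.inv(np.eye(c) - H) @ G     # conditional transition matrix Q^T
    return QT
```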
7. The label noise learning method for taxpayer industry classification as claimed in claim 6, wherein in the step 4), the training device network parameter learning and taxpayer industry classification comprises the following specific steps:
step 1: training device learning based on tag noise data
Assume that the network parameters in the training device are η, a noise sample is $(X, \tilde{Y})$, and the set of network parameters is W; the network parameters in the training device are learned using the label noise data as supervision; the output of sample X under the mapping of the training device is denoted $g_{\eta}(X)$; a cross-entropy loss is made between $g_{\eta}(X)$ and $\tilde{Y}$, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized with the optimization objective $\min_{\eta \in W}\; \mathrm{CE}\!\left(g_{\eta}(X), \tilde{Y}\right) + \lambda R(\eta)$, where $R(\eta)$ denotes the regularization term.
Under the control of this optimization objective, the network of the training device is used to predict the noise label of an input sample; the output $g_{\eta}(X)$ is passed through a softmax layer, and the softmax operation applies exponential normalization to the raw output so that it can be interpreted as a predicted posterior probability; specifically, assuming the raw network output is $z = (z_1, \dots, z_C)$, the softmax output has components $p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$;
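A minimal sketch of this softmax normalization applied to the raw output before it is read as a noisy-label posterior:

```python
import numpy as np

def softmax(z):
    z = z - z.max()         # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()      # exponential normalization to a probability vector
```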
step 2: constructing a conditional transition matrix layer
After the network parameters of the training device have been learned, the network output $g_{\eta}(X)$ is passed through the softmax operation to give the posterior probability of the sample, which is used to predict the noise label; a conditional transition matrix layer is then added after the softmax layer as a transition layer, realizing the conversion from noise label prediction to true label prediction;
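A PyTorch sketch of such a transition layer, assuming QT is the $c \times c$ conditional transition matrix estimated in step 3):

```python
import torch
import torch.nn as nn

class TransitionLayer(nn.Module):
    def __init__(self, QT):
        super().__init__()
        # fixed, non-trainable transition matrix estimated beforehand
        self.register_buffer("QT", torch.as_tensor(QT, dtype=torch.float32))

    def forward(self, noisy_posterior):            # (batch, c) softmax output g(X)
        return noisy_posterior @ self.QT.T         # q(X) = Q^T g(X), row-wise per sample
```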
step 3: taxpayer industry classification
8. The label noise learning method for taxpayer industry classification as claimed in claim 7, wherein in Step 2 of the step 4), the specific method is as follows: let the noise label be $\tilde{Y}$, the true sample label be Y, and the total number of classes be C; assuming that the sample features X and the noise label $\tilde{Y}$ are independent given the true label, for any class $c \in C$ there is: $P(Y = c \mid X) = \sum_{j=1}^{C} P(Y = c \mid \tilde{Y} = j)\,P(\tilde{Y} = j \mid X)$;
The raw network output $g_{\eta}(X)$ is then converted through the conditional transition matrix $Q^{T}$ into a new output $q(X)$ satisfying $q(X) = Q^{T} g_{\eta}(X)$, where the new output q(X) is the posterior probability of the true label; $q_i(X)$ (i = 1, 2, ..., C) is the i-th component of q(X), representing the probability prediction value $P(Y = i \mid X)$ that X belongs to the i-th true-label class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210498954.4A CN114817546B (en) | 2022-05-09 | 2022-05-09 | Tax payer industry classification-oriented label noise learning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114817546A true CN114817546A (en) | 2022-07-29 |
CN114817546B CN114817546B (en) | 2024-09-10 |
Family
ID=82513012
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210498954.4A Active CN114817546B (en) | 2022-05-09 | 2022-05-09 | Tax payer industry classification-oriented label noise learning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114817546B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118506069A (en) * | 2024-05-15 | 2024-08-16 | 云南联合视觉科技有限公司 | Image classification method for label with noise situation |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070005531A1 (en) * | 2005-06-06 | 2007-01-04 | Numenta, Inc. | Trainable hierarchical memory system and method |
CN109710768A (en) * | 2019-01-10 | 2019-05-03 | 西安交通大学 | A kind of taxpayer's industry two rank classification method based on MIMO recurrent neural network |
CN110705607A (en) * | 2019-09-12 | 2020-01-17 | 西安交通大学 | Industry multi-label noise reduction method based on cyclic re-labeling self-service method |
WO2021057427A1 (en) * | 2019-09-25 | 2021-04-01 | 西安交通大学 | Pu learning based cross-regional enterprise tax evasion recognition method and system |
WO2021196520A1 (en) * | 2020-03-30 | 2021-10-07 | 西安交通大学 | Tax field-oriented knowledge map construction method and system |
CN112765358A (en) * | 2021-02-23 | 2021-05-07 | 西安交通大学 | Taxpayer industry classification method based on noise label learning |
CN112860895A (en) * | 2021-02-23 | 2021-05-28 | 西安交通大学 | Tax payer industry classification method based on multistage generation model |
CN113712511A (en) * | 2021-09-03 | 2021-11-30 | 湖北理工学院 | Stable mode discrimination method for brain imaging fusion features |
Non-Patent Citations (4)
Title |
---|
SEONG MIN KYE: "Learning noisy labels through effective transition matrix estimation to combat label errors", MACHINE LEARNING, 19 November 2021 (2021-11-19) *
施方迤;汪子扬;梁军;: "Industrial fault identification based on a semi-supervised dense ladder network", CIESC Journal (化工学报), no. 07, 9 May 2018 (2018-05-09) *
王丽客;孙媛;夏天赐;: "Tibetan entity relation extraction based on distant supervision", Journal of Chinese Information Processing (中文信息学报), no. 03, 15 March 2020 (2020-03-15) *
陈季梦;刘杰;黄亚楼;刘天笔;刘才华;: "Recognition of abbreviation expansions based on semi-supervised CRF", Computer Engineering (计算机工程), no. 04, 15 April 2013 (2013-04-15) *
Also Published As
Publication number | Publication date |
---|---|
CN114817546B (en) | 2024-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112116030B (en) | Image classification method based on vector standardization and knowledge distillation | |
Liang et al. | Symbolic graph reasoning meets convolutions | |
CN109299342B (en) | Cross-modal retrieval method based on cycle generation type countermeasure network | |
CN112765358B (en) | Taxpayer industry classification method based on noise label learning | |
CN111552807B (en) | Short text multi-label classification method | |
Jiang et al. | Variational deep embedding: An unsupervised and generative approach to clustering | |
CN111782768B (en) | Fine-grained entity identification method based on hyperbolic space representation and label text interaction | |
Ji et al. | Unsupervised few-shot feature learning via self-supervised training | |
CN114169330A (en) | Chinese named entity identification method fusing time sequence convolution and Transformer encoder | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN112613308A (en) | User intention identification method and device, terminal equipment and storage medium | |
CN117237559B (en) | Digital twin city-oriented three-dimensional model data intelligent analysis method and system | |
CN112784031B (en) | Method and system for classifying customer service conversation texts based on small sample learning | |
CN112733965A (en) | Label-free image classification method based on small sample learning | |
CN109492610B (en) | Pedestrian re-identification method and device and readable storage medium | |
CN118113849A (en) | Information consultation service system and method based on big data | |
CN113591955A (en) | Method, system, equipment and medium for extracting global information of graph data | |
CN117217368A (en) | Training method, device, equipment, medium and program product of prediction model | |
CN116704433A (en) | Self-supervision group behavior recognition method based on context-aware relationship predictive coding | |
CN114817546B (en) | Tax payer industry classification-oriented label noise learning method | |
CN116431813A (en) | Intelligent customer service problem classification method and device, electronic equipment and storage medium | |
CN111783688B (en) | Remote sensing image scene classification method based on convolutional neural network | |
CN109740682B (en) | Image identification method based on domain transformation and generation model | |
CN117496228A (en) | Knowledge distillation and graph model-based small sample increment radiation source individual identification method | |
CN116029394B (en) | Self-adaptive text emotion recognition model training method, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||