
CN114817546A - Label noise learning method for taxpayer industry classification - Google Patents

Label noise learning method for taxpayer industry classification

Info

Publication number
CN114817546A
CN114817546A (application CN202210498954.4A)
Authority
CN
China
Prior art keywords
taxpayer
network
matrix
text
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210498954.4A
Other languages
Chinese (zh)
Other versions
CN114817546B (en)
Inventor
郑庆华
曹书植
阮建飞
赵锐
董博
师斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202210498954.4A
Publication of CN114817546A
Application granted
Publication of CN114817546B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/10 Tax strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Finance (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a label noise learning method for taxpayer industry classification, comprising the following steps. First, text and non-text information are extracted from the taxpayer business information, and feature information is obtained by applying text embedding (based on an XLNet text pre-training network) and non-text encoding, respectively. Second, a TextCNN network for taxpayer industry classification is constructed: the number of layers, the convolution kernel shapes, and the input and output dimensions of each layer are determined from the feature information and the number of target classes; the XLNet pre-training network and the TextCNN network are connected in series; and an end-to-end training device is built with the noisy taxpayer industry label data as supervision. Third, the conditional transition matrix is estimated with an improved mixture proportion estimation method. Finally, the network parameters of the training device are learned, and the conditional transition matrix is appended as a linear layer after the TextCNN network, converting noisy-label predictions into true taxpayer industry label predictions for taxpayer industry classification.

Description

Label noise learning method for taxpayer industry classification
Technical Field
The invention belongs to the technical field of text classification under label noise, and in particular relates to a label noise learning method for taxpayer industry classification.
Background
In recent years, the market economy has continued to prosper, the number of enterprises keeps growing, and the division of labor among enterprises is becoming ever finer. Accordingly, upgrading and further developing tax systems has become an urgent need.
Taxpayer industry classification is a precondition for determining the policies and preferential treatment applicable to a taxpayer, and an important link in tax collection. At present, China divides taxpayer industries into 20 categories and 97 subcategories. Because of the large number of classes, traditional manual classification consumes substantial human resources, is limited by the classifiers' professional knowledge and experience, and inevitably introduces classification errors, i.e., label noise in taxpayer industry classification, causing a series of adverse effects on national statistics, tax collection, and industrial and commercial administration.
In recent years, with the acceleration of the "intelligence+" era, the artificial intelligence industry has developed rapidly and been applied in many fields, making the exploration and development of smart taxation possible. Research on enterprise taxpayer industry classification is fundamental to classified tax-source management and a key prerequisite of intelligent tax informatization. Therefore, how to train a classifier on the existing label noise data by machine learning so as to classify taxpayer industries correctly has become an urgent problem.
Related invention patents addressing the taxpayer industry classification problem include:
Document 1: Taxpayer industry two-level classification method based on a MIMO recurrent neural network (201910024324.1)
Document 2: Taxpayer industry classification method based on noisy label learning (202110201214.5)
Document 1 designs a GRU-based multi-input multi-output neural network structure, establishes a mapping from industry major categories to industry subcategories, and constructs a two-level classification structure for taxpayer industry classification. However, this method relies on strictly labeled data and lacks practical value in the presence of label noise.
Document 2 designs a BERT-CNN network for text classification and constructs consistent classifiers from label noise data based on a semantic clustering method, but the performance limitations of semantic clustering introduce new errors into the classifier.
In view of these shortcomings, the invention aims to construct a risk-consistent classifier from label noise data without relying on additional manual labeling, overcoming the classification bias caused by the semantic clustering adopted in the prior art, and ensuring that a classifier built from label noise data has, in a statistical sense, the same classification risk as one built from truly labeled data.
The core of constructing a risk-consistent classifier from label noise data is to estimate the conditional transition matrix (the matrix of conditional probabilities of the true label given the noisy label) and to build a statistically consistent classifier from it. The invention converts the conditional transition matrix estimation problem into a mixture proportion estimation problem and obtains an approximate conditional transition matrix by estimating the mixing coefficients. However, conventional mixture proportion estimation applies only to binary scenarios and depends on anchor points (samples that definitely belong to a given class), whereas taxpayer industry classification involves many industry categories, i.e., a multi-class problem, and anchor points are hard to label and obtain. Extending mixture proportion estimation from the binary to the multi-class case while overcoming the anchor-point dependence is therefore the main challenge addressed by the invention.
Disclosure of Invention
The invention aims to provide a label noise learning method for taxpayer industry classification that constructs a risk-consistent classifier by estimating the conditional transition matrix (the matrix of conditional probabilities of the true label given the noisy label) from label noise data.
The invention is realized by adopting the following technical scheme:
a label noise learning method for taxpayer industry classification comprises the following steps:
First, text and non-text information are extracted from the taxpayer business information, and feature information is obtained by applying text embedding (based on an XLNet text pre-training network) and non-text encoding, respectively. Second, a TextCNN network for taxpayer industry classification is constructed: the number of layers, the convolution kernel shapes, and the input and output dimensions of each layer are determined from the feature information and the number of target classes; the XLNet pre-training network and the TextCNN network are connected in series; and an end-to-end training device is built with the noisy taxpayer industry label data as supervision. Third, the conditional transition matrix is estimated with an improved mixture proportion estimation method. Finally, the network parameters of the training device are learned, and the conditional transition matrix is appended as a linear layer after the TextCNN network, converting noisy-label predictions into true taxpayer industry label predictions for taxpayer industry classification.
A further improvement of the invention is that the method comprises in particular the steps of:
1) taxpayer industry information processing
Taxpayer industry information processing comprises text processing and non-text processing. First, word segmentation and word embedding are performed on the taxpayer text information with an XLNet text pre-training network to form word vectors, which are concatenated into text features. Second, the numeric and categorical features in the taxpayer non-text information are preprocessed with standardization and one-hot encoding, respectively, and a linear network layer is then built for feature mapping to produce non-text features whose dimensions match the text features. Finally, the text and non-text features are concatenated to form the feature information;
2) taxpayer industry classification network construction and training device initialization
Construct a TextCNN network for taxpayer industry classification comprising a convolutional layer, a pooling layer, and a fully connected layer. Determine, in order, the number of layers, the convolution kernel shapes, and the input and output dimensions of each layer from the feature information and number of target classes obtained in step 1). Then connect the XLNet pre-training network in series with the TextCNN network, and construct an end-to-end training device with the noisy taxpayer industry labels as supervision;
3) conditional branch matrix estimation
Estimate the probability density functions from the noisy taxpayer industry data with a kernel density estimation method, convert the conditional transition matrix estimation problem into a mixture proportion estimation problem, and solve for the mixing coefficients with an improved mixture proportion estimation method to obtain the conditional transition matrix;
4) training device network parameter learning and taxpayer industry classification
Learn the network parameters of the training device from the label noise data; after training, append the estimated conditional transition matrix to the training device as a linear conversion layer to convert noisy-label predictions into true-label predictions, thereby achieving taxpayer industry classification.
The further improvement of the invention is that in the step 1), the taxpayer industry information processing specifically comprises the following steps:
step 1: taxpayer industry text information preprocessing
Extract the taxpayer industry text information and delete special symbols, numbers, quantifiers, and other meaningless tokens to finish preprocessing the taxpayer text information;
step 2: text word embedding based on XLNET pre-training network
The text is encoded with an XLNet pre-training network to generate word vectors. XLNet is designed on the Transformer architecture and captures bidirectional context, which resolves the inconsistency between the pre-training and fine-tuning stages caused by BERT's masking mechanism; its two-stream self-attention mechanism makes pre-training more effective. The XLNet model applied to Chinese uses a 24-layer network structure and adopts SentencePiece for word segmentation. The text obtained in Step 1 is encoded with the Chinese-version XLNet to obtain word vectors;
step 3: taxpayer industry text feature generation
Assume the taxpayer has k text features in total and that the XLNet pre-training network maps each token to a t-dimensional word vector. If the i-th text feature contains h_i tokens, it is mapped to an h_i × t matrix. Concatenating the feature matrices of all text feature mappings maps a sample's text features into a matrix of shape
(Σ_{i=1}^{k} h_i) × t,
generating the taxpayer text feature matrix;
step 4: tax payer industry value feature processing
Standardize the numeric features among the taxpayer non-text features. Assume there are n training samples and m numeric features, and let X_ij denote the value of the j-th numeric feature of the i-th sample. The mean of the j-th numeric feature is
μ_j = (1/n) Σ_{i=1}^{n} X_ij,
its standard deviation is
σ_j = sqrt( (1/n) Σ_{i=1}^{n} (X_ij − μ_j)² ),
and the normalized feature is
X'_ij = (X_ij − μ_j) / σ_j;
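The standardization above can be sketched in a few lines of NumPy (an illustration of the formulas; the function name and data are ours, not the patent's):

```python
import numpy as np

def standardize(X):
    """Step 4 sketch: z-score normalize each numeric feature (column).

    X: (n, m) array, n samples by m numeric features.
    Returns X' with X'_ij = (X_ij - mu_j) / sigma_j, plus mu and sigma.
    """
    mu = X.mean(axis=0)       # mu_j: mean of the j-th numeric feature
    sigma = X.std(axis=0)     # sigma_j: its (population) standard deviation
    return (X - mu) / sigma, mu, sigma

X = np.array([[100.0, 3.0],
              [200.0, 5.0],
              [300.0, 7.0]])
Xn, mu, sigma = standardize(X)   # each column now has mean 0, std 1
```

Note that `np.std` divides by n, matching the (1/n) form of the formula above.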
Step 5: tax payer industry category feature processing
Encode the categorical features among the taxpayer non-text features. A categorical feature with N possible values is represented by an N-dimensional vector: the position corresponding to the taken value is set to 1 and the remaining positions to 0, i.e., one-hot encoding. After all categorical features are encoded, pad each code to the longest code length among them, and concatenate all padded vectors into a categorical feature matrix;
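A minimal sketch of this one-hot-and-pad scheme, stacking the padded codes row-wise (names and data are illustrative, not the patent's):

```python
import numpy as np

def one_hot_padded(values, cardinalities):
    """Step 5 sketch: one-hot encode v categorical features, padding every
    code with zeros to the longest length N_max among them.

    values[i] is the taken index of feature i, in range(cardinalities[i]).
    Returns the v x N_max categorical feature matrix.
    """
    n_max = max(cardinalities)
    mat = np.zeros((len(values), n_max))
    for row, val in enumerate(values):
        mat[row, val] = 1.0   # position of the taken value set to 1
    return mat

M = one_hot_padded([2, 0], [4, 2])   # two categorical features; N_max = 4
```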
step 6: taxpayer industry non-text feature generation
Steps 4 and 5 yield m normalized numeric features and a v × N_max categorical feature matrix, respectively, where N_max denotes the longest categorical code length. Two linear network layers are then built for feature mapping: the first, of shape 1 × t, converts the normalized numeric features into an m × t numeric feature matrix; the second, of shape N_max × t, maps the categorical features into a v × t categorical feature matrix. The two mapped feature matrices are concatenated into the final non-text feature matrix of shape (v + m) × t;
step 7: taxpayer characteristic information generation
Concatenate the text feature matrix generated in Step 3 and the non-text feature matrix generated in Step 6 into a matrix of shape
(Σ_{i=1}^{k} h_i + v + m) × t
as the final feature information.
A further improvement of the invention is that in step 2), the taxpayer industry classification network is constructed and the training device initialized: a TextCNN network for text classification is established, comprising three layers: (1) a convolutional layer, (2) a max pooling layer, and (3) a fully connected layer. The XLNet pre-training network from step 1) is connected in series with the TextCNN network to build the training device, which is trained end to end with the taxpayer label noise data as supervision. The implementation details are as follows:
step 1: taxpayer industry classification network construction
Construct a TextCNN network for taxpayer industry classification comprising three layers: a convolutional layer, a pooling layer, and a fully connected layer. Specifically, the convolutional layer uses kernels of shape n × t, with n taking the values {2, 3, 4, 5, 6}, to extract row features. A max pooling layer serves as the pooling layer, taking the maximum of each convolved feature map to further compress and extract features. A fully connected layer is then established: assuming taxpayer industry classification has c target classes in total and the max pooling layer outputs s features, the s × c fully connected layer maps the feature information into a c-dimensional vector for taxpayer industry classification;
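As a shape-level illustration of this construction, the following pure-NumPy sketch mirrors the n × t convolutions, max pooling, and s × c fully connected mapping (random weights and illustrative dimensions; a real implementation would use a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

h, t, c, f = 30, 16, 97, 8      # tokens, embed dim, target classes, filters per size
kernel_sizes = [2, 3, 4, 5, 6]  # the n in the n x t convolution kernels

def textcnn_forward(features):
    """Shape-level sketch of the TextCNN: n x t convolutions over the h x t
    feature matrix, max pooling over positions, then an s x c linear map."""
    pooled = []
    for n in kernel_sizes:
        W = rng.standard_normal((f, n, t))        # f kernels of shape n x t
        # valid convolution along the h axis -> (f, h - n + 1) feature map
        conv = np.array([[(features[i:i + n] * W[k]).sum()
                          for i in range(h - n + 1)] for k in range(f)])
        pooled.append(conv.max(axis=1))           # max pooling: one value per kernel
    s_vec = np.concatenate(pooled)                # s = f * len(kernel_sizes) features
    W_fc = rng.standard_normal((s_vec.size, c))   # the s x c fully connected layer
    return s_vec @ W_fc                           # c-dimensional class scores

scores = textcnn_forward(rng.standard_normal((h, t)))
```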
step 2: training device initialization
Connect the XLNet text pre-training network from step 1) in series with the constructed TextCNN network to form the training device. With the taxpayer industry label noise data as input, the device predicts the noisy labels, forming an end-to-end system for training; the network parameters of the training device are then initialized.
A further improvement of the invention is that in Step 2 of step 2), let the sample be X, the noisy label be ỹ, and the set of network parameters be w, and write the output of the training device for sample X as g_w(X). The cross-entropy loss between g_w(X) and ỹ is computed, and a regularization term is added to prevent overfitting, with λ the regularization control coefficient. The loss function is minimized; the optimization objective is:
min_w (1/n) Σ_{i=1}^{n} CE( g_w(X_i), ỹ_i ) + λ ||w||²
where CE denotes the cross-entropy loss.
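The objective of cross-entropy on noisy labels plus an L2 regularization term can be illustrated as follows (a sketch in our own notation; the patent specifies no code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def objective(logits, noisy_labels, params, lam=1e-3):
    """Cross-entropy against the noisy labels plus L2 regularization,
    mirroring the training objective above (illustrative, not the patent's code)."""
    p = softmax(logits)
    n = logits.shape[0]
    ce = -np.log(p[np.arange(n), noisy_labels]).mean()   # cross-entropy term
    reg = lam * sum((w ** 2).sum() for w in params)      # lambda * ||w||^2
    return ce + reg

logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
loss = objective(logits, np.array([0, 1]), [np.ones((2, 2))], lam=0.01)
```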
a further improvement of the invention is that, in step 3), the conditional transition matrix estimation: converting a conditional transfer matrix estimation problem in the label noise learning problem into a mixing proportion estimation problem, and solving a mixing proportion coefficient based on an improved mixing proportion estimation method to further obtain a conditional transfer matrix; the specific implementation details are as follows:
step 1: hybrid ratio estimation problem construction
Assume that the noise label in the taxpayer registration information is
Figure BDA0003634472650000066
The true label of the sample is Y, assuming sample X and noise label
Figure BDA0003634472650000071
Independently of each other, for any class C ∈ C there is:
Figure BDA0003634472650000072
note the book
Figure BDA0003634472650000073
P i =P(X|Y=i)、
Figure BDA0003634472650000074
Where Q represents the conditional transition probability of a noisy tag to a true tag, the above equation is expressed in matrix form as follows:
Figure BDA0003634472650000075
further decomposing the matrix to obtain the following form; where H is a c x c matrix and satisfies that the diagonal element is 0, and G is a real diagonal matrix shaped as c x c;
Figure BDA0003634472650000076
according to the nature of matrix transformation, it can be seen that matrix H, matrix G, and matrix Q satisfy the following relationships:
Figure BDA0003634472650000077
(i-H) -1 G=Q T
here Q is T The matrix is a conditional transition matrix in label noise learning, and the above relationship indicates that if the matrix H is solved, the conditional transition matrix is further solved, and the decomposition of the matrix is equivalent to the following c equations:
Figure BDA0003634472650000078
the equation is further expressed in the form:
Figure BDA0003634472650000079
wherein the following are satisfied:
Figure BDA0003634472650000081
the standard mixing ratio estimation problem is expressed in the form: f ═ kH + (1-k) G (k ≧ 0), where fh G is the probability distribution function, and samples sampled at distribution F, H are assumed to be known, where F is the mixture and H, G is the composition; the equation obtained by the above matrix decomposition:
Figure BDA0003634472650000082
it is the standard mixing ratio estimation problem, the mixing ratio coefficient H estimated by which is the mixing ratio estimation problem ij It is the elements of matrix H; therefore, by solving a series of mixed proportion estimation problems, the H matrix can be solved, and the conditional transition matrix Q is estimated according to the matrix relation T Therefore, a classifier with consistent risk is constructed based on the tag noise data, and taxpayer industry classification is carried out;
step 2: regeneration of compositions
The solution of the mixed scaling problem relies on the labeling of the anchor point, specifically, the maximum estimator of the mixed scaling coefficients if the anchor point samples are present and known
Figure BDA0003634472650000083
Is an unbiased estimate of the true mixture scaling factor k;
specifically, first, the mixture F sample is labeled as a positive sample class Y ═ 1, the labeled composition component H sample is labeled as a negative sample class Y ═ 1, an MLP network is constructed to perform binary prediction, and the output of the network is assumed to be F η (X), wherein X is sample characteristics, eta is network parameters, the MLP network is supervised trained by using noisy positive and negative samples, after training, the posterior probability prediction is carried out on the samples of the positive sample class by using the network, a threshold value tau is selected, and the sample set of the positive sample class is recorded as
Figure BDA0003634472650000084
Set of negative sample class samples as
Figure BDA0003634472650000085
Inputting samples of the positive sample class into the network for prediction, wherein the sample set with a prediction value smaller than a selected threshold value is recorded as
Figure BDA0003634472650000086
Then there is
Figure BDA0003634472650000087
Bringing the samples with the posterior probability ratio smaller than the threshold value into a negative sample set to respectively obtainPositive and negative sample sets after reconstruction:
Figure BDA0003634472650000088
and
Figure BDA0003634472650000089
satisfy the requirement of
Figure BDA00036344726500000810
And
Figure BDA00036344726500000811
therefore, the regeneration of the composition sample is completed, and the problem of dependence of the traditional mixed proportion estimation method on the anchor point is solved;
step 3: probability density estimation based on kernel density estimation
On the basis of reconstructing the composition at Step2, estimating a probability density function of sample distribution based on a kernel density estimation method; specifically, a kernel function is established for representing the probability density estimation of the existing sample to any point in the feature space, wherein x is a point in the feature space, and x is i Is a known sample; and μ is the sample mean, and Σ is ρ 2 Q is the covariance matrix of the sample, then sample x is the case using a Gaussian kernel i The contribution to the probability density at x represents the form of the kernel function as follows:
Figure BDA0003634472650000091
then over the entire sample set, the probability density function estimator is:
Figure BDA0003634472650000092
wherein
Figure BDA0003634472650000093
For sets of samples, from the positive and negative sample sets already obtained
Figure BDA0003634472650000094
The probability density function of the reconstructed positive and negative samples is estimated as follows:
Figure BDA0003634472650000095
Figure BDA0003634472650000096
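A one-dimensional sketch of this estimator (our simplification of the multivariate formula, with bandwidth ρ² times the sample variance):

```python
import numpy as np

def gaussian_kde(samples, rho=0.5):
    """Step 3 sketch: kernel density estimate as the average of Gaussian
    kernels centered at the known samples (1-D version of the text's
    multivariate formula, bandwidth = rho^2 * sample variance)."""
    var = rho ** 2 * samples.var()
    def f_hat(x):
        return float(np.mean(np.exp(-(x - samples) ** 2 / (2.0 * var))
                             / np.sqrt(2.0 * np.pi * var)))
    return f_hat

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=500)
f_hat = gaussian_kde(data)

# The density estimate should carry total mass ~1 over a wide grid.
grid = np.linspace(-6.0, 6.0, 1201)
mass = sum(f_hat(x) for x in grid) * (grid[1] - grid[0])
```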
step 4: conditional branch matrix estimation
Sequentially solving c mixing proportion estimation problems constructed in Step1, solving corresponding c-1 mixing proportion coefficients for any mixing proportion estimation problem, and setting the noise label of the mixture as
Figure BDA0003634472650000097
Noise signatures of compositions are
Figure BDA0003634472650000098
Collecting original samples
Figure BDA0003634472650000099
Set of positive and negative samples, respectively, as among the mixture ratio estimation problem
Figure BDA00036344726500000910
Method based on Step2 generates new positive and negative sample sets
Figure BDA00036344726500000911
And
Figure BDA00036344726500000912
and carrying out probability density estimation according to the kernel density estimation method of Step3 to respectively obtain
Figure BDA00036344726500000913
And
Figure BDA00036344726500000914
then, the maximum estimation quantity of the mixing proportion coefficient is estimated by adopting the method of maximum estimation of the mixing proportion problem in Step1
Figure BDA00036344726500000915
Where G is a legal probability density function, estimator
Figure BDA00036344726500000916
I.e. element H ij (i ≠ j) through circulation and repetition of the steps 2,3 and 4, all elements of the H matrix can be solved, and then the G matrix and the conditional transition matrix Q can be solved according to the following properties T
Figure BDA00036344726500000917
(I-H) -1 G=Q T
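The closing algebra can be checked numerically on a toy 3-class example (the H entries are ours, chosen only so the matrices are well behaved):

```python
import numpy as np

# Given an estimated H (zero diagonal, nonnegative), G follows from the
# constraint that each equation's mixing coefficients sum to 1, and the
# conditional transition matrix is Q^T = (I - H)^{-1} G.
H = np.array([[0.0, 0.1, 0.2],
              [0.1, 0.0, 0.1],
              [0.2, 0.1, 0.0]])
G = np.diag(1.0 - H.sum(axis=1))          # G_ii = 1 - sum_{j != i} H_ij
Q_T = np.linalg.inv(np.eye(3) - H) @ G    # conditional transition matrix
row_sums = Q_T.sum(axis=1)                # each row is a probability vector
```

With G chosen this way, each row of Q^T sums to 1 automatically, since (I − H)·1 equals the vector of diagonal entries of G.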
A further improvement of the invention is that in step 4), the training device network parameters are learned and taxpayer industry classification is performed; the specific steps are as follows:
step 1: training device learning based on tag noise data
Assume that the network parameter in the training apparatus is η and the noise sample is
Figure BDA0003634472650000101
The network parameter set is combined as w, the network parameter in the training device is learned by using the label noise data as supervision, and the output of the memory sample X under the mapping of the training device is g η (X) for g η (X) and
Figure BDA0003634472650000102
making cross entropy loss, and adding a regularization term to prevent overfitting, wherein lambda is a regularization term control coefficient, minimizing a loss function, and an optimization objective is as follows:
Figure BDA0003634472650000103
under control of the optimization objective, a network of training devices is used to predict noise signatures of input samples
Figure BDA0003634472650000104
Output result g η (X) performing operation through a softmax layer, wherein the sotfmax operation performs exponential normalization processing on the original output, and the original output is expressed as a predicted value of posterior probability; specifically, assume that the original network output is
Figure BDA0003634472650000105
Performing exponential operation on the output vector and performing normalization processing by softmax, wherein the output is in the following form;
Figure BDA0003634472650000106
step 2: constructing a conditional transition matrix layer
After finishing learning the network parameters of the training device, outputting g of the network η (X) outputting the posterior probability of the sample through softmax operation
Figure BDA0003634472650000107
The method is used for predicting the noise label, a conditional branch layer is added behind a softmax layer to serve as a branch layer, and the conversion from noise label prediction to real label prediction is realized;
step 3: taxpayer industry classification
On the basis of the constructed conditional transition layer, for a newly input sample X the output of the TextCNN network is q(X); the subscript $r = \arg\max_{i} q_i(X)$ corresponding to the largest component of q(X) is obtained, which is the industry classification corresponding to the taxpayer.
A further improvement of the invention is that, in Step 2 of step 4), the specific method is as follows: let the noise label be $\tilde{Y}$, the true sample label be Y, and the total number of classes be C. Assuming the true label Y and the sample features X are conditionally independent given the noise label $\tilde{Y}$, for any class $i \in \{1, \ldots, C\}$ there is:

$P(Y=i \mid X) = \sum_{j=1}^{C} P(Y=i \mid \tilde{Y}=j)\, P(\tilde{Y}=j \mid X)$

The original network output $g_\eta(X)$ is then passed through the conditional transition matrix $Q^{T}$; this conversion turns the original output into a new output q(X) satisfying $q(X) = Q^{T} g_\eta(X)$, where the new output q(X) is the posterior probability of the true label, $\hat{P}(Y \mid X)$. Here $q_i(X)$ (i = 1, 2, ..., C) is the i-th component of q(X), representing the predicted probability $\hat{P}(Y=i \mid X)$ that X belongs to the i-th true-label class.
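As a toy illustration of the conversion just described (with a made-up 3-class transition matrix in place of the patent's 97-class one), the transition layer and the final class decision can be sketched as:

```python
import numpy as np

# hypothetical noisy-label posterior g_eta(X) after softmax (3 toy classes)
g = np.array([0.7, 0.2, 0.1])

# hypothetical conditional transition matrix; each row is a probability vector
Q = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

q = Q.T @ g            # true-label posterior prediction q(X) = Q^T g_eta(X)
r = int(np.argmax(q))  # subscript of the largest component: predicted industry class
```

Because the rows of Q and the components of g each sum to 1, q is again a probability vector.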
The invention has at least the following beneficial technical effects:
The invention provides a label noise learning method oriented to taxpayer industry classification. Compared with the prior art, the invention has the following advantages:
(1) The invention creatively converts the conditional transition matrix estimation problem in label noise learning into a mixture proportion estimation problem, and constructs a risk-consistent classifier from label noise data by solving the mixture proportion estimation problem. Unlike prior schemes that depend on semantic clustering, the method does not rely on an additional clustering method, thereby avoiding the new errors introduced by the performance limitations of a clustering method.
(2) The invention extends the traditional mixture proportion estimation method from binary classification to multi-class scenarios. Unlike traditional methods, which are limited to two classes, the improved mixture proportion estimation method can be applied to multi-class settings and has broader application scenarios.
(3) The invention overcomes the dependence of the traditional mixture proportion estimation method on anchor points. Unlike traditional methods, which require anchor-point annotation, a brand-new mixture proportion estimation problem is constructed based on the composition regeneration method, realizing direct estimation of the mixture proportion coefficients without relying on anchor-point annotation.
Drawings
FIG. 1 is an overall framework flow diagram.
Fig. 2 is a flow chart of taxpayer business information processing.
FIG. 3 is a flowchart of the taxpayer industry classification network construction and training device initialization.
Fig. 4 is a flow chart of conditional transition matrix estimation.
FIG. 5 is a flow chart of network parameter learning and taxpayer industry classification for the training device.
FIG. 6 is a schematic diagram of a tag noise learning network.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, in an embodiment of the present invention, the label noise learning method for taxpayer industry classification according to the present invention includes the following steps:
step 1. taxpayer industry information processing
As shown in fig. 2, the method for extracting the text information and the non-text information of the taxpayer and performing information processing includes the following steps:
s101, taxpayer industry text information preprocessing
Illegal characters such as special symbols, numbers and quantifiers in the taxpayer text information are deleted (Fig. 2, S101). In the embodiment, 3 text features in total are extracted as the taxpayer text information features: {taxpayer name, registered address, business scope}. For example, for the taxpayer name "Xi'an Xinyao Ceramics SI Technology Co., Ltd.", the special symbol "SI" is first deleted (Fig. 2, S101), and the name is then segmented character by character into a sequence of 13 word elements.
S102, embedding text words based on XLNET pre-training network
Word embedding is performed on the text based on the XLNet text pre-training network (Fig. 2, S102), forming word vectors. In this embodiment, assuming the encoding length is t, the XLNet text pre-training network embeds each original word element into a word vector of length t. If the original text sequence length is 13, the XLNet pre-training network maps the text to a 13 × t text feature; specifically, in the embodiment, with t = 528, a 13 × 528 text feature is obtained (Fig. 2, S102).
S103, taxpayer industry text feature generation
Based on an XLNET text pre-training network, repeating the process of S102, performing word embedding on all text characteristic sequences, and further splicing the embedded word vectors to form taxpayer text characteristics (FIG. 2S 103).
In particular, in the embodiment, it is assumed that there are 3 items in total for the taxpayer industry text feature, including: { taxpayer name, registration address, business range }, and 3 text features are respectively mapped to 13 × 528, 7 × 528, and 10 × 528 text features, and the text features are spliced to obtain an overall taxpayer text feature (as shown in fig. 2S103) with a shape of 30 × 528.
S104, tax payer industry value feature processing
Numerical features of the taxpayer industry are extracted, 4 in total, namely {registered capital, total investment, total assets, interest-bearing liabilities}, and a standardization operation is carried out.
Specifically, in the present embodiment, the sample means $\mu_1, \mu_2, \ldots, \mu_4$ and the sample standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_4$ of the 4 feature columns are first calculated. Let $X_i$ be the value of the i-th numerical feature of sample X; the z-score formula

$X_i' = \dfrac{X_i - \mu_i}{\sigma_i}$

normalizes the numerical feature (Fig. 2, S104).
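A minimal sketch of this z-score standardization over the 4 numeric columns, with made-up values for illustration:

```python
import numpy as np

# hypothetical rows: registered capital, total investment, total assets, liabilities
X = np.array([[100.0, 200.0, 500.0,  50.0],
              [300.0, 100.0, 800.0, 150.0],
              [200.0, 300.0, 200.0, 100.0]])

mu = X.mean(axis=0)    # per-column sample mean
sigma = X.std(axis=0)  # per-column sample standard deviation
Z = (X - mu) / sigma   # z-score normalization, column by column
```

After the transform each column has zero mean and unit standard deviation.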
S105, taxpayer industry category feature processing
The category information is encoded based on the one-hot encoding technique. In this embodiment, 2 category features are selected for encoding: {unit nature, accounting method}, where the unit nature feature takes one of five values: enterprise, civil non-enterprise unit, public institution, social group, and other. The corresponding one-hot codes are {10000, 01000, 00100, 00010, 00001}, respectively, and one-hot encoding is performed on all category feature information (Fig. 2, S105).
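A small sketch of the one-hot encoding of the unit-nature feature (the English value names are translations used here for illustration):

```python
# the five possible values of the "unit nature" category feature
categories = ["enterprise", "civil non-enterprise unit",
              "public institution", "social group", "other"]

def one_hot(value, categories):
    """Set the position matching the value to 1 and all other positions to 0."""
    code = [0] * len(categories)
    code[categories.index(value)] = 1
    return code

print(one_hot("enterprise", categories))  # -> [1, 0, 0, 0, 0]
```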
S106, non-text characteristic generation of taxpayer industry
A linear network mapping layer is constructed to map the obtained numerical features and category features into the same dimension as the text features; the results are then spliced to form the non-text features of the taxpayer industry.
Specifically, in the embodiment, linear network mapping layers with shapes of 1 × 528 and 5 × 528 are established, respectively, to map the numerical features and the category features to the same dimension as the text features; the mapped features are then spliced to form the non-text feature matrix (Fig. 2, S106).
S107, taxpayer characteristic information generation
And splicing the taxpayer text characteristics obtained in the step S103 and the taxpayer non-text characteristics obtained in the step S106 to finally form taxpayer industry characteristic information.
In the embodiment, the text feature with shape 30 × 528 and the non-text feature with shape 6 × 528 are spliced to form the final taxpayer industry feature information, with shape 36 × 528 (Fig. 2, S107).
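The splicing of the two feature blocks amounts to a row-wise concatenation, sketched here with zero-filled placeholders of the stated shapes:

```python
import numpy as np

text_feat = np.zeros((30, 528))     # spliced taxpayer text features
nontext_feat = np.zeros((6, 528))   # mapped numerical + category features

feat = np.vstack([text_feat, nontext_feat])  # final taxpayer industry feature matrix
print(feat.shape)  # (36, 528)
```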
Step2, taxpayer industry classification network construction and training device initialization
As shown in fig. 3, the TextCNN network is established for taxpayer industry classification, and the shape of the TextCNN convolution kernel and the input and output dimensions are sequentially determined according to the generated taxpayer industry characteristics and the target total number to be classified. And connecting the XLNET text pre-training network and the TextCNN network in series to form a training device, and performing end-to-end training on the training device based on the label noise data for initializing the network parameters of the training device.
S201. taxpayer industry classification network construction
Constructing a TextCNN network for taxpayer industry classification, wherein the TextCNN network comprises three layers: a convolutional layer, a pooling layer, and a fully-connected layer.
Specifically, in the embodiment, convolution kernels are established according to the taxpayer text features to extract row features of the feature map; convolution kernels of shape n × 528 are used, where n ∈ {2, 3, 4, 5, 6}. A max-pooling layer is established to further compress and extract the convolved features, and finally a fully connected layer is established. Assuming the total number of features of the feature map output after the pooling layer is $n_1$ and the total number of categories is c, the fully connected layer has shape $n_1 \times c$; in this embodiment, c = 97 (Fig. 3, S201).
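The layer-shape bookkeeping for this TextCNN can be sketched as follows; the filter count per kernel height and the use of global max pooling (one value per filter) are assumptions for illustration, not values given in the text:

```python
# taxpayer feature matrix shape
H, W = 36, 528
kernel_heights = [2, 3, 4, 5, 6]
filters_per_height = 2  # hypothetical filter count per kernel height

# valid convolution with an n x 528 kernel leaves (H - n + 1) x 1 rows per filter
conv_rows = [H - n + 1 for n in kernel_heights]

# global max pooling keeps one value per filter, so the pooled feature count n_1 is:
n1 = filters_per_height * len(kernel_heights)
c = 97                # total number of industry classes
fc_shape = (n1, c)    # fully connected layer mapping pooled features to class scores
```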
S202. Training device initialization
Connecting the XLNET text pre-training network in the step 1) and the constructed TextCNN network in series to form a training device. And performing end-to-end training based on the label noise data, and initializing network parameters of the training device.
In the embodiment, taxpayer industry label noise data is used as input, the noise label is predicted, and end-to-end training is performed to initialize the network parameters (Fig. 3, S202). Assume the network parameter is α and a noise sample is $(X, \tilde{Y})$; the set of network parameters is w, and denote the output for sample X as $g_\alpha(X)$. The cross-entropy loss between $g_\alpha(X)$ and $\tilde{Y}$ is computed, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized over the N training samples, with optimization objective:

$\min_{\alpha}\ \frac{1}{N}\sum_{n=1}^{N}\ell\big(g_\alpha(X_n),\tilde{Y}_n\big)+\lambda\lVert w\rVert_2^2$
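A minimal NumPy sketch of this objective, cross-entropy against the noisy labels plus an L2 regularization term, with made-up logits and a toy parameter list:

```python
import numpy as np

def objective(logits, noisy_labels, params, lam):
    """Mean cross-entropy against noisy labels plus an L2 regularization term."""
    z = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    probs = z / z.sum(axis=1, keepdims=True)
    n = len(noisy_labels)
    ce = -np.log(probs[np.arange(n), noisy_labels]).mean()  # cross-entropy term
    reg = lam * sum(np.sum(p ** 2) for p in params)         # lambda * ||w||^2
    return ce + reg

logits = np.array([[2.0, 0.5], [0.2, 1.5]])                 # hypothetical outputs
loss = objective(logits, np.array([0, 1]), [np.array([0.5, -0.5])], lam=0.01)
```

Minimizing this quantity over the network parameters is what the optimization objective above expresses.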
step3, solving conditional transition matrix
As shown in fig. 4, a mixture proportion estimation problem is first constructed, converting the original conditional transition matrix estimation problem into a mixture proportion estimation problem; next, a brand-new mixture proportion estimation problem is constructed based on the composition regeneration method, the probability densities are estimated by the kernel density estimation method, the mixture proportion coefficients are solved, and the conditional transition matrix is estimated. The specific steps are as follows:
s301, construction of mixed proportion estimation problem
In the present embodiment, assume the noisy label in the taxpayer registration information is $\tilde{Y}$, the sample is X, and the true label of the sample is Y. If the sample X and the noisy label $\tilde{Y}$ are conditionally independent given Y, the following relationships hold:

$P(X \mid \tilde{Y}=i) = \sum_{j=1}^{c} P(Y=j \mid \tilde{Y}=i)\, P(X \mid Y=j), \qquad i = 1, \ldots, c$

Meanwhile, the above relationships may be converted into the following form:

$P(X \mid \tilde{Y}=i) = \sum_{j \neq i} H_{ij}\, P(X \mid \tilde{Y}=j) + G_{ii}\, P(X \mid Y=i), \qquad i = 1, \ldots, c$

It follows that the above c equations are equivalent to c standard mixture proportion problems. In the embodiment, the total number of classes to be classified is c = 97; if the matrices H and G can be obtained, the relation $(I-H)^{-1}G = Q^{T}$ yields the overall conditional transition matrix, and the original conditional transition matrix estimation problem is thus converted into a mixture proportion estimation problem (Fig. 4, S301).
S302, regeneration of composition
In the embodiment, assume the sample sets corresponding to the noise-label classes i and j are taken as the positive and negative sample sets $S^{+}$ and $S^{-}$, respectively. A binary classification network is designed for prediction; assume the output of the network is $f_\eta(X)$, where X is the dimension-reduced input sample feature and η is the network parameter. The network is supervised-trained with the positive and negative samples. After training is completed, the network is used to perform posterior probability prediction on the samples of the positive class. A threshold τ is selected; denote the set of positive-class samples whose network prediction is smaller than the selected threshold as $S^{<\tau} = \{x \in S^{+} : f_\eta(x) < \tau\}$, so that $S^{<\tau} \subseteq S^{+}$. The samples whose posterior probability is below the threshold are copied into the negative sample set, giving the reconstructed positive and negative sample sets $\hat{S}^{+} = S^{+}$ and $\hat{S}^{-} = S^{-} \cup S^{<\tau}$, which satisfy $\hat{S}^{+} \supseteq S^{<\tau}$ and $\hat{S}^{-} \supseteq S^{-}$, thereby completing the regeneration of the samples (Fig. 4, S302).
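A toy sketch of this regeneration step; the posterior scores, sample indices, and threshold below are all made up for illustration:

```python
import numpy as np

# hypothetical posterior scores f_eta(x) for the positive-class samples
pos_scores = np.array([0.9, 0.8, 0.3, 0.1, 0.7])
pos_set = np.arange(len(pos_scores))   # indices of positive-class samples
neg_set = np.array([100, 101, 102])    # indices of negative-class samples

tau = 0.5                              # selected threshold
below = pos_set[pos_scores < tau]      # positive samples scored under the threshold

# copy the low-confidence positives into the negative set to regenerate the components
new_neg_set = np.concatenate([neg_set, below])
new_pos_set = pos_set                  # the positive set is kept as-is
```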
S303. probability density function estimation
For the new sample sets $\hat{S}^{+}$ and $\hat{S}^{-}$ obtained in S302, probability density function estimation is performed; using the kernel density estimation method, the estimated functions (Fig. 4, S303) are, respectively:

$\hat{f}^{+}(x) = \frac{1}{\lvert \hat{S}^{+} \rvert} \sum_{x_i \in \hat{S}^{+}} K(x, x_i), \qquad \hat{f}^{-}(x) = \frac{1}{\lvert \hat{S}^{-} \rvert} \sum_{x_i \in \hat{S}^{-}} K(x, x_i)$

where K is the kernel function.
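A one-dimensional sketch of Gaussian kernel density estimation over two sample sets, followed by a mixture-proportion estimate taken as the infimum of the density ratio over a grid; the sample values, bandwidth, and grid are assumptions for illustration:

```python
import numpy as np

def gaussian_kde(samples, h=0.5):
    """Return a 1-D Gaussian kernel density estimate built from the given samples."""
    samples = np.asarray(samples, dtype=float)
    def f(x):
        return np.mean(np.exp(-0.5 * ((x - samples) / h) ** 2)) / (h * np.sqrt(2 * np.pi))
    return f

f_pos = gaussian_kde([0.0, 0.2, 0.1, 1.9, 2.1])  # mixture-side samples (hypothetical)
f_neg = gaussian_kde([2.0, 1.9, 2.2, 2.1])       # component-side samples (hypothetical)

# estimate the mixture proportion as the infimum of the density ratio over a grid
grid = np.linspace(-1.0, 3.0, 401)
kappa = min(f_pos(x) / f_neg(x) for x in grid)
```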
s304. solving conditional transition matrix
A double-loop structure is established; the outer and inner loops traverse the noise-label classes i and j (i, j = 1, ..., c) in turn. Whenever i ≠ j, the procedures of S302 and S303 are executed cyclically, and the mixture proportion coefficient

$\hat{H}_{ij} = \inf_{x} \frac{\hat{f}^{+}(x)}{\hat{f}^{-}(x)}$

is calculated; this mixture proportion coefficient is $H_{ij}$. The G matrix is then obtained according to the relation

$G_{ii} = 1 - \sum_{j \neq i} H_{ij}$

Based on the obtained H matrix and G matrix, the conditional transition matrix $Q^{T}$ is obtained from the relation $(I-H)^{-1}G = Q^{T}$ (Fig. 4, S304).
Step4, training device network parameter learning and taxpayer industry classification
As shown in fig. 5, the training device is trained based on the label noise data to learn its network parameters, and a conditional transition layer is added after the training device to complete the taxpayer industry classification. The steps are as follows:
s401, learning of a training device based on tag noise data
In this embodiment, assume the input to the training device is a noisy data sample $(X, \tilde{Y})$, where X is the 36 × 528 input feature, mapped by the network to the 97-dimensional output vector $g_\eta(X)$. The cross-entropy loss between the noise label $\tilde{Y}$ and the network output $g_\eta(X)$ is computed, the network parameters are trained according to this loss function, and the trained network parameters are denoted η (Fig. 5, S401).
S402, constructing a conditional transfer matrix layer
The conditional transition matrix layer is added after the training device, and prediction is performed for new samples.
Specifically, in the present embodiment, the calculated 97 × 97 conditional transition matrix $Q^{T}$ is used as the conditional transition layer. The original output $g_\eta(X)$ is converted to q(X), i.e. $q(X) = Q^{T} g_\eta(X)$; here q(X) denotes the prediction of the true label for sample X, and $q_i(X)$, the i-th component of q(X), represents the probability that sample X belongs to class i (Fig. 5, S402).
S403. taxpayer industry classification
As shown in fig. 6, the text information and non-text feature information of the taxpayer are respectively extracted, the taxpayer industry features are extracted through the feature extraction module, the conditional transition matrix is estimated based on the extracted features and used as the final conditional transition layer of the training device, and taxpayer industry classification is performed based on the training device. Specifically, in the embodiment, assuming the taxpayer feature information is X, the output of the training device is q(X), the true-label prediction for sample X; denoting by $q_i(X)$ (i = 1, 2, ..., 97) the i-th component of q(X), the subscript of the largest component, $r = \arg\max_{i} q_i(X)$, is chosen as the taxpayer industry classification (Fig. 5, S403).
It will be understood by those skilled in the art that the foregoing is only exemplary of the method of the present invention and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A label noise learning method for taxpayer industry classification is characterized by comprising the following steps:
firstly, extracting text information and non-text information in taxpayer business information, and performing text embedding and non-text coding processing respectively based on an XLNET text pre-training network and a coding technology to obtain characteristic information; secondly, constructing a TextCNN network for taxpayer industry classification, determining the number of layers of the network, the shape of a convolution kernel and the input and output dimensions of each layer according to the characteristic information and the target classification number, connecting an XLNet text pre-training network and the TextCNN network in series, and constructing an end-to-end training device by combining noisy taxpayer industry label data as supervision; thirdly, estimating a conditional transition matrix based on an improved mixed proportion estimation method; and finally, learning network parameters in the training device, and taking the conditional transfer matrix as a linear layer behind the TextCNN network, so as to realize the conversion from noise label prediction to real taxpayer industry label prediction and carry out taxpayer industry classification.
2. The label noise learning method for taxpayer industry classification as claimed in claim 1, wherein the method specifically comprises the following steps:
1) taxpayer industry information processing
The taxpayer industry information processing comprises text information processing and non-text information processing, firstly, word segmentation and word embedding are carried out on taxpayer text information based on an XLNET text pre-training network to form corresponding word vectors, then text features are generated by splicing, secondly, numerical value features and category features in the taxpayer non-text information are preprocessed by respectively using a standardization process and a one-hot coding technology, then a linear network layer is established for feature mapping to generate non-text features consistent with text feature dimensions, and finally, the text features and the non-text features are spliced to form feature information;
2) taxpayer industry classification network construction and training device initialization
Constructing a TextCNN network for taxpayer industry classification, wherein the network comprises three layers of a convolutional layer, a pooling layer and a full-connection layer, sequentially determining the layer number of the TextCNN network, the shape of a convolutional core and the input and output dimensions of each layer based on the characteristic information and the target classification number obtained in the step 1), then connecting an XLNet pre-training network with the TextCNN network in series, and constructing an end-to-end training device by taking a noisy taxpayer industry information label as supervision;
3) conditional branch matrix estimation
Estimating a probability density function according to noisy taxpayer industry information data based on a kernel density estimation method, converting a conditional transfer matrix estimation problem into a mixed proportion estimation problem, and solving a corresponding mixed proportion coefficient based on an improved mixed proportion estimation method to obtain a conditional transfer matrix;
4) training device network parameter learning and taxpayer industry classification
And learning network parameters of the training device based on the label noise data, and after the training is finished, adding the estimated conditional transition matrix as a linear conversion layer to the training device to finish the conversion from the noise label prediction to the real label prediction, thereby realizing the tax payer industry classification.
3. The label noise learning method for taxpayer industry classification as claimed in claim 2, wherein in the step 1), the taxpayer industry information processing specifically comprises the following steps:
step 1: taxpayer industry text information preprocessing
Extracting the taxpayer industry text information, deleting special symbols, numbers and meaningless quantifier symbols in the text information, and completing the preprocessing of the taxpayer text information;
step 2: text word embedding based on XLNET pre-training network
The text is encoded based on the XLNet pre-training network to generate word vectors; the XLNet pre-training model is designed based on the Transformer and captures bidirectional context, thereby solving the inconsistency between the pre-training stage and the fine-tuning stage caused by the mask mechanism of the BERT model, and it uses a two-stream self-attention mechanism so that the pre-training effect is more pronounced; the XLNet model applied to Chinese uses a 24-layer network structure and adopts SentencePiece for word segmentation; the text features obtained in Step 1 are encoded using the Chinese version of XLNet, thereby obtaining word vectors;
step 3: taxpayer industry text feature generation
Assuming the taxpayer has k text features in total, the XLNet pre-training network maps a word element into a t-dimensional word vector; denoting by $h_i$ the number of word elements of the i-th text feature, the i-th text feature is mapped to an $h_i \times t$ matrix. The feature matrices of the text feature mappings are spliced, so the text features of a sample are mapped into a

$\left(\sum_{i=1}^{k} h_i\right) \times t$

matrix, generating the taxpayer text feature matrix;
step 4: tax payer industry value feature processing
The numerical features of the taxpayer non-text features are standardized. Assuming there are n training samples and m numerical features in total, denote the value of the j-th numerical feature of the i-th sample as $X_{ij}$. The mean of the j-th numerical feature is $\mu_j$, satisfying

$\mu_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$

the standard deviation of the j-th numerical feature is $\sigma_j$, satisfying

$\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(X_{ij} - \mu_j\right)^2}$

and the normalized numerical feature is

$X_{ij}' = \frac{X_{ij} - \mu_j}{\sigma_j}$
Step 5: tax payer industry category feature processing
The category features among the taxpayer non-text features are encoded; if a category feature has N possible values, it is encoded and represented by an N-dimensional vector. Specifically, the position corresponding to the category feature value is set to 1 and the remaining positions to 0, i.e., the one-hot encoding method is adopted; after all category features are encoded, the longest code length among the category features is selected for padding, and all padded vectors are spliced to form the category feature matrix;
step 6: taxpayer industry non-text feature generation
After Step 4 and Step 5, m normalized numerical features and a category feature matrix of shape $v \times N_{max}$ are obtained respectively, where $N_{max}$ denotes the longest category code length. Two linear network layers are then established for feature mapping: the first, of shape 1 × t, converts the standardized numerical features into an m × t numerical feature matrix; the second, of shape $N_{max} \times t$, maps the category features into a v × t category feature matrix. The two mapped feature matrices are spliced to obtain the final (v + m) × t non-text feature matrix;
step 7: taxpayer characteristic information generation
The text feature matrix generated at Step 3 and the non-text feature matrix generated at Step 6 are spliced to generate a matrix of shape

$\left(\sum_{i=1}^{k} h_i + v + m\right) \times t$

as the final feature information.
4. The label noise learning method for taxpayer industry classification as claimed in claim 3, wherein in the step 2), the taxpayer industry classification network construction and training device is initialized: establishing a TextCNN network for text classification, wherein the TextCNN network comprises three layers: (1) a convolutional layer, (2) a maximum pooling layer and (3) a full connection layer, wherein an XLNET pre-training network in the step 1) is connected with a TextCNN network in series to construct a training device, and end-to-end training is carried out by taking taxpayer label noise data as supervision; specific implementation details are as follows:
step 1: taxpayer industry classification network construction
Constructing a TextCNN network for taxpayer industry classification, wherein the TextCNN network comprises three layers, namely a convolution layer, a pooling layer and a full-connection layer; specifically, the convolution layer of the TextCNN uses a convolution kernel with the shape of n × t to perform convolution operation for extracting line features, wherein the values of n are {2, 3,4, 5, 6}, the TextCNN adopts a maximum pooling layer as a pooling layer for maximum value extraction of a feature map after convolution, further compression is performed to extract features, then a full connection layer is established, assuming that the total number of categories to be classified of taxpayer industry classification is c, and if the number of features is s after passing through the maximum pooling layer, the full connection layer with the shape of s × c is established for mapping feature information into a c-dimensional vector, and then taxpayer industry classification is performed;
step 2: training device initialization
Connecting the XLNET text pre-training network in the step 1) with the constructed TextCNN network in series to form a training device; and (3) taking the label noise data of the taxpayer industry as input, predicting the noise label, forming an end-to-end device for training, and initializing the network parameters of the training device.
5. The method as claimed in claim 4, wherein in Step 2 of step 2), the network parameter is α, the sample is X, and the noise label is $\tilde{Y}$; the set of network parameters is w, and the output for sample X is denoted $g_\alpha(X)$. The cross-entropy loss between $g_\alpha(X)$ and $\tilde{Y}$ is computed, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized over the N training samples, with optimization objective:

$\min_{\alpha}\ \frac{1}{N}\sum_{n=1}^{N}\ell\big(g_\alpha(X_n),\tilde{Y}_n\big)+\lambda\lVert w\rVert_2^2$
6. The label noise learning method for taxpayer industry classification as claimed in claim 5, wherein in step 3), for the conditional transition matrix estimation, the conditional transition matrix estimation problem in the label noise learning problem is converted into a mixture proportion estimation problem, and the mixture proportion coefficients are solved based on the improved mixture proportion estimation method to further obtain the conditional transition matrix; the specific implementation details are as follows:
step 1: Mixture proportion estimation problem construction
Assume the noise label in the taxpayer registration information is $\tilde{Y}$ and the true label of the sample is Y. Assuming the sample X and the noise label $\tilde{Y}$ are conditionally independent given Y, for any class $i \in \{1, \ldots, c\}$ there is:

$P(X \mid \tilde{Y}=i) = \sum_{j=1}^{c} P(Y=j \mid \tilde{Y}=i)\, P(X \mid Y=j)$

Denote

$Q_{ji} = P(Y=j \mid \tilde{Y}=i)$

where Q represents the conditional transition probability from a noisy label to a true label; the above equation is expressed in matrix form as follows:

$\begin{pmatrix} P(X \mid \tilde{Y}=1) \\ \vdots \\ P(X \mid \tilde{Y}=c) \end{pmatrix} = Q^{T} \begin{pmatrix} P(X \mid Y=1) \\ \vdots \\ P(X \mid Y=c) \end{pmatrix}$

The matrix is further decomposed into the following form, where H is a c × c matrix with zero diagonal and G is a real diagonal matrix of shape c × c:

$\begin{pmatrix} P(X \mid \tilde{Y}=1) \\ \vdots \\ P(X \mid \tilde{Y}=c) \end{pmatrix} = H \begin{pmatrix} P(X \mid \tilde{Y}=1) \\ \vdots \\ P(X \mid \tilde{Y}=c) \end{pmatrix} + G \begin{pmatrix} P(X \mid Y=1) \\ \vdots \\ P(X \mid Y=c) \end{pmatrix}$

According to the properties of the matrix transformation, the matrices H, G and Q satisfy the relations:

$(I-H)\,Q^{T} = G$

$(I-H)^{-1}G = Q^{T}$

Here $Q^{T}$ is the conditional transition matrix in label noise learning, and the above relation indicates that once the matrix H is solved, the conditional transition matrix is further solved. The decomposition of the matrix is equivalent to the following c equations:

$P(X \mid \tilde{Y}=i) = \sum_{j \neq i} H_{ij}\, P(X \mid \tilde{Y}=j) + G_{ii}\, P(X \mid Y=i), \qquad i = 1, \ldots, c$

where the following is satisfied:

$\sum_{j \neq i} H_{ij} + G_{ii} = 1$

The standard mixture proportion estimation problem is expressed in the form F = κH + (1 − κ)G (0 ≤ κ ≤ 1), where F, H and G are probability distribution functions, samples drawn from F and H are assumed known, F is the mixture, and H, G are its components. Each equation obtained by the above matrix decomposition is exactly a standard mixture proportion estimation problem, and the mixture proportion coefficient it estimates is $H_{ij}$, an element of the matrix H. Therefore, by solving a series of mixture proportion estimation problems, the H matrix can be solved, and the conditional transition matrix $Q^{T}$ is estimated according to the matrix relation; a risk-consistent classifier is thereby constructed based on the label noise data, and taxpayer industry classification is carried out;
step 2: regeneration of compositions
The solution of the mixed scaling problem relies on the labeling of the anchor point, specifically, the maximum estimator of the mixed scaling factor if the anchor point sample is present and known
Figure FDA0003634472640000065
Is an unbiased estimate of the true mixture scaling factor k;
specifically, first, the mixture F sample is labeled as a positive sample class Y ═ 1, the labeled composition component H sample is labeled as a negative sample class Y ═ 1, an MLP network is constructed to perform binary prediction, and the output of the network is assumed to be F η (X), wherein X is sample characteristics, eta is network parameters, the MLP network is supervised trained by using noisy positive and negative samples, after training, the posterior probability prediction is carried out on the samples of the positive sample class by using the network, a threshold value tau is selected, and the sample set of the positive sample class is recorded as
Figure FDA0003634472640000071
Set of negative sample class samples as
Figure FDA0003634472640000072
Inputting samples of the positive sample class into the network for prediction, wherein the sample set with a prediction value smaller than a selected threshold value is recorded as
Figure FDA0003634472640000073
Then there is
Figure FDA0003634472640000074
Bringing the samples with the posterior probability ratio smaller than the threshold value into a negative sample set, and respectively obtaining a positive sample set and a negative sample set after reconstruction:
Figure FDA0003634472640000075
and
Figure FDA0003634472640000076
satisfy the requirements of
Figure FDA0003634472640000077
And
Figure FDA0003634472640000078
therefore, the regeneration of the composition sample is completed, and the problem of dependence of the traditional mixed proportion estimation method on the anchor point is solved;
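A minimal sketch of the reassignment step above, where `predict_pos_prob` stands in for the trained MLP f_η and the function and set names are illustrative, not from the patent:

```python
import numpy as np

def regenerate(pos_X, neg_X, predict_pos_prob, tau=0.5):
    """Move mixture samples whose predicted positive-class posterior is below
    the threshold tau from the positive set into the negative (composition) set."""
    pos_X = np.asarray(pos_X, dtype=float)
    neg_X = np.asarray(neg_X, dtype=float)
    scores = np.array([predict_pos_prob(x) for x in pos_X])
    move = scores < tau                                      # S_{P->N}
    new_pos = pos_X[~move]                                   # S'_P = S_P \ S_{P->N}
    new_neg = np.concatenate([neg_X, pos_X[move]], axis=0)   # S'_N = S_N U S_{P->N}
    return new_pos, new_neg
```

With a toy scorer that just reads the first feature, `regenerate([[0.9], [0.1], [0.7]], [[0.2]], lambda x: x[0])` keeps two positives and grows the negative set to two samples.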
step 3: probability density estimation based on kernel density estimation
On the basis of reconstructing the composition at Step2, estimating a probability density function of sample distribution based on a kernel density estimation method; specifically, a kernel function is established for representing the probability density estimation of the existing sample to any point in the feature space, wherein x is a point in the feature space, and x is i Is a known sample; and μ is the sample mean, and Σ is ρ 2 Q is the covariance matrix of the sample, then sample x using a Gaussian kernel function i For x place probability densityThe equation for the kernel is as follows:
Figure FDA0003634472640000079
then over the entire sample set, the probability density function estimator is:
Figure FDA00036344726400000710
wherein
Figure FDA00036344726400000711
For sets of samples, from the positive and negative sample sets already obtained
Figure FDA00036344726400000712
The probability density function of the reconstructed positive and negative samples is estimated as follows:
Figure FDA00036344726400000713
Figure FDA00036344726400000714
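The Gaussian kernel density estimator above can be sketched directly; the bandwidth scaling Σ = ρ²·Cov and the small diagonal jitter for invertibility are illustrative choices, not the patent's exact parameterization:

```python
import numpy as np

def gaussian_kde(train, rho=0.5):
    """Return a density estimator built from `train` (n x d samples), using a
    Gaussian kernel with bandwidth matrix Sigma = rho^2 * Cov(train)."""
    train = np.asarray(train, dtype=float)
    n, d = train.shape
    sigma = rho ** 2 * np.cov(train, rowvar=False) + 1e-9 * np.eye(d)
    sigma_inv = np.linalg.inv(sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

    def density(x):
        diff = train - np.asarray(x, dtype=float)             # (n, d)
        quad = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
        return norm * np.exp(-0.5 * quad).mean()              # average of kernels

    return density
```

Applied to the reconstructed sets S'_P and S'_N, two such estimators give the f̂_P and f̂_N used in Step 4.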
step 4: conditional branch matrix estimation
Sequentially solving c mixing proportion estimation problems constructed in Step1, solving corresponding c-1 mixing proportion coefficients for any mixing proportion estimation problem, and setting the noise label of the mixture as
Figure FDA00036344726400000715
Noise signatures of compositions are
Figure FDA00036344726400000716
Collecting the original samples
Figure FDA00036344726400000717
Respectively as sets of positive and negative samples in a mixture ratio estimation problem
Figure FDA00036344726400000718
Method based on Step2 generates new positive and negative sample sets
Figure FDA0003634472640000081
And
Figure FDA0003634472640000082
and carrying out probability density estimation according to the kernel density estimation method of Step3 to respectively obtain
Figure FDA0003634472640000083
And
Figure FDA0003634472640000084
then, the maximum estimation quantity of the mixing proportion coefficient is estimated by adopting the method of maximum estimation of the mixing proportion problem in Step1
Figure FDA0003634472640000085
Where G is a legal probability density function, estimator
Figure FDA0003634472640000086
I.e. the element H ij And (i ≠ j) estimating values, repeating the processes Step2,3 and 4 to solve all elements of the H matrix through circulation, then solving the G matrix according to the following properties, and further solving the conditional transition matrix Q T
Figure FDA0003634472640000087
(I-H) -1 G=Q T
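Since F = κ·H + (1 − κ)·G with G ≥ 0 implies κ ≤ F(x)/H(x) at every point, the maximal κ is the infimum of the density ratio; the grid-based infimum and the helper names below are illustrative simplifications of the estimator, not the patent's exact procedure:

```python
import numpy as np

def kappa_hat(f_mix, f_comp, eval_points):
    """Plug-in mixing-proportion estimate: kappa = inf_x f_mix(x) / f_comp(x),
    clipped to [0, 1]."""
    ratios = [f_mix(x) / max(f_comp(x), 1e-300) for x in eval_points]
    return float(np.clip(min(ratios), 0.0, 1.0))

def transition_from_H(H):
    """Recover the diagonal G from sum_{j!=i} H_ij + G_ii = 1, then
    return the conditional transition matrix Q^T = (I - H)^{-1} G."""
    H = np.asarray(H, dtype=float)
    G = np.diag(1.0 - H.sum(axis=1))
    return np.linalg.inv(np.eye(H.shape[0]) - H) @ G
```

For a mixture 0.3·N(0,1) + 0.7·N(4,1) with known component N(0,1), the ratio approaches 0.3 in the left tail, so the estimator recovers the coefficient; and `transition_from_H` always yields a row-stochastic Q^T because G·1 = (I − H)·1 by construction.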
7. The label noise learning method for taxpayer industry classification as claimed in claim 6, wherein in the step 4), the trainer network parameter learning and taxpayer industry classification comprise the following specific steps:

step 1: trainer learning based on label noise data

Assume that the network parameter of the trainer is η and the noise samples are {(x_i, ỹ_i)}_{i=1}^{n}; the network parameter set is denoted w; the network parameters of the trainer are learned using the label noise data as supervision, and the output of sample X under the trainer mapping is denoted g_η(X); the cross-entropy loss is computed between g_η(X) and the noise label Ỹ, and a regularization term is added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized, with the optimization objective:

min_η (1/n) Σ_{i=1}^{n} ℓ_CE( g_η(x_i), ỹ_i ) + λ‖w‖²

under the control of this optimization objective, the trainer network is used to predict the noise label Ỹ of the input samples; the output result g_η(X) is operated on by a softmax layer, and the softmax operation performs exponential normalization on the raw output so that it is expressed as a predicted value of the posterior probability; specifically, assume the raw network output is z = (z_1, z_2, …, z_C); the output vector is exponentiated and normalized by softmax into the following form:

softmax(z)_i = e^{z_i} / Σ_{j=1}^{C} e^{z_j},  i = 1, 2, …, C
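The softmax normalization and the regularized cross-entropy objective above can be sketched as follows; the max-shift inside `softmax` is a standard numerical-stability trick, and `eta_params` is a hypothetical flat list of weight arrays standing in for w:

```python
import numpy as np

def softmax(z):
    """Exponential normalization of the raw output into posterior probabilities."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # shift by max for numerical stability
    return e / e.sum()

def ce_loss_with_l2(logits, noisy_label, eta_params, lam=1e-4):
    """Cross-entropy against the noisy label plus lambda * ||w||^2 regularization."""
    p = softmax(logits)
    reg = lam * sum(float(np.sum(np.square(w))) for w in eta_params)
    return -float(np.log(p[noisy_label] + 1e-12)) + reg
```

The shift leaves the result unchanged (softmax is invariant to adding a constant to every logit), and the loss is small when the logit of the observed noisy label dominates.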
step 2: constructing a conditional transition matrix layer

After the learning of the trainer network parameters is finished, the output g_η(X) of the network is passed through the softmax operation to output the posterior probability of the sample, i.e. the prediction of the noise label P(Ỹ | X); a conditional transition layer is added after the softmax layer as a transition layer, realizing the conversion from noise label prediction to true label prediction;

step 3: taxpayer industry classification

On the basis of the constructed conditional transition layer, for a newly input sample X the output of the TextCNN network is q(X), computed as q(X) = Q^T · softmax(g_η(X)); the subscript r corresponding to the maximum component of q(X), r = argmax_i q_i(X), is then obtained, namely the industry classification corresponding to the taxpayer.
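Steps 2 and 3 together amount to one matrix-vector product after the softmax; a sketch, assuming Q^T has already been estimated as in claim 6 (function name is illustrative):

```python
import numpy as np

def classify_industry(raw_output, QT):
    """Apply the conditional transition layer q(X) = Q^T * softmax(g_eta(X))
    and return the index of the maximum component as the industry class."""
    z = np.asarray(raw_output, dtype=float)
    g = np.exp(z - z.max())
    g = g / g.sum()                       # softmax: noisy-label posterior
    q = np.asarray(QT, dtype=float) @ g   # transition to true-label posterior
    return int(np.argmax(q)), q
```

With Q^T equal to the identity the prediction reduces to the plain softmax argmax; a transition matrix that swaps the two classes flips the decision accordingly.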
8. The label noise learning method for taxpayer industry classification as claimed in claim 7, wherein in Step 2 of the step 4), the specific method is as follows: let the noise label be Ỹ, the true sample label be Y, and the total number of classes be C; assuming that, given the true label, the sample feature X and the noise label Ỹ are independent of each other, for any class c ∈ {1, 2, …, C} there is:

P(Ỹ = c | X) = Σ_{j=1}^{C} P(Ỹ = c | Y = j) · P(Y = j | X)

the raw network output g_η(X) is then converted through the conditional transition matrix Q^T, which transforms the raw output into a new output q(X) satisfying q(X) = Q^T · g(X), where the new output q(X) = (q_1(X), q_2(X), …, q_C(X))^T is the posterior probability of the true label, and q_i(X) (i = 1, 2, …, C), the i-th component of q(X), represents the predicted probability P(Y = i | X) that X belongs to the i-th class of true labels.
CN202210498954.4A 2022-05-09 2022-05-09 Tax payer industry classification-oriented label noise learning method Active CN114817546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210498954.4A CN114817546B (en) 2022-05-09 2022-05-09 Tax payer industry classification-oriented label noise learning method

Publications (2)

Publication Number Publication Date
CN114817546A true CN114817546A (en) 2022-07-29
CN114817546B CN114817546B (en) 2024-09-10

Family

ID=82513012

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118506069A (en) * 2024-05-15 2024-08-16 云南联合视觉科技有限公司 Image classification method for label with noise situation

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070005531A1 (en) * 2005-06-06 2007-01-04 Numenta, Inc. Trainable hierarchical memory system and method
CN109710768A (en) * 2019-01-10 2019-05-03 西安交通大学 A kind of taxpayer's industry two rank classification method based on MIMO recurrent neural network
CN110705607A (en) * 2019-09-12 2020-01-17 西安交通大学 Industry multi-label noise reduction method based on cyclic re-labeling self-service method
WO2021057427A1 (en) * 2019-09-25 2021-04-01 西安交通大学 Pu learning based cross-regional enterprise tax evasion recognition method and system
CN112765358A (en) * 2021-02-23 2021-05-07 西安交通大学 Taxpayer industry classification method based on noise label learning
CN112860895A (en) * 2021-02-23 2021-05-28 西安交通大学 Tax payer industry classification method based on multistage generation model
WO2021196520A1 (en) * 2020-03-30 2021-10-07 西安交通大学 Tax field-oriented knowledge map construction method and system
CN113712511A (en) * 2021-09-03 2021-11-30 湖北理工学院 Stable mode discrimination method for brain imaging fusion features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SEONG MIN KYE: "Learning with noisy labels by efficient transition matrix estimation to combat label miscorrection", MACHINE LEARNING, 19 November 2021 (2021-11-19) *
SHI Fangyi; WANG Ziyang; LIANG Jun: "Industrial fault recognition based on a semi-supervised dense ladder network", CIESC Journal, no. 07, 9 May 2018 (2018-05-09) *
WANG Like; SUN Yuan; XIA Tianci: "Tibetan entity relation extraction based on distant supervision", Journal of Chinese Information Processing, no. 03, 15 March 2020 (2020-03-15) *
CHEN Jimeng; LIU Jie; HUANG Yalou; LIU Tianbi; LIU Caihua: "Recognition of abbreviation expansions based on semi-supervised CRF", Computer Engineering, no. 04, 15 April 2013 (2013-04-15) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant