
CN114817546B - Taxpayer industry classification-oriented label noise learning method - Google Patents

Info

Publication number
CN114817546B
Authority
CN
China
Prior art keywords
network
text
matrix
sample
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210498954.4A
Other languages
Chinese (zh)
Other versions
CN114817546A (en)
Inventor
郑庆华
曹书植
阮建飞
赵锐
董博
师斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202210498954.4A
Publication of CN114817546A
Application granted
Publication of CN114817546B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/10 Tax strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a label noise learning method for taxpayer industry classification, comprising the following steps. First, text information and non-text information are extracted from the taxpayer industry information, and text embedding (based on the XLNet pre-trained text network) and non-text encoding are applied to obtain feature information. Second, a TextCNN network for taxpayer industry classification is constructed, with the number of layers, the convolution kernel shape, and the input and output dimensions of each layer determined by the feature information and the number of target classes; the XLNet pre-trained network and the TextCNN network are connected in series and, with noisy taxpayer industry labels as supervision, form an end-to-end training device. Third, a conditional transition matrix is estimated with an improved mixture proportion estimation method. Finally, the network parameters of the training device are learned, and the conditional transition matrix is added as a linear layer after the TextCNN network to convert noisy-label predictions into true taxpayer industry label predictions, thereby classifying the taxpayer industry.

Description

Taxpayer industry classification-oriented label noise learning method
Technical Field
The invention belongs to the technical field of text classification with label noise, and particularly relates to a label noise learning method for taxpayer industry classification.
Background
In recent years the market economy has continued to flourish, the number of enterprises has kept growing, and the division of labor among enterprises has become increasingly fine-grained. As a result, upgrading and further building out the tax system has become urgent.
Taxpayer industry classification is a precondition for determining the applicable taxes, policies, and preferential treatment, and is an important link in tax collection. China currently divides taxpayer industries into 20 categories and 97 major classes. Because there are so many classes, traditional manual classification consumes substantial human resources and is limited by the expertise and experience of the person classifying, so it inevitably introduces classification errors, that is, label noise in taxpayer industry classification, which adversely affects national statistics, taxation, and business administration.
With the arrival of the "intelligent+" era, the artificial intelligence industry has developed rapidly and been applied in many fields, making intelligent taxation feasible. Classifying the industry of enterprise taxpayers is the basic work of tax-source classification management and a key prerequisite for intelligent tax informatization. How to use machine learning to train a classifier on existing data with label noise and still classify taxpayer industries correctly has therefore become an urgent problem.
Related invention patents addressing the taxpayer industry classification problem are as follows:
Document 1: Two-level taxpayer industry classification method based on a MIMO recurrent neural network (201910024324.1)
Document 2: Taxpayer industry classification method based on noisy label learning (202110201214.5)
Document 1 designs a GRU-based multi-input multi-output neural network structure, establishes a mapping from major industry classes to detailed industry classes, and builds a two-level classification structure for taxpayer industry classification. However, the method relies on accurately labeled data and has little practical value when label noise is present.
Document 2 designs a BERT-CNN network for text classification and, using a semantic clustering-based method, builds a classification-consistent classifier from label-noise data. However, the limited performance of the semantic clustering method introduces new errors into the classifier.
In view of these shortcomings, the invention aims to avoid the classification bias caused by semantic clustering and, without relying on additional manual labeling, to construct a classifier from label-noise data alone such that, in a statistical sense, it has the same classification risk as a classifier built from correctly labeled data.
The core of constructing a risk-consistent classifier from label-noise data is to estimate a conditional transition matrix (the matrix of conditional probabilities of true labels given noisy labels). The invention converts the problem of estimating the conditional transition matrix into a mixture proportion estimation problem and obtains an approximate conditional transition matrix by estimating the mixture proportion coefficients. However, traditional mixture proportion estimation applies only to binary classification and depends on anchor points (samples that definitely belong to a particular class), whereas taxpayer industry classification involves many industry classes, is a multi-class problem, and its anchor points are difficult to label and obtain. Extending mixture proportion estimation from binary to multi-class classification and removing the dependence on anchor points are therefore the main challenges the invention addresses.
Disclosure of Invention
The invention aims to provide a label noise learning method for taxpayer industry classification that estimates the conditional transition matrix (the matrix of conditional probabilities of true labels given noisy labels) from label-noise data and uses it to construct a risk-consistent classifier.
The invention is realized by adopting the following technical scheme:
A label noise learning method for taxpayer industry classification comprises the following steps:
First, text information and non-text information are extracted from the taxpayer industry information, and text embedding (based on the XLNet pre-trained text network) and non-text encoding are applied to obtain feature information. Second, a TextCNN network for taxpayer industry classification is constructed, with the number of layers, the convolution kernel shape, and the input and output dimensions of each layer determined by the feature information and the number of target classes; the XLNet pre-trained network and the TextCNN network are connected in series and, with noisy taxpayer industry labels as supervision, form an end-to-end training device. Third, a conditional transition matrix is estimated with an improved mixture proportion estimation method. Finally, the network parameters of the training device are learned, and the conditional transition matrix is added as a linear layer after the TextCNN network to convert noisy-label predictions into true taxpayer industry label predictions, thereby classifying the taxpayer industry.
The invention is further improved in that the method specifically comprises the following steps:
1) Taxpayer industry information processing
Taxpayer information processing comprises text information processing and non-text information processing. First, the taxpayer text information is segmented into tokens and embedded with the XLNet pre-trained text network to form word vectors, which are concatenated to produce the text features. Second, the numerical features and categorical features in the taxpayer non-text information are preprocessed by standardization and one-hot encoding respectively, and linear network layers then map them to non-text features whose dimension matches that of the text features. Finally, the text features and non-text features are concatenated to form the feature information;
2) Taxpayer industry classification network construction and training device initialization
A TextCNN network for taxpayer industry classification is constructed, comprising a convolution layer, a pooling layer, and a fully connected layer. The number of layers, the convolution kernel shape, and the input and output dimensions of each layer are determined from the feature information obtained in step 1) and the number of target classes. The XLNet pre-trained network is connected in series with the TextCNN network and, with the noisy taxpayer industry labels as supervision, an end-to-end training device is constructed;
3) Conditional transition matrix estimation
Probability density functions are estimated from the noisy taxpayer industry data by kernel density estimation, the conditional transition matrix estimation problem is converted into a mixture proportion estimation problem, the corresponding mixture proportion coefficients are solved with the improved mixture proportion estimation method, and the conditional transition matrix is obtained from them;
4) Training device network parameter learning and taxpayer industry classification
The network parameters of the training device are learned from the label-noise data. After training, the estimated conditional transition matrix is added after the training device as a linear conversion layer, which converts noisy-label predictions into true-label predictions and thereby classifies the taxpayer industry.
The invention is further improved in that, in step 1), taxpayer industry information processing specifically comprises the following steps:
Step1: Taxpayer industry text information preprocessing
The taxpayer industry text information is extracted, and special symbols, numbers, and meaningless tokens such as measure words are deleted from it, completing the preprocessing of the taxpayer text information;
Step2: Text word embedding based on the XLNet pre-trained network
The text is encoded with the XLNet pre-trained network to generate word vectors. The XLNet pre-trained model is based on the Transformer design and captures context in both directions. It mitigates the inconsistency between the pre-training and fine-tuning stages caused by the mask mechanism of the BERT model, and it uses a two-stream self-attention mechanism, which makes pre-training more effective. The XLNet model for Chinese uses a 24-layer network structure and adopts SentencePiece for word segmentation. The text features obtained in Step1 are encoded with the Chinese version of XLNet to obtain the word vectors;
Step3: Taxpayer industry text feature generation
Assume a taxpayer has k text features in total and that the XLNet pre-trained network maps each token to a t-dimensional word vector. If the i-th text feature has $h_i$ tokens, it is mapped to an $h_i \times t$ matrix. The feature matrices of all text features are concatenated, so that the text features of a sample are mapped to a $\left(\sum_{i=1}^{k} h_i\right) \times t$ matrix, which is the taxpayer text feature matrix;
Step4: tax administration industry numerical value characteristic processing
For the standardized operation of the numerical characteristics of non-text characteristics of taxpayers, n training samples and m numerical characteristics are assumed, the value of the j-th numerical characteristic of the i-th sample is recorded as X ij, the average value of the j-th numerical characteristic is mu j, and the method meets the requirements ofThe standard deviation of the jth numerical value characteristic is sigma j, which satisfiesThe numerical characteristics after normalization are
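The standardization in Step4 can be illustrated with a minimal sketch. It assumes the population standard deviation (division by n) and maps a constant column to zeros to avoid dividing by zero; both choices, and the function name `standardize_columns`, are illustrative assumptions rather than details stated in the source.

```python
import math

def standardize_columns(rows):
    """Z-score every numeric column: (x - mean) / std, with std computed over n."""
    n = len(rows)
    m = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(m)]
    # Assumption: a constant column (std == 0) is mapped to zeros.
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(m)]
        for r in rows
    ]

# Three samples, two numerical features; the second feature is constant.
X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
Z = standardize_columns(X)
```

After standardization each non-constant column has zero mean and unit variance, which keeps features on different scales from dominating the later linear mapping.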
Step5: tax administration industry class feature processing
Coding the category characteristics in the non-text characteristics of the taxpayer, and assuming that the category characteristics have N possible values, coding the category characteristics by adopting an N-dimensional vector; specifically, setting the corresponding position of the class feature value as 1, setting the rest positions as 0, namely adopting a one-hot coding method, selecting the longest coding length in the class features to complement after coding is completed on all the class features, and splicing the vectors after the complement to form a class feature matrix;
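A small sketch of the one-hot encoding and padding described in Step5 follows. The vocabularies and feature values are hypothetical examples, and right-padding with zeros is an assumption, since the source only says the codes are completed to the longest length.

```python
def one_hot(value, vocabulary):
    """Return a vector with 1 at the position of `value` and 0 elsewhere."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec

def encode_categories(sample, vocabularies):
    """Encode each categorical feature, then pad every code to the longest length."""
    n_max = max(len(v) for v in vocabularies)
    rows = []
    for value, vocab in zip(sample, vocabularies):
        code = one_hot(value, vocab)
        rows.append(code + [0] * (n_max - len(code)))  # assumed: right-pad with zeros
    return rows

# Hypothetical categorical features: registration type (3 values), taxpayer scale (2 values).
vocabularies = [["limited", "sole", "partnership"], ["general", "small"]]
category_matrix = encode_categories(("sole", "small"), vocabularies)
```

The result is a v x N_max matrix (here 2 x 3), one row per categorical feature.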
Step6: non-text feature generation for taxpayer industry
Respectively obtaining m standardized numerical characteristics and a class characteristic matrix with a shape of v multiplied by N max after Step4 and Step5, wherein N max represents the longest class coding length, then establishing two linear network layers for characteristic mapping, wherein the first linear network layer has a network shape of 1 multiplied by t and is used for converting the standardized numerical characteristics into the m multiplied numerical characteristic matrix, the second linear network layer has a network shape of N max multiplied by t and is used for mapping the class characteristics into a v multiplied by t class characteristic matrix, and splicing the two mapped characteristic matrices to obtain a final non-text characteristic matrix with a shape of (v+m) multiplied by t;
Step7: taxpayer characteristic information generation
Splicing the text feature matrix generated by Step3 and the non-text feature matrix generated by Step6 to generate a shape ofAs final characteristic information.
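The shape bookkeeping of Step6 and Step7 can be checked with a short sketch. The weights below are arbitrary illustrative values, bias terms are omitted, and the row order of the concatenation (text, then categorical, then numerical) is an assumption; the source only fixes the resulting shapes.

```python
def matmul(a, b):
    """Multiply an (r x k) matrix by a (k x t) matrix, both given as lists of rows."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def build_feature_matrix(text_rows, numeric, categorical, w_num, w_cat):
    # Each standardized scalar is scaled by the 1 x t weight row: m scalars -> m x t.
    numeric_rows = [[x * w for w in w_num] for x in numeric]
    # The v x N_max one-hot matrix times the N_max x t weights gives v x t.
    categorical_rows = matmul(categorical, w_cat)
    # Row-wise concatenation: (sum of h_i) + v + m rows, each of width t.
    return text_rows + categorical_rows + numeric_rows

text_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 token vectors, t = 2
numeric = [1.5, -0.5]                             # m = 2 standardized values
categorical = [[0, 1, 0], [1, 0, 0]]              # v = 2, N_max = 3
w_num = [2.0, -1.0]                               # 1 x t layer
w_cat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # N_max x t layer
features = build_feature_matrix(text_rows, numeric, categorical, w_num, w_cat)
```

With 3 token rows, v = 2, and m = 2, the final matrix has 7 rows of width t = 2.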
The invention is further improved in that, in step 2), the taxpayer industry classification network is constructed and the training device is initialized as follows. A TextCNN network is built for text classification, comprising three layers: (1) a convolution layer, (2) a max-pooling layer, and (3) a fully connected layer. The XLNet pre-trained network of step 1) is connected in series with the TextCNN network to build the training device, which is trained end to end with the taxpayer label-noise data as supervision. The implementation details are as follows:
Step1: Taxpayer industry classification network construction
A TextCNN network is constructed for taxpayer industry classification, comprising a convolution layer, a pooling layer, and a fully connected layer. Specifically, the TextCNN convolution layer uses convolution kernels of shape $n \times t$, where n is the number of rows covered by the kernel, to perform convolution and extract row features. TextCNN uses a max-pooling layer as its pooling layer, which takes the maximum value of each feature map after convolution and thereby further compresses and extracts features. A fully connected layer is then established: if the total number of classes in taxpayer industry classification is c and the number of features after the max-pooling layer is s, a fully connected layer of shape $s \times c$ maps the feature information to a c-dimensional vector, from which the taxpayer industry is classified;
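The forward pass through the three TextCNN layers can be sketched in plain Python. This is only an illustration of the shapes involved: it assumes one pooled value per kernel, no activation function, and no bias terms, none of which the source specifies, and the kernel and weight values are arbitrary.

```python
def conv_rows(features, kernel):
    """Slide an n x t kernel down the rows of a (rows x t) feature matrix."""
    n, t = len(kernel), len(kernel[0])
    out = []
    for start in range(len(features) - n + 1):
        window = features[start:start + n]
        out.append(sum(window[r][c] * kernel[r][c] for r in range(n) for c in range(t)))
    return out

def textcnn_logits(features, kernels, fc_weights):
    # Max pooling keeps the largest response of each kernel: s pooled features.
    pooled = [max(conv_rows(features, k)) for k in kernels]
    # The fully connected layer of shape s x c maps them to c class scores.
    c = len(fc_weights[0])
    return [sum(pooled[i] * fc_weights[i][j] for i in range(len(pooled))) for j in range(c)]

features = [[1, 0], [0, 1], [1, 1], [0, 0]]   # 4 rows, t = 2
kernels = [[[1, 0], [0, 1]], [[1, 1]]]        # one 2 x t kernel and one 1 x t kernel (s = 2)
fc_weights = [[1, 0, -1], [0, 1, 1]]          # s x c with c = 3
feature_map = conv_rows(features, kernels[0])
logits = textcnn_logits(features, kernels, fc_weights)
```

In a real model the kernels and the fully connected weights would be learned, and several kernel heights n would typically be used side by side.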
Step2: Training device initialization
The XLNet pre-trained text network of step 1) and the constructed TextCNN network are connected in series to form the training device. Taking the taxpayer label-noise data as input, the device predicts the noisy label, forming an end-to-end device for training, and the network parameters of the training device are initialized.
The invention is further improved in that, in Step2 of step 2), the network parameters are denoted $\alpha$, a sample is X, its noisy label is $\tilde{Y}$, and the set of network weights is w. The output of sample X under the mapping of the training device is denoted $g_\alpha(X)$. The cross-entropy loss between $g_\alpha(X)$ and $\tilde{Y}$ is computed and a regularization term is added to prevent overfitting, where $\lambda$ is the regularization coefficient. The loss function is minimized, with the optimization objective:
$$\min_{\alpha}\ \frac{1}{n}\sum_{i=1}^{n} \ell_{CE}\big(g_\alpha(X_i), \tilde{Y}_i\big) + \lambda \lVert w \rVert_2^2$$
where n is the number of training samples and $\ell_{CE}$ is the cross-entropy loss.
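The training objective described here, cross-entropy against the noisy labels plus a regularization term weighted by λ, can be sketched as follows. The squared L2 penalty is an assumption (the source names a regularization term without giving its form), and the function names are illustrative.

```python
import math

def log_sum_exp(logits):
    """Numerically stable log(sum(exp(z))) for one logit vector."""
    shift = max(logits)
    return shift + math.log(sum(math.exp(z - shift) for z in logits))

def regularized_cross_entropy(logits_batch, noisy_labels, weights, lam):
    """Mean cross-entropy against the noisy labels plus lam * ||w||_2^2 (assumed penalty)."""
    # -log softmax(z)[y] equals log_sum_exp(z) - z[y].
    data_term = sum(log_sum_exp(z) - z[y] for z, y in zip(logits_batch, noisy_labels))
    penalty = lam * sum(w * w for w in weights)
    return data_term / len(noisy_labels) + penalty

# Two samples with raw network outputs, both carrying noisy label 0.
loss = regularized_cross_entropy([[0.0, 0.0], [2.0, 0.0]], [0, 0], [1.0, 2.0], 0.1)
```

Note that the supervision signal is the noisy label, so minimizing this loss trains the network to predict the noisy label distribution, which the transition matrix later corrects.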
The invention is further improved in that, in step 3), the conditional transition matrix is estimated as follows: the conditional transition matrix estimation problem in label noise learning is converted into a mixture proportion estimation problem, and the mixture proportion coefficients are solved with the improved mixture proportion estimation method to obtain the conditional transition matrix. The implementation details are as follows:
Step1: Mixture proportion estimation problem construction
Let the noisy label in the taxpayer registration information be $\tilde{Y}$ and the true label of the sample be Y. Assuming that the sample X and the noisy label $\tilde{Y}$ are independent given the true label Y, for any class $c \in C$:
$$P(X \mid \tilde{Y}=c) = \sum_{i \in C} P(Y=i \mid \tilde{Y}=c)\, P(X \mid Y=i)$$
Denote $P_i = P(X \mid Y=i)$, $\tilde{P}_i = P(X \mid \tilde{Y}=i)$, and $Q_{ij} = P(Y=i \mid \tilde{Y}=j)$, where Q holds the conditional transition probabilities from noisy labels to true labels. In matrix form the equation above becomes:
$$[\tilde{P}_1, \ldots, \tilde{P}_c]^T = Q^T\, [P_1, \ldots, P_c]^T$$
The matrix form is further decomposed as follows, where H is a $c \times c$ matrix whose diagonal elements are 0 and G is a real $c \times c$ diagonal matrix:
$$[\tilde{P}_1, \ldots, \tilde{P}_c]^T = H\, [\tilde{P}_1, \ldots, \tilde{P}_c]^T + G\, [P_1, \ldots, P_c]^T$$
Moving the term in H to the left-hand side and comparing with the matrix form above shows that the matrices H, G, and Q satisfy:
$$(I-H)^{-1} G = Q^T$$
The matrix $Q^T$ is the conditional transition matrix in label noise learning. The relation above shows that once the matrix H is solved, the conditional transition matrix follows. The matrix decomposition is equivalent to the following c equations:
$$\tilde{P}_i = \sum_{j \ne i} H_{ij}\, \tilde{P}_j + G_{ii}\, P_i, \quad i = 1, \ldots, c$$
Each equation is further expressed as:
$$\tilde{P}_i = \sum_{j \ne i} H_{ij}\, \tilde{P}_j + \Big(1 - \sum_{j \ne i} H_{ij}\Big) P_i$$
where the following is satisfied:
$$G_{ii} = 1 - \sum_{j \ne i} H_{ij}$$
The standard mixture proportion estimation problem has the form $F = kH + (1-k)G$ ($k \ge 0$), where F, H, and G are probability distributions (here H and G denote distributions, not the matrices above) and samples drawn from F and H are assumed known; F is the mixture and H and G are its components. Each equation obtained from the matrix decomposition, $\tilde{P}_i = \sum_{j \ne i} H_{ij}\, \tilde{P}_j + \big(1 - \sum_{j \ne i} H_{ij}\big) P_i$, is a standard mixture proportion estimation problem in which the mixture proportion coefficient to be estimated, $H_{ij}$, is an element of the matrix H. Therefore, by solving a series of mixture proportion estimation problems the matrix H can be solved and the conditional transition matrix $Q^T$ estimated from the matrix relation, so that a risk-consistent classifier is built from label-noise data and used for taxpayer industry classification;
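To make the matrix relation concrete, the sketch below recovers the transition matrix from a given H. It assumes each diagonal entry of G equals one minus the corresponding row sum of H, which follows from every density in the decomposition integrating to one; the 2-class values of H are invented for illustration, and the Gauss-Jordan inverse is a plain-Python stand-in for a linear algebra library.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def transition_from_h(h):
    """Compute (I - H)^{-1} G with G_ii = 1 - sum_j H_ij (H has a zero diagonal)."""
    n = len(h)
    g = [[(1.0 - sum(h[i])) if i == j else 0.0 for j in range(n)] for i in range(n)]
    i_minus_h = [[(1.0 if i == j else 0.0) - h[i][j] for j in range(n)] for i in range(n)]
    inv = invert(i_minus_h)
    return [[sum(inv[i][k] * g[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

H = [[0.0, 0.2], [0.1, 0.0]]   # illustrative mixture proportion coefficients
T = transition_from_h(H)       # row j holds P(Y = i | noisy label = j) for all i
```

Each row of the result sums to one, as expected for conditional probabilities of the true label given one noisy label.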
Step2: Regeneration of the component
Solving a mixture proportion estimation problem ordinarily depends on labeled anchor points: if anchor samples exist and are known, the maximum estimator $\hat{k}$ of the mixture proportion coefficient is an unbiased estimate of the true mixture proportion coefficient k;
Specifically, the samples of the mixture F are first labeled as the positive class, Y = 1, and the samples of the component H as the negative class, Y = -1. An MLP network is constructed for binary prediction; its output is denoted $f_\eta(X)$, where X is the sample feature and $\eta$ is the network parameter. The MLP network is trained in a supervised manner on these noisy positive and negative samples, and after training it is used to predict the posterior probability of the positive-class samples. A threshold $\tau$ is selected. Denote the positive-class sample set by $S_{+}$ and the negative-class sample set by $S_{-}$. The positive-class samples are fed to the network, and the set of samples whose predicted value is below the threshold is denoted $S_\tau = \{x \in S_{+} : f_\eta(x) < \tau\}$, so that $S_\tau \subseteq S_{+}$. The samples whose posterior probability is below the threshold are added to the negative sample set, giving the reconstructed positive and negative sample sets $S'_{+}$ and $S'_{-}$, with $S'_{-} = S_{-} \cup S_\tau$. This completes the regeneration of the component samples and removes the dependence of traditional mixture proportion estimation on anchor points;
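The regeneration step can be sketched as below. The logistic function stands in for the trained MLP posterior and is purely hypothetical, as are the one-dimensional samples. The sketch also assumes that the selected samples are copied into the component set while the mixture set is left unchanged; the source does not say whether they are also removed from the positive set.

```python
import math

def regenerate_component(mixture, component, posterior, tau):
    """Add mixture samples whose predicted posterior is below tau to the component set."""
    selected = [x for x in mixture if posterior(x) < tau]
    # Assumption: the mixture (positive) set itself is left unchanged.
    return list(mixture), list(component) + selected, selected

def toy_posterior(x):
    """Hypothetical stand-in for the trained MLP output: P(sample is from the mixture | x)."""
    return 1.0 / (1.0 + math.exp(-x))

mixture = [-2.0, -1.0, 0.5, 2.0]   # samples labeled positive (mixture F)
component = [-2.5, -1.5]           # samples labeled negative (component H)
new_positive, new_negative, moved = regenerate_component(mixture, component, toy_posterior, 0.3)
```

Here the two mixture samples with the lowest posterior fall below the threshold 0.3 and join the component set.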
Step3: Probability density estimation based on kernel density estimation
On the basis of the component reconstructed in Step2, the probability density function of the sample distribution is estimated by kernel density estimation. Specifically, a kernel function is established to represent the contribution of an existing sample to the probability density at any point of the feature space. Let x be a point in the feature space, $x_i$ a known sample, d the dimension of the feature space, $\mu$ the sample mean, and $\Sigma = \rho^2 Q$ the covariance matrix of the samples. With a Gaussian kernel, the contribution of sample $x_i$ to the probability density at x is:
$$K(x, x_i) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}} \exp\left(-\frac{1}{2}(x-x_i)^T \Sigma^{-1} (x-x_i)\right)$$
The probability density estimator over the entire sample set is $\hat{p}(x) = \frac{1}{\lvert S\rvert}\sum_{x_i \in S} K(x, x_i)$, where S is the sample set. Writing the reconstructed positive and negative sample sets of Step2 as $S'_{+}$ and $S'_{-}$, the probability density functions of the reconstructed positive and negative samples are estimated as:
$$\hat{p}_{+}(x) = \frac{1}{\lvert S'_{+}\rvert}\sum_{x_i \in S'_{+}} K(x, x_i), \qquad \hat{p}_{-}(x) = \frac{1}{\lvert S'_{-}\rvert}\sum_{x_i \in S'_{-}} K(x, x_i)$$
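A minimal Gaussian kernel density estimator is sketched below. For simplicity it assumes an isotropic covariance, Σ = ρ²I, which is a special case of the covariance described in the text; the function names are illustrative.

```python
import math

def gaussian_kernel(x, xi, rho):
    """Contribution of sample xi to the density at x, assuming covariance rho^2 * I."""
    d = len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
    norm = (2.0 * math.pi * rho ** 2) ** (d / 2.0)
    return math.exp(-squared_distance / (2.0 * rho ** 2)) / norm

def kde(x, samples, rho):
    """Average the kernel contributions of all samples at the point x."""
    return sum(gaussian_kernel(x, xi, rho) for xi in samples) / len(samples)

density_at_sample = kde([0.0], [[0.0]], 1.0)
density_between = kde([1.0], [[0.0], [2.0]], 1.0)
```

The bandwidth ρ controls smoothness: small values give spiky estimates, large values oversmooth, which matters because the later ratio of two estimated densities is sensitive to errors in the tails.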
Step4: Conditional transition matrix estimation
The c mixture proportion estimation problems constructed in Step1 are solved in turn; each problem requires solving c-1 mixture proportion coefficients. For a given coefficient, let the noisy label of the mixture be $\tilde{Y}=i$ and the noisy label of the component be $\tilde{Y}=j$. The original sample sets with these noisy labels serve as the positive and negative sample sets $S_{+}$ and $S_{-}$ of the mixture proportion estimation problem. New positive and negative sample sets $S'_{+}$ and $S'_{-}$ are generated by the method of Step2, and their probability densities $\hat{p}_{+}$ and $\hat{p}_{-}$ are estimated by the kernel density estimation of Step3. The maximum estimator of the mixture proportion coefficient is then obtained from the mixture proportion problem of Step1:
$$\hat{k} = \max\{\,k \ge 0 : \hat{p}_{+} = k\,\hat{p}_{-} + (1-k)\,G\,\}$$
where G is a valid probability density function. The estimator $\hat{k}$ is the estimate of the element $H_{ij}$ ($i \ne j$). Repeating Step2, Step3, and Step4 in a loop solves all elements of the matrix H; the matrix G then follows from $G_{ii} = 1 - \sum_{j \ne i} H_{ij}$, and the conditional transition matrix $Q^T$ is obtained from:
$$(I-H)^{-1} G = Q^T$$
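The maximum estimator can be computed as the smallest ratio of the mixture density to the component density, since the remainder G stays non-negative only while k does not exceed that ratio anywhere. The sketch below uses analytic normal densities instead of kernel density estimates so that the true proportion (0.3) is known; the densities, the evaluation grid, and the function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mean):
    """Density of a unit-variance normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def estimate_mixture_proportion(p_mixture, p_component, points, eps=1e-12):
    """Largest k with p_mixture - k * p_component >= 0 at every evaluation point."""
    ratios = [p_mixture(x) / p_component(x) for x in points if p_component(x) > eps]
    if not ratios:
        raise ValueError("component density is near zero at every evaluation point")
    return min(1.0, min(ratios))

def component(x):
    return normal_pdf(x, 0.0)

def mixture(x):
    # True mixture: 30% component, 70% a well-separated second distribution.
    return 0.3 * normal_pdf(x, 0.0) + 0.7 * normal_pdf(x, 4.0)

grid = [i / 10.0 for i in range(-30, 31)]
k_hat = estimate_mixture_proportion(mixture, component, grid)
```

With estimated densities the minimum ratio is noisy in low-density regions, so in practice the evaluation points and bandwidth need care; this sketch sidesteps that by using exact densities.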
The invention is further improved in that, in step 4), training device network parameter learning and taxpayer industry classification are carried out as follows:
Step1: Training device learning based on label-noise data
Let the network parameters of the training device be $\eta$, a noisy sample be $(X, \tilde{Y})$, and the set of network weights be w. With the label-noise data as supervision, the network parameters of the training device are learned. The output of sample X under the mapping of the training device is denoted $g_\eta(X)$. The cross-entropy loss between $g_\eta(X)$ and $\tilde{Y}$ is computed and a regularization term is added to prevent overfitting, where $\lambda$ is the regularization coefficient. The loss function is minimized, with the optimization objective:
$$\min_{\eta}\ \frac{1}{n}\sum_{i=1}^{n} \ell_{CE}\big(g_\eta(X_i), \tilde{Y}_i\big) + \lambda \lVert w \rVert_2^2$$
Under this optimization objective, the training device network predicts the noisy label $\tilde{Y}$ of an input sample. The output $g_\eta(X)$ is passed through a softmax layer, which exponentially normalizes the raw output so that it represents predicted posterior probabilities. Specifically, if the raw network output is $z = (z_1, \ldots, z_c)$, softmax exponentiates and normalizes the output vector as follows:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}, \quad i = 1, \ldots, c$$
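The softmax normalization can be written in a few lines. Subtracting the largest logit before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate and normalize raw outputs into a probability vector."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

uniform = softmax([0.0, 0.0])
skewed = softmax([0.0, math.log(3.0)])
large = softmax([1000.0, 1000.0])
```

The outputs are non-negative and sum to one, so they can be read as the predicted posterior over noisy labels.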
Step2: Construction of the conditional transition matrix layer
After the network parameters of the training device have been learned, the network output $g_\eta(X)$ passes through the softmax operation to give the posterior probability $P(\tilde{Y} \mid X)$ of the sample's noisy label. A conditional transition layer is added after the softmax layer that predicts the noisy label, converting the noisy-label prediction into a true-label prediction;
Step3: Taxpayer industry classification
With the conditional transition layer in place, for a newly input sample X the output of the network is q(X). Computing $r = \arg\max_{i} q_i(X)$ gives the index r of the largest component of q(X), which is the industry class of the taxpayer.
The invention is further improved in that Step2 of step 4) is carried out as follows. Let the noisy label be $\tilde{Y}$, the true label of the sample be Y, and the total number of classes be C, and assume that the sample feature X and the noisy label $\tilde{Y}$ are independent of each other. Then for any class $i \in \{1, \ldots, C\}$:
$$P(Y=i \mid X) = \sum_{j=1}^{C} P(Y=i \mid \tilde{Y}=j)\, P(\tilde{Y}=j \mid X)$$
The original network output $g_\eta(X)$ is converted with the estimated conditional transition matrix into a new output q(X) satisfying $q(X) = Q\, g_\eta(X)$, where $Q_{ij} = P(Y=i \mid \tilde{Y}=j)$ is the transpose of the estimated matrix $Q^T$. The new output q(X) is the posterior probability $P(Y \mid X)$ of the true label, where $q_i(X)$ (i = 1, 2, ..., C) is the i-th component of q(X) and represents the predicted probability $P(Y=i \mid X)$ that X belongs to the i-th true class.
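The conversion from a noisy-label posterior to a true-label posterior, followed by the arg-max decision, can be sketched as follows. The indexing convention is an assumption made explicit in the code: `transition[j][i]` holds P(Y = i | noisy label = j), so each row sums to one. The numerical values are invented for illustration.

```python
def true_label_posterior(noisy_posterior, transition):
    """q_i = sum_j P(Y = i | noisy = j) * P(noisy = j | X).

    Assumed layout: transition[j][i] = P(Y = i | noisy label = j).
    """
    c = len(noisy_posterior)
    return [sum(transition[j][i] * noisy_posterior[j] for j in range(c)) for i in range(c)]

def predict_industry(noisy_posterior, transition):
    """Return the index of the largest component of the true-label posterior."""
    q = true_label_posterior(noisy_posterior, transition)
    return max(range(len(q)), key=q.__getitem__)

transition = [[0.9, 0.1], [0.4, 0.6]]   # illustrative estimated transition probabilities
noisy_posterior = [0.3, 0.7]            # softmax output of the trained network
q = true_label_posterior(noisy_posterior, transition)
predicted = predict_industry(noisy_posterior, transition)
```

In this example the network favors noisy class 1, but because samples labeled 1 often truly belong to class 0, the corrected prediction is class 0.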
The invention has at least the following beneficial technical effects:
The invention provides a label noise learning method for taxpayer industry classification which, compared with the prior art, has the following advantages:
(1) The invention converts the conditional transition matrix estimation problem in label noise learning into a mixture proportion estimation problem and, by solving it, constructs a risk-consistent classifier from label-noise data. Unlike the prior-art scheme that relies on semantic clustering, the method needs no additional clustering method and so avoids the new errors introduced by the limited performance of clustering.
(2) The invention extends traditional mixture proportion estimation from binary to multi-class scenarios. Whereas the traditional method is limited to binary classification, the improved method applies to multi-class problems and therefore to a wider range of scenarios.
(3) The invention removes the dependence of traditional mixture proportion estimation on anchor points. Instead of requiring anchor labels as the traditional method does, it constructs a new mixture proportion estimation problem by regenerating the component and estimates the mixture proportion coefficients directly without anchor labels.
Drawings
Fig. 1 is a flow chart of the overall framework.
Fig. 2 is a flow chart of taxpayer industry information processing.
Fig. 3 is a flow chart of taxpayer industry classification network construction and training device initialization.
Fig. 4 is a flow chart of conditional transfer matrix estimation.
Fig. 5 is a flow chart of training device network parameter learning and taxpayer industry classification.
Fig. 6 is a schematic diagram of the label noise learning network.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in fig. 1, in the implementation of the present invention, the label noise learning method for taxpayer industry classification of the present invention includes the following steps:
Step 1, taxpayer industry information processing
As shown in Fig. 2, the text information and non-text information of the taxpayer are extracted separately and processed, specifically as follows:
S101, taxpayer industry text information preprocessing
Illegal characters such as special symbols, digits and measure words are deleted from the taxpayer text information (Fig. 2, S101). In this embodiment, three text fields are extracted as the taxpayer's text information: {taxpayer name, registered address, business scope}. Suppose one taxpayer name is "Xin ceramic SI limited science and technology company in Xishan". The special symbol SI is deleted first (Fig. 2, S101), and the sequence is then split character by character, giving { Xin, an, xin, , ceramic, limited, ke, skill, gong, si }.
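The preprocessing in S101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess_text` and the sample company name are hypothetical, and the removal of measure words, which needs a word list or a part-of-speech tagger, is left out.

```python
import re

def preprocess_text(text):
    """Drop symbols, punctuation, spaces and digits, then split into characters."""
    # \W matches non-word characters (symbols, punctuation, spaces),
    # \d matches digits and _ is listed explicitly because \w includes it.
    cleaned = re.sub(r"[\W\d_]+", "", text)
    # Character-level split, as in the embodiment's example.
    return list(cleaned)

# Hypothetical taxpayer name containing symbols and digits.
tokens = preprocess_text("西安#鑫陶瓷@2022有限公司")
print(tokens)
```

Chinese characters count as word characters in Python's Unicode-aware `re` module, so only the symbols and digits are removed.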
S102, text word embedding based on XLNet pre-training network
The text is embedded with the XLNet text pre-training network to form word vectors (Fig. 2, S102). In this embodiment, with an encoding length of t, the XLNet text pre-training network embeds each original token into a word vector of length t. If the original text sequence has length 13, the XLNet pre-training network maps the text to a 13×t text feature. Specifically, in this embodiment t = 528 is chosen, giving a 13×528 text feature (Fig. 2, S102).
S103, generating text features of taxpayer industry
Using the XLNet text pre-training network, S102 is repeated to embed all text feature sequences, and the resulting word vectors are concatenated to form the taxpayer text features (Fig. 2, S103).
Specifically, in this embodiment, the taxpayer industry text information has 3 fields: {taxpayer name, registered address, business scope}. The 3 fields are mapped to text features of shape 13×528, 7×528 and 10×528 respectively, and these are concatenated into an overall taxpayer text feature of shape 30×528 (Fig. 2, S103).
S104, taxpayer industry numerical feature processing
The numerical features of the taxpayer industry are extracted, comprising the 4 features {registered funds, investment sum, asset sum, interest liabilities}, and are standardized.
Specifically, in this embodiment, the sample means $\mu_1, \mu_2, \ldots, \mu_4$ and sample standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_4$ of the 4 feature columns are computed first. Writing $X_i$ for the value of the i-th numerical feature of sample X, the numerical features are then normalized with the z-score formula $X_i' = (X_i - \mu_i) / \sigma_i$ (Fig. 2, S104).
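The z-score standardization of S104 can be illustrated with NumPy. This is a sketch under assumptions: the function name `zscore` and the sample values are made up, and the population standard deviation (division by n) is used because the text does not say which estimator is intended.

```python
import numpy as np

def zscore(features):
    """Standardize each column of an (n_samples, n_features) array."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)       # per-feature sample mean
    sigma = features.std(axis=0)     # per-feature standard deviation (assumed: divide by n)
    return (features - mu) / sigma

# Hypothetical values for the 4 numerical features of 3 taxpayers.
raw = np.array([[100.0, 50.0, 300.0, 20.0],
                [200.0, 80.0, 500.0, 10.0],
                [400.0, 20.0, 100.0, 60.0]])
standardized = zscore(raw)
```

A constant column would give a zero standard deviation, so a real pipeline would need to guard against division by zero.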
S105, taxpayer industry category feature processing
The category information is encoded with one-hot encoding. In this embodiment, 2 category features are selected for encoding: {unit property, accounting method}. The unit property takes one of five values: enterprise, non-governmental non-enterprise unit, public institution, social group, and other. The corresponding one-hot codes are {10000, 01000, 00100, 00010, 00001}. All category feature information is one-hot encoded in this way (Fig. 2, S105).
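The one-hot encoding of the unit property can be sketched as below. The helper name `one_hot` and the English category strings are illustrative choices; only the five-value scheme and the codes come from the text.

```python
import numpy as np

# The five unit-property values listed in the embodiment, in code order.
UNIT_PROPERTIES = [
    "enterprise",
    "non-governmental non-enterprise unit",
    "public institution",
    "social group",
    "other",
]

def one_hot(value, categories):
    """Return a vector with 1 at the position of `value` and 0 elsewhere."""
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(value)] = 1
    return vec

code = one_hot("public institution", UNIT_PROPERTIES)
```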
S106, non-text feature generation in taxpayer industry
Linear network mapping layers are constructed to map the numerical features and category features to the same dimension as the text features; the mapped features are then concatenated to form the non-text features of the taxpayer industry.
Specifically, in this embodiment, linear network mapping layers of shape 1×528 and 5×528 are set up to map the numerical features and the category features, respectively, to the dimension of the text features. The results are concatenated to form the non-text feature matrix (Fig. 2, S106).
S107, generating taxpayer characteristic information
The taxpayer text features obtained in S103 and the taxpayer non-text features obtained in S106 are concatenated to form the final taxpayer industry feature information.
In this embodiment, the text feature of shape 30×528 and the non-text feature of shape 6×528 are concatenated to form the final taxpayer industry feature information, which has shape 36×528 (Fig. 2, S107).
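The shape bookkeeping of S103 to S107 can be checked with a short NumPy sketch. Everything here is a stand-in: random arrays replace the XLNet embeddings and the learned linear layers, bias terms are omitted, and the second category feature is assumed to be padded to length 5 so that a single 5×528 layer applies to both.

```python
import numpy as np

rng = np.random.default_rng(0)
t = 528  # embedding width used in the embodiment

# Stand-ins for the XLNet embeddings of name, address and business scope.
text_parts = [rng.normal(size=(h, t)) for h in (13, 7, 10)]
text_features = np.concatenate(text_parts, axis=0)      # 30 x 528

# 4 standardized numerical features, each mapped to a row by a 1 x 528 layer.
numeric = rng.normal(size=(4, 1))
W_num = rng.normal(size=(1, t))
numeric_features = numeric @ W_num                      # 4 x 528

# 2 one-hot category features of length 5, mapped by a 5 x 528 layer.
categories = np.eye(5)[[0, 2]]
W_cat = rng.normal(size=(5, t))
category_features = categories @ W_cat                  # 2 x 528

non_text_features = np.concatenate([numeric_features, category_features], axis=0)
X = np.concatenate([text_features, non_text_features], axis=0)
print(X.shape)
```

The 4 numerical rows and 2 category rows account for the 6×528 non-text block, which together with the 30×528 text block gives the 36×528 input.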
Step 2, taxpayer industry classification network construction and training device initialization
As shown in Fig. 3, a TextCNN network is built for taxpayer industry classification, and the shape of the TextCNN convolution kernels and the input and output dimensions are determined in turn from the generated taxpayer industry features and the total number of target classes. The XLNet text pre-training network and the TextCNN network are connected in series to form the training device, which is trained end to end on the label-noise data to initialize its network parameters.
S201, construction of taxpayer industry classification network
A TextCNN network is constructed for taxpayer industry classification. It comprises three layers: a convolution layer, a pooling layer and a fully connected layer.
Specifically, in this embodiment, convolution kernels are set up according to the taxpayer features to extract row features of the feature map; kernels of shape n×528 are used, where n ∈ {2, 3, 4, 5, 6}. A max pooling layer then further compresses and extracts the convolved features. Finally a fully connected layer is set up: if the pooling layer outputs $n_1$ features in total and the total number of classes is c, the fully connected layer has shape $n_1 \times c$. In this embodiment c = 97 (Fig. 3, S201).
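The three-layer structure of S201 can be traced with a plain NumPy forward pass. This sketch only illustrates the shapes: the weights are random, a single filter per kernel height is assumed (so $n_1 = 5$), and no activation function or bias is included because the text does not specify them.

```python
import numpy as np

def textcnn_forward(x, kernels, fc_weight):
    """Convolve full-width kernels over rows, max-pool, then apply a dense layer."""
    pooled = []
    for kernel in kernels:
        n = kernel.shape[0]
        # One response per window of n consecutive rows of the feature matrix.
        responses = [np.sum(x[r:r + n] * kernel) for r in range(x.shape[0] - n + 1)]
        pooled.append(max(responses))        # max pooling over positions
    features = np.array(pooled)              # n1 pooled features
    return features @ fc_weight              # fully connected layer, n1 x c

rng = np.random.default_rng(0)
x = rng.normal(size=(36, 528))                                   # taxpayer feature matrix
kernels = [rng.normal(size=(n, 528)) * 0.01 for n in (2, 3, 4, 5, 6)]
fc_weight = rng.normal(size=(len(kernels), 97))                  # c = 97 classes
logits = textcnn_forward(x, kernels, fc_weight)
print(logits.shape)
```

A practical model would use many filters per kernel height and a deep-learning framework; the loop form is kept here so each of the three layers is visible.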
S202, training device initialization
The XLNet text pre-training network of Step 1 and the constructed TextCNN network are connected in series to form the training device. End-to-end training on the label-noise data initializes the network parameters of the training device.
In this embodiment, the taxpayer industry label-noise data are the input, the noise labels are predicted, and end-to-end training is carried out to initialize the network parameters (Fig. 3, S202). Let the network parameter be α, the noisy sample be $(X, \tilde{Y})$ and the network parameter set be w, and denote the output of sample X under the mapping of the training device by $g_\alpha(X)$. A cross-entropy loss is computed between $g_\alpha(X)$ and the noise label $\tilde{Y}$, a regularization term with control coefficient λ is added to prevent overfitting, and the loss function is minimized. The optimization objective is as follows:
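The kind of objective described here, cross-entropy against the noise labels plus a regularization term weighted by λ, can be sketched as below. The squared L2 norm is an assumption, since the form of the regularizer is not visible in the text, and the function names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def regularized_cross_entropy(logits, noisy_labels, weights, lam):
    """Mean cross-entropy against noisy labels plus lam * squared L2 norm (assumed)."""
    probs = softmax(logits)
    n = logits.shape[0]
    ce = -np.mean(np.log(probs[np.arange(n), noisy_labels]))
    l2 = sum(np.sum(w ** 2) for w in weights)
    return ce + lam * l2

# Two samples, four classes, uniform predictions, one 2 x 2 weight matrix of ones.
loss = regularized_cross_entropy(np.zeros((2, 4)), np.array([0, 1]),
                                 [np.ones((2, 2))], lam=0.1)
```

In this toy call the cross-entropy term is log 4 and the penalty is 0.1 × 4.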
Step 3, solving the conditional transfer matrix
As shown in Fig. 4, a mixing proportion estimation problem is first constructed, so that the original conditional transfer matrix estimation problem becomes a mixing proportion estimation problem. A new mixing proportion estimation problem is then built by regenerating the component samples, the probability densities are estimated by kernel density estimation, the mixing proportion coefficients are solved, and the conditional transfer matrix is estimated. The specific steps are as follows:
S301, construction of the mixing proportion estimation problem
In this embodiment, let the noise label in the taxpayer registration information be $\tilde{Y}$, the sample be X and the true sample label be Y. If the sample X and the noise label $\tilde{Y}$ are independent of each other, the following relationship holds:
Meanwhile, the above relationship may be converted into the following form:
It follows that the c equations above are equivalent to c standard mixing proportion estimation problems. In this embodiment the total number of classes is c = 97. If the matrix H and the matrix G can be found, the original equations can be solved and the overall conditional transfer matrix obtained, so the original conditional transfer matrix estimation problem is converted into a mixing proportion estimation problem (Fig. 4, S301).
S302, regenerating the component samples
In this embodiment, the sample sets of two noise-label classes i and j are taken as the positive and negative sample sets respectively. A binary classification network is designed for prediction; let its output be $f_\eta(X)$, where X is the sample feature after dimension reduction of the input and η is the parameter of the network. The perceptron network is trained with supervision on the positive and negative samples. After training, the network predicts the posterior probability of the positive-class samples. A threshold τ is selected, and the positive-class samples whose predicted output is below the threshold form a subset of the positive sample set. This subset is copied into the negative sample set, giving the reconstructed positive and negative sample sets and completing the regeneration of the samples (Fig. 4, S302).
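The regeneration step of S302 amounts to a threshold-and-copy operation once the binary network has scored the positive samples. The sketch below assumes the scores are already available and that the positive set is left unchanged; whether the low-scoring samples are also removed from the positive set is not clear from the text. The function name and data are hypothetical.

```python
import numpy as np

def regenerate_sets(positive, negative, positive_scores, tau):
    """Copy positive samples scored below tau into the negative set."""
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    low = np.asarray(positive_scores) < tau          # low-confidence positives
    new_negative = np.concatenate([negative, positive[low]], axis=0)
    return positive, new_negative

# Three positive samples, one negative sample, scores from a binary classifier.
pos = [[0.0], [1.0], [2.0]]
neg = [[5.0]]
new_pos, new_neg = regenerate_sets(pos, neg, [0.9, 0.2, 0.1], tau=0.5)
```

Here the second and third positive samples fall below τ = 0.5 and are appended to the negative set.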
S303, probability density function estimation
Probability density functions are estimated for the new positive and negative sample sets obtained in S302. Kernel density estimation is used, giving the following estimated functions (Fig. 4, S303):
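A Gaussian kernel density estimate of the kind named in S303 can be sketched with NumPy. This is a simplified version under assumptions: a single isotropic bandwidth replaces the covariance-based kernel described in the claims, and the function name is illustrative.

```python
import numpy as np

def gaussian_kde_density(points, samples, bandwidth):
    """Evaluate an isotropic Gaussian KDE built from `samples` at `points`.

    points: (m, d) array of query locations; samples: (n, d) array of known samples.
    """
    points = np.asarray(points, dtype=float)
    samples = np.asarray(samples, dtype=float)
    d = samples.shape[1]
    diff = points[:, None, :] - samples[None, :, :]          # (m, n, d)
    sq_dist = np.sum(diff ** 2, axis=2) / bandwidth ** 2
    norm = (2.0 * np.pi) ** (d / 2.0) * bandwidth ** d       # Gaussian normalizer
    # Average the kernel contributions of all samples.
    return np.exp(-0.5 * sq_dist).sum(axis=1) / (len(samples) * norm)

samples = np.array([[-1.0], [0.0], [1.0]])
grid = np.linspace(-10.0, 10.0, 2001)[:, None]
density = gaussian_kde_density(grid, samples, bandwidth=0.5)
```

The same function would be applied once to the regenerated positive set and once to the regenerated negative set to obtain the two density estimates.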
S304, solving a conditional transfer matrix
A double loop is set up in which the outer and inner loops traverse the noise-label classes i and j in turn, with $i \neq j$. S302 and S303 are executed in each iteration to determine the mixing proportion coefficient, which is the element $H_{ij}$. The G matrix is then obtained from the following relation;
From the H matrix and the G matrix, the relation $(I - H)^{-1} G = Q^T$ gives the conditional transfer matrix $Q^T$ (Fig. 4, S304).
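The double loop and the final matrix relation can be sketched as follows. Two points are assumptions rather than statements from the text: the per-pair estimator is passed in as a hypothetical callable `mixing_coefficient(i, j)`, and the diagonal of G is taken as $G_{ii} = 1 - \sum_{j \neq i} H_{ij}$, which is what normalization of the densities would require if each noisy class density is written as $\sum_{j \neq i} H_{ij}$ times the other noisy densities plus $G_{ii}$ times the clean density.

```python
import numpy as np

def build_transfer_matrix(c, mixing_coefficient):
    """Fill H with pairwise mixing coefficients and solve (I - H)^{-1} G = Q^T."""
    H = np.zeros((c, c))
    for i in range(c):              # outer loop over noise-label classes
        for j in range(c):          # inner loop over noise-label classes
            if i != j:
                H[i, j] = mixing_coefficient(i, j)
    # Assumed normalization: each row of H plus the diagonal of G sums to 1.
    G = np.diag(1.0 - H.sum(axis=1))
    # Solve the linear system instead of forming the inverse explicitly.
    Q_T = np.linalg.solve(np.eye(c) - H, G)
    return H, G, Q_T

# Toy example with a constant, made-up coefficient for every class pair.
H, G, Q_T = build_transfer_matrix(4, lambda i, j: 0.05)
```

Under the assumed form of G, every row of the resulting matrix sums to 1, which is a convenient sanity check on an estimated H.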
Step 4, training device network parameter learning and taxpayer industry classification
As shown in Fig. 5, the training device is trained on the label-noise data to learn its network parameters, and a conditional transfer layer is added after the training device to complete the taxpayer industry classification, specifically as follows:
S401, training device learning based on label noise data
In this embodiment, the input of the training device is a noisy data sample $(X, \tilde{Y})$, where X is the 36×528 input feature matrix and $\tilde{Y}$ is the noise label. The network maps X to the 97-dimensional output vector $g_\eta(X)$. A cross-entropy loss is computed between the noise label $\tilde{Y}$ and the network output $g_\eta(X)$, the network parameters are trained according to this loss function, and the trained network parameters are denoted η (Fig. 5, S401).
S402, constructing a conditional transfer matrix layer
A conditional transfer matrix layer is added after the training device to predict new samples.
Specifically, in this embodiment, the estimated 97×97 conditional transfer matrix $Q^T$ serves as the conditional transfer layer. It converts the original output $g_\eta(X)$ into $q(X)$, that is, $q(X) = Q^T g_\eta(X)$, where $q(X)$ is the prediction of the true label of sample X and its i-th component $q_i(X)$ is the probability that sample X belongs to class i (Fig. 5, S402).
S403, taxpayer industry classification
As shown in Fig. 6, the text information and non-text information of the taxpayer are extracted separately and passed through the feature extraction module to give the taxpayer industry features. The conditional transfer matrix is estimated from the extracted features and used as the final conditional transfer layer of the training device, on which the taxpayer industry classification is based. Specifically, in this embodiment, let the taxpayer feature information be X and the output of the training device be $q(X)$, the true-label prediction for sample X, with $q_i(X)$ ($i = 1, 2, \ldots, 97$) its i-th component. The index of the largest component, $\arg\max_i q_i(X)$, is taken as the taxpayer's industry class (Fig. 5, S403).
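The prediction path of S402 and S403, softmax followed by the transfer layer and an argmax, can be sketched as below. The 3-class matrix is made up for illustration, the function name is hypothetical, and the returned index is 0-based whereas the text numbers classes from 1.

```python
import numpy as np

def predict_industry(logits, Q_T):
    """Apply softmax, then the transfer layer q = Q^T g, then pick the top class."""
    shifted = logits - np.max(logits)
    g = np.exp(shifted) / np.sum(np.exp(shifted))   # noisy-label posterior
    q = Q_T @ g                                      # true-label posterior
    return int(np.argmax(q)), q

logits = np.array([2.0, 0.5, 0.1])
# Made-up transfer matrix whose columns sum to 1, so q remains a distribution.
Q_T = np.array([[0.1, 0.8, 0.1],
                [0.8, 0.1, 0.1],
                [0.1, 0.1, 0.8]])
r_identity, _ = predict_industry(logits, np.eye(3))
r, q = predict_industry(logits, Q_T)
```

With the identity matrix the noisy prediction is returned unchanged; with the made-up matrix, most of the mass on noisy class 0 is moved to true class 1, so the predicted index changes.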
It will be readily appreciated by those skilled in the art that the foregoing is merely illustrative of the present invention and is not intended to limit the invention, but any modifications, equivalents, improvements or the like which fall within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A label noise learning method for taxpayer industry classification is characterized by comprising the following steps:
Firstly, the text information and non-text information in the taxpayer industry information are extracted, and text embedding and non-text encoding are performed with the XLNet text pre-training network and encoding techniques respectively, to obtain feature information; secondly, a TextCNN network for taxpayer industry classification is constructed, the number of layers, the convolution kernel shape and the input and output dimensions of each layer being determined from the feature information and the number of target classes, the XLNet text pre-training network and the TextCNN network are connected in series, and, with the noisy taxpayer industry label data as supervision, an end-to-end training device is constructed; thirdly, the conditional transfer matrix is estimated by an improved mixing proportion estimation method; finally, the network parameters of the training device are learned and the conditional transfer matrix is used as a linear layer after the TextCNN network, converting the noise-label prediction into a true taxpayer industry label prediction and carrying out the taxpayer industry classification;
the method specifically comprises the following steps:
1) Tax payer industry information processing
The taxpayer information processing comprises text information processing and non-text information processing; firstly, the taxpayer text information is segmented and embedded with the XLNet text pre-training network to form the corresponding word vectors, which are concatenated to generate the text features; secondly, the numerical features and category features in the taxpayer non-text information are preprocessed by standardization and one-hot encoding respectively, and linear network layers are then established to map them into non-text features whose dimension matches that of the text features; finally, the text features and non-text features are concatenated to form the feature information;
2) Tax payer industry classification network construction and training device initialization
A TextCNN network is constructed for taxpayer industry classification, the network comprising three layers, namely a convolution layer, a pooling layer and a fully connected layer; the number of layers of the TextCNN network, the shape of the convolution kernels and the input and output dimensions of each layer are determined in turn from the feature information obtained in step 1) and the number of target classes; the XLNet pre-training network is connected in series with the TextCNN network and, with the noisy taxpayer industry information labels as supervision, an end-to-end training device is constructed;
3) Conditional transition matrix estimation
Based on a kernel density estimation method, the probability density functions are estimated from the noisy taxpayer industry information data; the conditional transfer matrix estimation problem in the label noise learning problem is converted into a mixing proportion estimation problem, the corresponding mixing proportion coefficients are solved by an improved mixing proportion estimation method, and the conditional transfer matrix is thereby obtained; specific implementation details are as follows:
Step1: mixing ratio estimation problem construction
Let the noise label in the taxpayer registration information be $\tilde{Y}$ and the true label of the sample be Y; assuming that the sample X and the noise label $\tilde{Y}$ are independent of each other, for any class c∈C there is:
Denote $P_i = P(X \mid Y = i)$, with the analogous quantity defined for the noise label $\tilde{Y}$, where Q represents the conditional transition probability from the noise label to the true label; the above equation is expressed in matrix form as follows:
Further decomposing the matrix gives the following form, where H is a c×c matrix whose diagonal elements are 0 and G is a c×c real diagonal matrix;
According to the properties of matrix transformation, the matrix H, the matrix G and the matrix Q satisfy the following relationship:
$(I - H)^{-1} G = Q^T$
The matrix $Q^T$ is the conditional transfer matrix in label noise learning; the above relationship shows that once the matrix H is solved, the conditional transfer matrix can be solved in turn, and the decomposition of the matrix is equivalent to the following c equations:
the equation is further expressed as follows:
wherein the following are satisfied:
The standard mixing proportion estimation problem is expressed in the form $F = kH + (1 - k)G$ ($k \geq 0$), where F, H and G are probability distributions and samples drawn from the distributions F and H are assumed known; F is the mixture and H and G are the components; the equations obtained from the above matrix decomposition are exactly standard mixing proportion estimation problems, and the estimated mixing proportion coefficient $H_{ij}$ is an element of the matrix H; therefore, by solving a series of mixing proportion estimation problems, the H matrix can be solved and the conditional transfer matrix $Q^T$ then estimated from the matrix relationship, so that a risk-consistent classifier is constructed from the label-noise data and the taxpayer industry classification is carried out;
Step2: regeneration of the composition
Solving a mixing proportion estimation problem depends on anchor-point labeling; specifically, if anchor-point samples exist and are known, the maximum estimator of the mixing proportion coefficient is an unbiased estimate of the true mixing proportion coefficient k;
Specifically, the samples of the mixture F are first labeled as the positive class Y = 1 and the samples of the component H as the negative class Y = -1; an MLP network is built for binary prediction, with output $f_\eta(X)$, where X is the sample feature and η is the parameter of the network; the MLP network is trained with supervision on the noisy positive and negative samples; after training, the network predicts the posterior probability of the positive-class samples; a threshold τ is selected; the positive-class samples are fed to the network, and those whose predicted value is below the selected threshold form a subset of the positive sample set; the samples whose posterior probability is below the threshold are added to the negative sample set, giving the reconstructed positive and negative sample sets, thereby completing the regeneration of the component samples and removing the dependence of the traditional mixing proportion estimation method on anchor points;
Step3: Probability density estimation based on kernel density estimation
On the basis of the component samples reconstructed in Step2, the probability density function of the sample distribution is estimated by kernel density estimation; specifically, a kernel function is established to represent the contribution of an existing sample to the probability density estimate at any point of the feature space, where x is a point in the feature space and $x_i$ is a known sample; with μ the sample mean and $\Sigma = \rho^2 Q$ the covariance matrix of the samples, when a Gaussian kernel is used the contribution of sample $x_i$ to the probability density at x is given by a kernel function of the following form:
The probability density function estimator is defined over the entire sample set; using the positive and negative sample sets obtained, the probability density functions of the reconstructed positive and negative samples are estimated as follows:
Step4: conditional transition matrix estimation
The c mixing proportion estimation problems constructed in Step1 are solved in turn, each requiring its c-1 mixing proportion coefficients to be solved; let the noise label of the mixture be i and the noise label of the component be j; the original sample sets of the two classes are taken as the positive and negative sample sets of the mixing proportion estimation problem; new positive and negative sample sets are generated by the method of Step2, and their probability densities are estimated by the kernel density estimation method of Step3; the maximum estimator of the mixing proportion coefficient is then obtained by the maximum-value estimation method for the mixing proportion problem of Step1, subject to G being a legal probability density function; this estimator is the estimate of the element $H_{ij}$ ($i \neq j$); all elements of the H matrix are solved by repeating Step2, Step3 and Step4 in a loop, the G matrix is then obtained according to the following property, and the conditional transfer matrix $Q^T$ is obtained;
$(I - H)^{-1} G = Q^T$
4) Training device network parameter learning and tax payer industry classification
Based on the label noise data, the network parameters of the training device are learned, after training is completed, the estimated conditional transfer matrix is used as a linear conversion layer to be added after the training device, and conversion from noise label prediction to real label prediction is completed, so that tax payer industry classification is realized.
2. The method for learning label noise for taxpayer industry classification according to claim 1, wherein in step 1), taxpayer industry information processing specifically comprises the following steps:
Step1: taxpayer industry text information preprocessing
The text information of the taxpayer industry is extracted, and meaningless tokens such as special symbols, digits and measure words are deleted from it, completing the preprocessing of the taxpayer text information;
Step2: text word embedding based on XLNet pre-training network
The text is encoded with the XLNet pre-training network to generate word vectors; the XLNet pre-training model is based on the Transformer design and captures context in both directions, which alleviates the inconsistency between the pre-training stage and the fine-tuning stage caused by the mask mechanism of the BERT model, and it uses a two-stream self-attention mechanism, making pre-training more effective; the XLNet model applied to Chinese uses a 24-layer network structure and adopts SentencePiece for word segmentation; the text features obtained in Step1 are encoded with the Chinese version of XLNet, thereby obtaining the word vectors;
step3: tax payer industry text feature generation
Assuming that the taxpayer has k text features in total, that the XLNet pre-training network maps each token to a t-dimensional word vector, and that the i-th text feature has $h_i$ tokens, the i-th text feature is mapped to an $h_i \times t$ matrix; the feature matrices of all text features are concatenated, so that the text features of the sample are mapped to a matrix of shape $\left(\sum_{i=1}^{k} h_i\right) \times t$, generating the taxpayer text feature matrix;
Step4: Taxpayer industry numerical feature processing
The numerical features among the taxpayer's non-text features are standardized; assuming n training samples and m numerical features, the value of the j-th numerical feature of the i-th sample is denoted $X_{ij}$, the mean of the j-th numerical feature is $\mu_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$ and its standard deviation is $\sigma_j$; the standardized numerical feature is $(X_{ij} - \mu_j) / \sigma_j$
Step5: tax administration industry class feature processing
The category features among the taxpayer's non-text features are encoded; if a category feature has N possible values, it is encoded as an N-dimensional vector; specifically, the position corresponding to the value of the category feature is set to 1 and all other positions to 0, that is, one-hot encoding is used; after all category features have been encoded, each code is padded to the longest encoding length among the category features, and the padded vectors are concatenated to form the category feature matrix;
Step6: non-text feature generation for taxpayer industry
Step4 and Step5 yield m standardized numerical features and a category feature matrix of shape $v \times N_{max}$ respectively, where $N_{max}$ is the longest category encoding length; two linear network layers are then established for feature mapping: the first has shape 1×t and converts the standardized numerical features into an m×t numerical feature matrix, and the second has shape $N_{max} \times t$ and maps the category features into a v×t category feature matrix; the two mapped feature matrices are concatenated to obtain the non-text feature matrix, of final shape (v+m)×t;
Step7: taxpayer characteristic information generation
The text feature matrix generated in Step3 and the non-text feature matrix generated in Step6 are concatenated to generate a matrix of shape $\left(\sum_{i=1}^{k} h_i + v + m\right) \times t$ as the final feature information.
3. The method for learning label noise for taxpayer industry classification according to claim 2, wherein in step 2), the taxpayer industry classification network construction and training device initialization are as follows: a TextCNN network is built for text classification, the TextCNN network comprising three layers, namely (1) a convolution layer, (2) a max pooling layer and (3) a fully connected layer; the XLNet pre-training network of step 1) is connected in series with the TextCNN network to construct the training device, which is trained end to end with the taxpayer label-noise data as supervision; specific implementation details are as follows:
Step1: Taxpayer industry classification network construction
A TextCNN network is constructed for taxpayer industry classification, the TextCNN network comprising three layers, namely a convolution layer, a pooling layer and a fully connected layer; specifically, the TextCNN convolution layer performs convolution with kernels of shape n×t to extract row features; TextCNN uses a max pooling layer as the pooling layer, which extracts the maximum of the convolved feature map and thereby further compresses and extracts the features; a fully connected layer is then established: if the total number of classes in the taxpayer industry classification is c and the number of features after the max pooling layer is s, a fully connected layer of shape s×c is established to map the feature information to a c-dimensional vector, with which the taxpayer industry classification is carried out;
Step2: training device initialization
The XLNet text pre-training network of step 1) and the constructed TextCNN network are connected in series to form the training device; the taxpayer label-noise data are taken as input, the noise labels are predicted, an end-to-end device is formed for training, and the network parameters of the training device are initialized.
4. The method for learning label noise for taxpayer industry classification according to claim 3, wherein in Step2 of step 2), the network parameter is set to α, the sample to X, the noise label to $\tilde{Y}$ and the network parameter set to w, and the output of sample X under the mapping of the training device is denoted $g_\alpha(X)$; a cross-entropy loss is computed between $g_\alpha(X)$ and the noise label $\tilde{Y}$, a regularization term with control coefficient λ is added to prevent overfitting, and the loss function is minimized; the optimization objective is as follows:
5. The label noise learning method for taxpayer industry classification according to claim 4, wherein in step 4), the training device network parameter learning and the taxpayer industry classification proceed as follows:
Step1: training device learning based on label noise data
Let η denote the network parameters in the training device and w the network parameter set, given a sample X with its noisy label; with the label noise data as supervision, the network parameters in the training device are learned; the output of the sample X under the mapping of the training device is recorded as g_η(X), and the cross-entropy loss between g_η(X) and the noisy label is computed, with a regularization term added to prevent overfitting, where λ is the regularization control coefficient; the loss function is minimized, and the optimization objective is as follows:
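One way to write this objective is sketched below. It is a reconstruction under assumptions, not the patent's exact formula: the symbol \tilde{Y} for the noisy label, the sample count N, and the choice of a squared L2 norm for the regularization term are all assumed here, since the claim text names only g_η(X), w and λ.

```latex
\min_{\eta}\;
\frac{1}{N}\sum_{k=1}^{N}
\mathrm{CE}\bigl(g_{\eta}(X_k),\,\tilde{Y}_k\bigr)
\;+\;\lambda\,\lVert w\rVert_2^{2},
\qquad
\mathrm{CE}\bigl(g,\,\tilde{y}\bigr)
= -\log\frac{\exp(g_{\tilde{y}})}{\sum_{j=1}^{C}\exp(g_{j})}
```

Here g_j is the j-th component of the raw output and C is the number of classes; a larger λ penalizes large weights more strongly.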
Under this optimization objective, the training device network is used to predict the noisy label of an input sample; the output g_η(X) is passed through a softmax layer, which exponentially normalizes the raw output so that it can be interpreted as a predicted posterior probability; specifically, given the raw network output, softmax applies the exponential function to each component of the output vector and normalizes the result, producing output of the following form;
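The standard softmax maps a raw score vector z to probabilities exp(z_i) / Σ_j exp(z_j). A minimal pure-Python sketch follows; the name z and the example scores are illustrative assumptions, and subtracting the maximum is a common numerical-stability step that is not mentioned in the claim.

```python
import math

def softmax(z):
    """Exponentiate and normalize a list of raw scores into probabilities."""
    m = max(z)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]   # exponential operation
    total = sum(exps)
    return [e / total for e in exps]      # normalization: components sum to 1

raw_output = [2.0, 1.0, 0.1]              # hypothetical raw network scores
probs = softmax(raw_output)
```

The ordering of the scores is preserved, so the largest raw score also receives the largest probability.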
Step2: construction of the conditional transfer matrix layer
After the network parameters of the training device have been learned, the network output g_η(X) passes through the softmax operation to give the posterior probability of the noisy label for the sample; since the softmax layer predicts the noisy label, a conditional transfer layer is added after it as a transfer layer, converting the noisy-label prediction into a true-label prediction;
Step3: taxpayer industry classification
With the conditional transfer layer constructed, for a newly input sample X the output of the TextCNN network is q(X); the subscript r of the maximum component of q(X), r = argmax_i q_i(X), is computed, and this is the industry classification of the taxpayer.
6. The label noise learning method for taxpayer industry classification according to claim 5, wherein in Step2 of step 4), the specific method is as follows: let the noisy label and the true sample label Y be given, with C classes in total; assuming that the sample feature X and the noisy label are independent of each other, the following holds for any category:
The original network output g_η(X) is transformed by the conditional transfer matrix Q^T into a new output q(X) satisfying q(X) = Q^T g_η(X), where the new output q(X) is the posterior probability of the true label, and q_i(X) (i = 1, 2, ..., C) is the i-th component of q(X), representing the predicted probability P(Y = i | X) that the true label of X is class i.
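The conversion q(X) = Q^T g(X) followed by the argmax of Step3 can be sketched in a few lines. The 3-class matrix and the input vector are made-up values, and reading Q[j][i] as the probability of true class i given noisy class j is an assumption made for this illustration; the claim does not spell out how Q is estimated.

```python
def apply_transfer(Q, g):
    """Return q = Q^T g for a C x C matrix Q (list of rows) and a length-C vector g."""
    C = len(g)
    return [sum(Q[j][i] * g[j] for j in range(C)) for i in range(C)]

# Hypothetical 3-class example; each row of Q sums to 1.
# Assumed reading: Q[j][i] = P(true label = i | noisy label = j).
Q = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.9],
]
g = [0.5, 0.3, 0.2]  # softmax output: posterior over noisy labels

q = apply_transfer(Q, g)                      # posterior over true labels
r = max(range(len(q)), key=lambda i: q[i])    # index of the largest component
```

Because each row of Q sums to 1 in this example, q remains a valid probability vector, and r is the predicted industry class index.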
CN202210498954.4A 2022-05-09 2022-05-09 Tax payer industry classification-oriented label noise learning method Active CN114817546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210498954.4A CN114817546B (en) 2022-05-09 2022-05-09 Tax payer industry classification-oriented label noise learning method

Publications (2)

Publication Number Publication Date
CN114817546A (en) 2022-07-29
CN114817546B (en) 2024-09-10

Family

ID=82513012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210498954.4A Active CN114817546B (en) 2022-05-09 2022-05-09 Tax payer industry classification-oriented label noise learning method

Country Status (1)

Country Link
CN (1) CN114817546B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118506069A (en) * 2024-05-15 2024-08-16 云南联合视觉科技有限公司 Image classification method for label with noise situation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710768A (en) * 2019-01-10 2019-05-03 西安交通大学 A kind of taxpayer's industry two rank classification method based on MIMO recurrent neural network
CN110705607A (en) * 2019-09-12 2020-01-17 西安交通大学 Industry multi-label noise reduction method based on cyclic re-labeling self-service method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7739208B2 (en) * 2005-06-06 2010-06-15 Numenta, Inc. Trainable hierarchical memory system and method
CN110866536B (en) * 2019-09-25 2022-06-07 西安交通大学 Cross-regional enterprise tax evasion identification method based on PU learning
CN111428053B (en) * 2020-03-30 2023-10-20 西安交通大学 Construction method of tax field-oriented knowledge graph
CN112860895B (en) * 2021-02-23 2023-03-28 西安交通大学 Tax payer industry classification method based on multistage generation model
CN112765358B (en) * 2021-02-23 2023-04-07 西安交通大学 Taxpayer industry classification method based on noise label learning
CN113712511B (en) * 2021-09-03 2023-05-30 湖北理工学院 Stable mode discrimination method for brain imaging fusion characteristics

Also Published As

Publication number Publication date
CN114817546A (en) 2022-07-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant