CN114722189A

CN114722189A - Multi-label unbalanced text classification method in budget execution audit

Info

Publication number: CN114722189A
Application number: CN202111534284.9A
Authority: CN
Inventors: 伍之昂; 张璐; 方昌健
Original assignee: Guangdong Weishen Information Technology Co ltd; NANJING AUDIT UNIVERSITY
Current assignee: Guangdong Weishen Information Technology Co ltd; NANJING AUDIT UNIVERSITY
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2022-07-08
Anticipated expiration: 2041-12-15
Also published as: CN114722189B

Abstract

The invention discloses a multi-label unbalanced text classification method for budget execution audit, comprising the following steps: constructing a keyword library for the budget execution audit field and selecting seed words from it as label descriptions; segmenting the text with a word segmentation tool and the keyword library, and computing the embedding matrices of the labels and of the segmented words; building a neural network that computes the similarity matrices between words and labels and between phrases and labels (the labels being represented by their descriptions), deriving word weights from a newly constructed pooling layer, combining the weights with the word embedding matrix to obtain a sentence embedding matrix, and feeding the sentence embedding matrix to a classifier to obtain the prediction result; introducing unbalanced-data weights into the loss function and adding the label descriptions to the loss function to strengthen learning of small categories and of the labels; and training the model with the objective of minimizing the loss function, so that payment summary text data with unknown labels can be classified effectively. The invention effectively solves the multi-label unbalanced classification problem for payment voucher summary text in budget execution audit.

Description

Multi-label unbalanced text classification method in budget execution audit

Technical Field

The invention relates to the field of text classification, in particular to a multi-label unbalanced text classification method in budget execution audit.

Background

In fiscal budget execution audits, payment summaries must be classified in order to determine whether the use of funds is consistent with the budget items, to review payment compliance, and even to identify high-risk transactions. At present, much of this text classification work still depends on manual labeling by auditors, which is increasingly unable to cope with the explosive growth of audit data in a big-data environment. Although text classification has been studied for a long time, research and applications that specifically target payment summary text classification for budget execution audit remain scarce, and general-purpose text classification algorithms and tools are difficult to apply directly to such a highly specialized field. Text analysis in budget execution audit involves a large professional vocabulary, many budget subject categories, and unbalanced sample sizes; in addition, traditional text classification methods rely on an unsupervised sentence representation based on averaged word vectors, which cannot capture how strongly different words influence the classification model. To address these problems, the invention provides a multi-label unbalanced text classification method for budget execution audit that integrates sentence representation learning and multi-label unbalanced classification model training in a supervised manner, and is expected to classify payment-purpose summaries quickly and accurately and to improve the efficiency of audit work.

Disclosure of Invention

The purpose of the invention is as follows: the invention provides a method for classifying multi-label unbalanced texts in budget execution audit, which can solve the problem of classifying the multi-label unbalanced texts in the budget execution audit.

The technical scheme is as follows:

a multi-label unbalanced text classification method in budget execution audit comprises the following steps:

Step one: data preprocessing and word embedding training to obtain the input data of the model: labeled payment voucher summary text data is given, in which the number of samples differs between categories and the number of categories is K; a keyword library for budget execution audit, i.e. the proper nouns of the field, is constructed from the given text, and representative seed words are selected from the keyword library as the description of each label; the text is segmented using the keyword library and a word segmentation tool, word embedding vectors are pre-trained on the full audit text data, and a word embedding matrix E_i = [e_i1, …, e_iL]^T is obtained, where i is the index of the sentence, l is the index of a word in the sentence, and L is the length of the sentence; the seed words are mapped to their word embeddings, and the word embeddings of the seed words of each category are averaged to obtain the embedding matrix of all labels, L = [l₁, …, l_K]^T;

Step two: model construction, building the classification framework for multi-label unbalanced text: first, a similarity matrix is computed from the words in the sentence and the labels; next, a neural network computes the similarity between context information, i.e. phrases, and the labels, where two sets of parameters, W₁ and b₁, need to be trained; then a newly constructed pooling layer is used to obtain the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and after training a suitable sentence embedding matrix, i.e. one that incorporates domain knowledge, is obtained, with the formula as follows:

Z_i = f₁(E_i, L)  (1)

where Z_i is the embedding matrix of the i-th sentence, and f₁ is the mapping function that takes E_i and L as input and produces Z_i as output;

the sentence embedding matrix is then used as the input of a classifier that classifies the sentence, where two sets of parameters, W₂ and b₂, need to be trained; the formula is as follows:

ŷ_i = f₂(Z_i)  (2)

where ŷ_i is the predicted category probability output for sentence Z_i, and f₂ is the mapping function that takes Z_i as input and produces ŷ_i as output;

Step three: constructing an objective function that unifies sentence embedding and unbalanced multi-class classification to guide neural network training: a cross-entropy loss function is used as the base objective function; weight data is introduced so that the loss function is biased toward small categories, strengthening the classifier's training on them; finally, the label embedding matrix is introduced into the loss function to strengthen learning of the labels, and the model is trained with the objective of minimizing the resulting unbalanced objective function; after training, payment summary text data with unknown labels can be classified effectively;

Further, in step two, the model is constructed and the classification framework for multi-label unbalanced text is built: first, a similarity matrix is computed from the words in the sentence and the labels; next, a neural network computes the similarity between context information, i.e. phrases, and the labels, where two sets of parameters, W₁ and b₁, need to be trained; then a newly constructed pooling layer is used to obtain the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and after training a suitable sentence embedding matrix, i.e. one that incorporates domain knowledge, is obtained;

Specifically: in the first stage, a similarity matrix is first computed, with the formula as follows:

The similarity matrix G_i has size L × K, where ‖·‖ denotes the L₂ norm.
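The formula itself is not reproduced in this text. One reading consistent with the L × K shape and the L₂ norm is a cosine similarity between every word vector and every label vector. The NumPy sketch below illustrates that reading only; the function name `similarity_matrix`, the `eps` guard against division by zero, and the toy vectors are assumptions made for illustration and do not come from the original disclosure.

```python
import numpy as np

def similarity_matrix(E, Lab, eps=1e-12):
    """Cosine similarity between words and labels (assumed form of G_i).

    E:   (L, d) word embeddings of one sentence
    Lab: (K, d) label embeddings
    Returns G of shape (L, K).
    """
    word_norms = np.linalg.norm(E, axis=1, keepdims=True)     # (L, 1) L2 norms
    label_norms = np.linalg.norm(Lab, axis=1, keepdims=True)  # (K, 1) L2 norms
    return (E @ Lab.T) / (word_norms @ label_norms.T + eps)

# Toy example: 3 words, 2 labels, embedding dimension 2
E = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Lab = np.array([[1.0, 0.0], [0.0, 1.0]])
G = similarity_matrix(E, Lab)
```

Each row of `G` describes how close one word is to every category label, which is the quantity the later phrase-level step operates on.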

Then, the similarity between the phrases containing context semantics in the sentence and the tags is calculated, and the formula is as follows:

c_i = ReLU(G_{i,j-p:j+p} W₁^T + b₁), 1 ≤ j ≤ L  (4)

where j is the index of the word at the center of the phrase, j-p and j+p are the indices of the leftmost and rightmost words of the phrase, and W₁ and b₁ are two sets of neural network parameters that are updated iteratively during training;
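As an illustration of this phrase-level step, the sketch below slides a window of 2p+1 words over the rows of G_i and applies a ReLU. The parameter shapes are assumptions, since the text does not state them: W₁ is taken to be a vector of length 2p+1 that mixes the positions in the window, b₁ a vector of length K, and sentence boundaries are zero-padded. The function name `phrase_label_similarity` is hypothetical.

```python
import numpy as np

def phrase_label_similarity(G, W1, b1):
    """Phrase-to-label similarity over a sliding window (assumed shapes).

    G:  (L, K) word-to-label similarity matrix
    W1: (2p+1,) weights over window positions
    b1: (K,) bias
    Returns C of shape (L, K); row j corresponds to the phrase centered on word j.
    """
    L, K = G.shape
    p = (len(W1) - 1) // 2
    G_pad = np.pad(G, ((p, p), (0, 0)))       # zero rows so edge words get a full window
    C = np.empty((L, K))
    for j in range(L):
        window = G_pad[j:j + 2 * p + 1]       # rows j-p .. j+p of the original G
        C[j] = np.maximum(W1 @ window + b1, 0.0)  # ReLU
    return C

G = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
W1 = np.array([0.25, 0.5, 0.25])              # p = 1
b1 = np.zeros(2)
C = phrase_label_similarity(G, W1, b1)
```

In training, `W1` and `b1` would be learned by backpropagation rather than fixed as they are here.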

then, the weight matrix of the words is computed:

where c_jk is the similarity between the phrase centered on the j-th word and the k-th category label;

β_j is then normalized, with the formula as follows:

where exp denotes the exponential function with base e, and β_j′ is the similarity value of the j′-th word in the sentence;

finally, an embedding matrix of the sentence is obtained, and the formula is as follows:

the above process is expressed as equation (1) as a whole;
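The pooling, normalization, and weighting steps can be sketched as follows. The text does not specify what the newly constructed pooling layer computes, so the sketch assumes a max-pooling over the K labels as a stand-in; the normalization is a softmax over the words of the sentence (consistent with the exp-based definition above), and the sentence embedding is the weighted sum of the word vectors. The name `sentence_embedding` is hypothetical.

```python
import numpy as np

def sentence_embedding(C, E):
    """Turn phrase-label similarities into word weights and a sentence embedding.

    C: (L, K) phrase-to-label similarities
    E: (L, d) word embeddings
    Returns (z, w): sentence embedding of shape (d,) and word weights of shape (L,).
    """
    beta = C.max(axis=1)             # ASSUMPTION: max-pooling over the K labels
    w = np.exp(beta - beta.max())    # subtract the max for numerical stability
    w = w / w.sum()                  # softmax over the L words
    z = w @ E                        # weighted sum of the word vectors
    return z, w

C = np.array([[1.0, 0.0], [0.0, 1.0]])
E = np.array([[2.0, 0.0], [0.0, 4.0]])
z, w = sentence_embedding(C, E)
```

Words whose surrounding phrase is close to some category label receive larger weights, so they dominate the sentence embedding instead of being averaged away.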

in the second stage, a three-layer fully connected neural network classifier is built; the sentence embedding matrix Z_i is input into the classifier, which is trained to produce an effective prediction output ŷ_i. The overall process is expressed as formula (2);
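A forward pass through a three-layer fully connected classifier might look like the sketch below. The hidden width, the ReLU activations, the softmax output, and the random initialization are all assumptions; the text only states that the classifier has three fully connected layers with trainable parameters W₂ and b₂, which are represented here as per-layer lists.

```python
import numpy as np

def init_classifier(d, hidden, K, seed=0):
    """Create weights for three fully connected layers: d -> hidden -> hidden -> K."""
    rng = np.random.default_rng(seed)
    sizes = [d, hidden, hidden, K]
    W2 = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    b2 = [np.zeros(n) for n in sizes[1:]]
    return W2, b2

def classify(Z, W2, b2):
    """Return category probabilities for sentence embedding(s) Z."""
    h = Z
    for W, b in zip(W2[:-1], b2[:-1]):
        h = np.maximum(h @ W + b, 0.0)                    # hidden layers with ReLU (assumed)
    logits = h @ W2[-1] + b2[-1]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)      # softmax output (assumed)

W2, b2 = init_classifier(d=4, hidden=8, K=3)
probs = classify(np.ones(4), W2, b2)
```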

Further, in step three, an objective function that unifies sentence embedding and unbalanced multi-class classification is constructed to guide neural network training. Finally, the label embedding matrix is introduced into the loss function to strengthen learning of the labels, and the model is trained with the objective of minimizing the resulting unbalanced objective function. After training, payment summary text data with unknown labels can be classified effectively;

Specifically: first, the inverse weight of each category is computed, with the formula as follows:

where c(·) is the number of samples in a category, mean(·) denotes the median, y_k is the label vector of category k, category k′ is the category whose sample count is the median of the sample counts of all categories, and y_k′ is the label vector of category k′;

the inverse weights are then smoothed to obtain the final weight vector, with the formula as follows:

where S(·) denotes the sigmoid function, r_k is the inverse weight of category k, and r_k′ is the inverse weight of category k′;
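Because the two formulas are missing from this text, the following is only one plausible reading of the definitions above: the inverse weight is taken as the median class count divided by the class count, and the smoothed weight as the ratio S(r_k) / S(r_k′), so that the median class receives weight 1. Both forms are assumptions. The counts reuse the support column of the results table later in this document purely as illustrative numbers.

```python
import numpy as np

def class_weights(counts):
    """Smoothed inverse-frequency weights (assumed form, see lead-in).

    counts: number of samples per category, shape (K,)
    Returns sigma of shape (K,); rarer categories get larger weights.
    """
    counts = np.asarray(counts, dtype=float)
    # ASSUMPTION: r_k = c(y_k') / c(y_k), with k' the median-sized category.
    # With an even number of categories np.median averages the two middle counts.
    r = np.median(counts) / counts
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    r_ref = 1.0                         # r_k' of the median category itself
    # ASSUMPTION: smoothing as a ratio of sigmoids, which bounds the weights
    return sigmoid(r) / sigmoid(r_ref)

counts = [17573, 11075, 3955, 1983, 826, 719, 691, 519, 189]
sigma = class_weights(counts)
```

The sigmoid keeps the weights within a narrow band, so a very rare category is emphasized without its raw inverse frequency overwhelming the loss.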

then, a weight vector is introduced to construct a loss function, and the formula is as follows:

where N is the total number of sentences in the data set and CE(·) is the cross-entropy loss function; the composite notation for f means that the function f can be decomposed into two parts, f₁ and f₂, with the output of f₁ serving as the input of f₂; y_i is the actual label matrix of the i-th sentence; σ is the weight vector and σ^T is its transpose; y_ik is the value of the k-th label of the i-th sentence, which is 1 at the position of the actual label and 0 elsewhere; and ŷ_ik is the predicted probability of the k-th label of the i-th sentence;
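A class-weighted cross-entropy consistent with these definitions can be sketched as follows. The exact placement of the weight vector in the original formula is not recoverable from this text; the sketch assumes each label's log-loss term is scaled by its class weight and the result is averaged over the N sentences. The `eps` guard and the toy inputs are illustrative assumptions.

```python
import numpy as np

def weighted_cross_entropy(Y, Y_hat, sigma, eps=1e-12):
    """Class-weighted cross-entropy averaged over sentences (assumed form).

    Y:     (N, K) actual labels, 1 at actual label positions and 0 elsewhere
    Y_hat: (N, K) predicted probabilities
    sigma: (K,) class weight vector
    """
    per_sentence = -(Y * np.log(Y_hat + eps)) @ sigma   # weight each label's term
    return per_sentence.mean()

Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_hat = np.array([[0.5, 0.5], [0.5, 0.5]])
sigma = np.array([1.0, 2.0])        # the second (smaller) class counts double
loss = weighted_cross_entropy(Y, Y_hat, sigma)
```

With these inputs a misclassified sentence from the second class contributes twice as much to the loss as one from the first, which is how the loss is biased toward small categories.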

in order to improve the importance of the label in training, a special label loss function is added, and the formula is as follows:

where k is the index of the category, α is the penalty coefficient, and y_k is the category label matrix;

finally, the model is trained based on the Adam algorithm with the objective of minimizing equation (11).
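The label loss formula is also missing here. One reading that matches the definitions (a category index k, a penalty coefficient α, and category labels y_k) is to pass each label embedding through the classifier f₂ and penalize it when it is not assigned to its own category, then add this term, scaled by α, to the weighted cross-entropy. The sketch below follows that assumed reading; `f2_sharp` and `f2_uniform` are hypothetical stand-in classifiers, and the Adam update itself is omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def label_loss(label_emb, f2, eps=1e-12):
    """ASSUMED form: cross-entropy between f2(l_k) and the one-hot label of class k."""
    K = label_emb.shape[0]
    probs = f2(label_emb)                                # (K, K) class probabilities
    return -np.log(probs[np.arange(K), np.arange(K)] + eps).mean()

def total_loss(data_loss, label_emb, f2, alpha):
    """Weighted classification loss plus alpha times the label loss."""
    return data_loss + alpha * label_loss(label_emb, f2)

label_emb = np.eye(3)                                    # toy label embeddings, K = 3
f2_sharp = lambda X: softmax(10.0 * X)                   # confidently maps l_k to class k
f2_uniform = lambda X: np.full((X.shape[0], X.shape[0]), 1.0 / X.shape[0])

good = label_loss(label_emb, f2_sharp)
bad = label_loss(label_emb, f2_uniform)
objective = total_loss(0.5, label_emb, f2_uniform, alpha=0.1)
```

In an actual implementation this objective would be minimized with an Adam optimizer from a deep learning framework, with gradients flowing into W₁, b₁, W₂, and b₂.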

Beneficial effects: the invention effectively solves the multi-label unbalanced classification problem for payment voucher summary text in budget execution audit; by introducing label similarity computation, it significantly improves recall on small categories as well as overall performance, and greatly improves auditors' efficiency in checking budget execution compliance and identifying high-risk transactions.

Drawings

Fig. 1 is a flowchart of an unbalanced text classification method for the audit field according to an embodiment of the present invention.

Fig. 2 is a diagram illustrating a neural network framework according to a first embodiment of the present invention.

FIG. 3 is a schematic diagram of a model training process according to an embodiment of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings. Fig. 1 is a diagram illustrating an unbalanced text classification method for the audit field according to an embodiment of the present invention. As shown in fig. 1, the present embodiment includes the following steps:

Step one: data preprocessing and word embedding training are carried out to obtain the input data of the model; labeled payment voucher summary text data is given, in which the number of samples differs between categories and the number of categories is K; a keyword library for budget execution audit, i.e. the proper nouns of the field, is constructed from the given text, and representative seed words are selected from the keyword library as the description of each label; the text is segmented using the keyword library and a word segmentation tool, word embedding vectors are pre-trained on the full audit text data, and a word embedding matrix E_i = [e_i1, …, e_iL]^T is obtained, where i is the index of the sentence and l is the index of a word in the sentence; the seed words are mapped to their word embeddings, and the word embeddings of the seed words of each category are averaged to obtain the embedding matrix of all labels, L = [l₁, …, l_K]^T;

Step two: model construction, building the classification framework for multi-label unbalanced text; first, as shown in Fig. 2, a similarity matrix is computed from the words in the sentence and the labels; next, a neural network computes the similarity between context information, i.e. phrases, and the labels, where two sets of parameters, W₁ and b₁, need to be trained; then a newly constructed pooling layer is used to obtain the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and after training a suitable sentence embedding matrix, i.e. one that incorporates domain knowledge, is obtained, with the formula as follows:

Z_i = f₁(E_i, L)  (1)

finally, the sentence embedding matrix is used as the input of a classifier that classifies the sentence, where two sets of parameters, W₂ and b₂, need to be trained; the formula is as follows:

ŷ_i = f₂(Z_i)  (2)

where ŷ_i is the predicted category probability matrix for sentence Z_i, and f₂ is the mapping function that takes Z_i as input and produces ŷ_i as output;

Step three: an objective function that unifies sentence embedding and unbalanced multi-class classification is constructed to guide neural network training. Finally, the label embedding matrix is introduced into the loss function to strengthen learning of the labels, and the model is trained with the objective of minimizing the resulting unbalanced objective function. After training, payment summary text data with unknown labels can be classified effectively;

In a specific embodiment, the multi-label unbalanced text classification method in budget execution audit is described in detail:

first, based on the existing budget execution audit text data, sentences are segmented using the word segmentation tool LAC (Lexical Analysis of Chinese), the word frequencies within each category are counted, and the keyword library and seed words for the budget execution audit field are constructed from the segmentation results and a collected professional domain lexicon:

the keyword library and the seed words for the budget execution audit field are shown in the following table:

the word segmentation results obtained with LAC, using the budget execution audit field keyword library and conventional stop words, are shown in the following table:

Serial number	Sentence	Word segmentation result
1	Lodging fee for Shenzhen specialist attending zhushai following project	Lodging fee for Shenzhen specialist attending zhushai following project

The seed words are represented using CBOW (Continuous Bag of Words) to obtain the embedding matrix corresponding to each label. Taking the travelling fee category as an example, the embedding matrices of the seed words and of the label are shown in the following tables:

the seed word embeddings of the travelling fee category are averaged to obtain the embedding matrix of the label, as shown in the following table:
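The averaging step can be illustrated with a few lines of NumPy. The seed words and two-dimensional vectors below are invented for the example and are not the values from the embodiment (whose tables are not reproduced in this text); the function name `label_embeddings` is likewise hypothetical. In practice the vectors would come from the CBOW model trained on the full audit text.

```python
import numpy as np

def label_embeddings(word_vectors, seed_words):
    """Average the seed-word vectors of each category to get its label embedding.

    word_vectors: dict mapping word -> (d,) vector
    seed_words:   dict mapping label -> list of seed words
    Returns (labels, L_mat) where L_mat has shape (K, d).
    """
    labels = list(seed_words)
    L_mat = np.stack([
        np.mean([word_vectors[w] for w in seed_words[label]], axis=0)
        for label in labels
    ])
    return labels, L_mat

# Hypothetical toy vocabulary and seed words
word_vectors = {
    "hotel": np.array([1.0, 0.0]),
    "flight": np.array([0.0, 1.0]),
    "wage": np.array([2.0, 2.0]),
}
seed_words = {"travelling fee": ["hotel", "flight"], "salary": ["wage"]}
labels, L_mat = label_embeddings(word_vectors, seed_words)
```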

then, the word segmentation results are represented using CBOW to obtain the embedding matrix corresponding to each word, as shown in the following table:

the data is split proportionally into a training set and a test set; the training set is input into the model for training, and the training process is shown in Fig. 3.

After training is finished, the test set is input into the trained model and β_j is computed; applying β_j as the weight on the word embedding matrix yields the sentence embedding matrix, as shown in the following table:

the final prediction result obtained after the sentence embedding matrix is input into the classifier is shown in the following table:

the overall prediction results are shown in the following table:

Category	Precision	Recall	F1-score	Support
Five insurances and one housing fund	0.965	0.971	0.968	17573
Salary and subsidy of personnel	0.905	0.907	0.906	11075
Office expenses	0.931	0.905	0.918	3955
Property management fee	0.874	0.873	0.874	1983
Cost of infrastructure	0.896	0.791	0.840	826
Travelling fee	0.780	0.751	0.765	719
Special procurement	0.697	0.685	0.677	691
Official expenses	0.645	0.690	0.667	519
Others	0.500	0.757	0.602	189
Macro Avg	0.799	0.811	0.805	37530
Weighted Avg	0.922	0.921	0.921	37530
Big Avg	0.911	0.867	0.888	15856
Small Avg	0.743	0.783	0.759	21674

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A multi-label unbalanced text classification method in budget execution audit is characterized by comprising the following steps:

Step one: data preprocessing and word embedding training are carried out to obtain the input data of the model: labeled payment voucher summary text data is given, in which the number of samples differs between categories and the number of categories is K; a keyword library for budget execution audit, i.e. the proper nouns of the field, is constructed from the given text, and representative seed words are selected from the keyword library as the description of each label; the text is segmented using the keyword library and a word segmentation tool, word embedding vectors are pre-trained on the full audit text data, and a word embedding matrix E_i = [e_i1, …, e_iL]^T is obtained, where i is the index of the sentence, l is the index of a word in the sentence, and L is the length of the sentence; the seed words are mapped to their word embeddings, and the word embeddings of the seed words of each category are averaged to obtain the embedding matrix of all labels, L = [l₁, …, l_K]^T;

Step two: model construction, building the classification framework for multi-label unbalanced text; first, a similarity matrix is computed from the words in the sentence and the labels; next, a neural network computes the similarity between context information, i.e. phrases, and the labels, where two sets of parameters, W₁ and b₁, need to be trained; then a newly constructed pooling layer is used to obtain the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and after training a suitable sentence embedding matrix, i.e. one that incorporates domain knowledge, is obtained, with the formula as follows:

Z_i = f₁(E_i, L)  (1)

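the sentence embedding matrix is then used as the input of a classifier that classifies the sentence, where two sets of parameters, W₂ and b₂, need to be trained; the formula is as follows:

ŷ_i = f₂(Z_i)  (2)

where ŷ_i is the predicted category probability output for sentence Z_i, and f₂ is the mapping function that takes Z_i as input and produces ŷ_i as output;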

Step three: constructing an objective function that unifies sentence embedding and unbalanced multi-class classification to guide neural network training; a cross-entropy loss function is used as the base objective function; weight data is introduced so that the loss function is biased toward small categories, strengthening the classifier's training on them; finally, the label embedding matrix is introduced into the loss function to strengthen learning of the labels, and the model is trained with the objective of minimizing the resulting unbalanced objective function; after training, payment summary text data with unknown labels is classified effectively.

2. The multi-label unbalanced text classification method in budget execution audit according to claim 1, characterized in that, in step two, the model is constructed and the classification framework for multi-label unbalanced text is built: first, a similarity matrix is computed from the words in the sentence and the labels; next, a neural network computes the similarity between context information, i.e. phrases, and the labels, where two sets of parameters, W₁ and b₁, need to be trained; then a newly constructed pooling layer is used to obtain the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and after training a suitable sentence embedding matrix, i.e. one that incorporates domain knowledge, is obtained;

c_i = ReLU(G_{i,j-p:j+p} W₁^T + b₁), 1 ≤ j ≤ L  (4)

then, the weight matrix of the words is computed:

where c_jk is the similarity between the phrase corresponding to the j-th word and the k-th category label;

the above process is expressed as equation (1) as a whole;

in the second stage, a three-layer fully connected neural network classifier is built; the sentence embedding matrix Z_i is input into the classifier, which is trained to produce an effective prediction output ŷ_i. The overall process is expressed as formula (2).

3. The multi-label unbalanced text classification method in budget execution audit according to claim 1, characterized in that, in step three, an objective function that unifies sentence embedding and unbalanced multi-class classification is constructed to guide neural network training; a cross-entropy loss function is used as the base objective function; weight data is introduced so that the loss function is biased toward small categories, strengthening the classifier's training on them; finally, the label embedding matrix is introduced into the loss function to strengthen learning of the labels, and the model is trained with the objective of minimizing the resulting unbalanced objective function; after training, payment summary text data with unknown labels is classified effectively;

where N is the total number of sentences in the data set and CE(·) is the cross-entropy loss function; the composite notation for f means that the function f can be decomposed into two parts, f₁ and f₂, with the output of f₁ serving as the input of f₂; y_i is the actual label matrix of the i-th sentence; σ is the weight vector and σ^T is its transpose; y_ik is the value of the k-th label of the i-th sentence, which is 1 at the position of the actual label and 0 elsewhere; and ŷ_ik is the predicted probability of the k-th label of the i-th sentence;

in order to improve the importance of the label in training, a label loss function is added, and the formula is as follows:

finally, the model is trained based on the Adam algorithm with the objective of minimizing formula (11).