Cross-language news topic similarity comparison method based on parallel corpus
Technical Field
The patent provides a cross-language news topic similarity comparison method based on a parallel corpus. The method can automatically screen foreign-language articles on a specific topic without translation. It is characterized in that a bilingual parallel corpus is provided, a bilingual LDA topic model is built on the basis of the LDA topic model, parallelized processing is realized by means of a parallel computing framework, and automatic screening of multilingual news event reports can thus be carried out quickly and efficiently. The method relates to the fields of corpora, word frequency analysis, similarity calculation and the like.
Background
How to rapidly complete the automatic comparison of the topic similarity of cross-language news events without manual translation, and thereby automatically screen multilingual news event reports on the same topic, reduce the cost of direct manual translation, and grasp foreign-media news public opinion in a timely and accurate manner, is a problem that urgently needs to be solved.
Machine translation has made great improvements and advances in recent years. Machine translation is the process of converting one natural language into another using a computer. After statistical methods were applied to machine translation, the accuracy of machine translation results gradually improved. However, in practical applications under current technical conditions, a combination of machine translation and manual translation is generally used: an article is first translated into another language by machine translation, and a complete and accurate article is then obtained after manual revision and proofreading. The manual translation process increases the consumption of labor and time, reduces efficiency, and increases cost. This makes traditional machine translation appear untimely and inaccurate when dealing with the translation of a vast number of multilingual news reports. Therefore, this patent provides a corpus-based similarity comparison method for screening news articles, used as a preprocessing step for news topic identification and screening before machine translation, with the aims of reducing the noise of irrelevant reports, saving manpower and material resources, and improving efficiency.
In order to solve these problems, this patent provides a cross-language news topic similarity comparison method based on a parallel corpus that requires no translation. A corpus is a large-scale electronic text library: with the electronic computer as carrier, language material that has actually appeared in practical use is scientifically sampled and reasonably analyzed and processed to obtain a genuinely useful electronic text library. A parallel corpus is a set of texts in which each text is accompanied by one or more translations into other languages; the simplest parallel corpus contains two language texts, an original text and a translated text. Based on a parallel corpus, the word segmentation rules and word frequency distribution rules of the two languages are obtained through analysis. First, news reports about a certain topic event are selected in Chinese, and a topic model of these topic articles is generated with the LDA algorithm based on the Chinese general corpus; then the foreign-language articles to be screened are selected, and text features are extracted based on the foreign-language general corpus; finally, the text features obtained from the foreign-language news articles are compared with the topic model of the topic articles obtained from Chinese, and if they are similar, the foreign-language article is judged to be a foreign-language article about that topic.
On the basis of the LDA topic model, the method is expanded into a bilingual LDA topic model. Different from the traditional LDA topic model, in which each document has an independent topic distribution, the bilingual LDA model uses a bilingual parallel corpus to share the topic distribution, with the different languages describing the same topics. In addition, because the parallel corpus is written in different languages, the word frequency distributions may differ. The method establishes a bilingual LDA topic model in a generalized space on the basis of a bilingual parallel corpus; when a new corpus arrives, a new LDA model is generated and compared with the bilingual LDA model to judge the topic classification of the new corpus.
In the method, Gibbs sampling is used to sample the distribution; however, considering the case of massive training samples, and in order to improve the generation efficiency of the LDA model, a parallel LDA algorithm is realized by means of the parallel computing framework Spark. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to achieve parallelization: a timestamp feature is added and the sampling process is carried out in a distributed environment, which improves the efficiency and accuracy of the whole process.
Disclosure of Invention
The invention provides a method for screening foreign-language articles about a specific topic event based on a parallel corpus. The premise is that a general parallel corpus of Chinese and the F language is provided. Assume that the target news topic is topic T and the foreign language is F. Without translation, F-language articles about topic T are screened out from a large number of F-language articles of unknown topic:
1) Each pair of documents in the parallel corpus shares a topic distribution: the two languages describe the same topics, rather than each document having an independent topic distribution. First, an article set about topic T is retrieved in Chinese, and the Chinese LDA topic model of this article set is obtained with the LDA topic model algorithm based on the Chinese general corpus in the parallel corpus;
2) The Chinese topic-T LDA topic model is then mapped to a generalized topic model space to obtain a topic-T LDA topic model shared by Chinese and the F language; an F-language LDA topic model is obtained with the LDA algorithm from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus;
3) The LDA topic model in the generalized space is compared with the F-language LDA topic model, and if the two are similar, the article to be screened is considered an article about topic T.
A cross-language news topic similarity comparison method based on a parallel corpus is characterized in that the method comprises the following three steps.
The premise is that a general parallel corpus of Chinese and the F language is provided; assuming that the target news topic is topic T and the foreign language is F, and without translation, the articles about topic T in the F language are screened out from the F-language articles of unknown topic:
(1) Each pair of documents in the parallel corpus shares a topic distribution: the two languages describe the same topics, rather than each document having an independent topic distribution. First, an article set about topic T is retrieved in Chinese, and the Chinese LDA topic model of this article set is obtained with the LDA topic model algorithm based on the Chinese general corpus in the parallel corpus;
(2) The Chinese topic-T LDA topic model is then mapped to a generalized topic model space to obtain a topic-T LDA topic model shared by Chinese and the F language; an F-language LDA topic model is obtained with the LDA algorithm from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus;
(3) The LDA topic model in the generalized space is compared with the F-language LDA topic model, and if the two are similar, the article to be screened is considered an article about topic T.
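As an illustration of step (1) above, the following minimal sketch trains a Chinese LDA topic model over a handful of pre-segmented topic-T articles. The use of the open-source gensim library and the toy token lists are assumptions made for illustration only, and the grounding in the Chinese general corpus of the parallel corpus is omitted; the patent itself relies on the parallel LDA implementation described in the detailed description below.

# Minimal sketch of step (1): train a Chinese LDA topic model from topic-T articles.
# gensim and the toy token lists below are illustrative assumptions, not part of the patent.
from gensim import corpora
from gensim.models import LdaModel

# Pre-segmented (word-segmented, stop-word-filtered) Chinese articles about topic T.
t_topic_articles = [
    ["地震", "救援", "灾区", "伤亡"],
    ["灾区", "重建", "救援", "物资"],
    ["地震", "余震", "伤亡", "统计"],
]

dictionary = corpora.Dictionary(t_topic_articles)               # vocabulary of the Chinese corpus
bows = [dictionary.doc2bow(doc) for doc in t_topic_articles]    # bag-of-words representation

# Chinese LDA topic model of the topic-T article set.
lda_zh = LdaModel(corpus=bows, id2word=dictionary, num_topics=2, alpha="auto", passes=20)

# Topic probability distribution of one article (the "document-topic" distribution).
print(lda_zh.get_document_topics(bows[0], minimum_probability=0.0))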
The same topic in different languages contains similar semantic information, so texts in different languages can be represented in the same generalized topic space. After the training data of the labeled language is processed, i.e. after the topic model of the Chinese news articles on the topic is generated, it is mapped into the generalized topic space; the generalized topic classes are used to generate the topic model of foreign-language news articles of unknown topic, which is compared with the topic model in the generalized topic space to obtain the topic result. The steps are as follows:
For each topic k:
(1) sample the word probability distribution of the generalized Chinese corpus, $\varphi_k^C \sim Dirichlet(\beta^C)$, and the word probability distribution of the generalized corpus of the foreign language F, $\varphi_k^F \sim Dirichlet(\beta^F)$;
(2) for the mth pair of Chinese and F-language documents in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim Dirichlet(\alpha)$;
① for each of the $N_m^C$ terms of the Chinese document, select an implicit topic $z^C \sim Multinomial(\theta_m)$ and generate a term $\omega^C \sim Multinomial(\varphi_{z^C}^C)$;
② for each of the $N_m^F$ terms of the foreign-language document, select an implicit topic $z^F \sim Multinomial(\theta_m)$ and generate a term $\omega^F \sim Multinomial(\varphi_{z^F}^F)$.
Wherein C represents Chinese and F represents the foreign language; $\theta_m$ represents the topic probability distribution of the mth pair of bilingual parallel documents; $\varphi_k^C$ and $\varphi_k^F$ respectively represent the vocabulary distribution probability of topic $z_k$ in Chinese and in the foreign language; $z^C$ and $z^F$ respectively represent the implicit topics of the nth terms of the source language and target language of the mth pair of bilingual parallel documents; $\omega^C$ and $\omega^F$ respectively represent the nth terms of the Chinese and foreign-language documents of the mth pair; M represents the total number of document pairs; $N_m^C$ and $N_m^F$ respectively represent the total numbers of terms of the Chinese and foreign-language documents of the mth pair; $\theta_m$ obeys a Dirichlet distribution with prior parameter $\alpha$ and is used to generate topics; $\varphi^C$ and $\varphi^F$ obey Dirichlet distributions with prior parameters $\beta^C$ and $\beta^F$ and are used to generate terms. Here $\alpha$, $\beta^C$ and $\beta^F$ are maximum likelihood estimators representing the "document-topic" probability distribution and the "topic-term" probability distributions respectively; the probability of the whole corpus is selected as the optimization objective function, and $\alpha$, $\beta^C$ and $\beta^F$ are obtained by maximizing this objective function, which yields the LDA model.
Parameter estimation of the LDA topic model, in which the parameters to be estimated are the document-topic distributions and the topic-word distributions, is realized by the following steps:
Suppose a news corpus contains m documents and the number of threads is p. The algorithm first initializes the sampling parameters; the training corpus is then divided equally into p data sets, with highly similar samples assigned to the same data set. Next the distribution is sampled with the Gibbs sampling algorithm, each thread performing the Gibbs sampling process in parallel on its assigned data set. Since the terms w are observed known data and only the topic z in the distribution is a latent variable, only the distribution P(z | w) needs to be sampled. By means of parameter estimation of the Dirichlet posterior distribution under the Bayesian framework, and assuming the observed word $\omega_i = t$, the parameters are estimated, from which the Gibbs sampling formula of the LDA model is obtained, in which:
$\hat{\theta}_{m,k}$ is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of the kth topic of the mth pair of bilingual parallel documents, and $\alpha_k$ is its prior parameter, representing the probability distribution of the kth topic; $z_{m,n} = k$ denotes that the nth word of the mth document belongs to topic k; $\hat{\varphi}_{k,t}$ is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of term t under topic k, and $\beta_t$ is its prior, representing the word distribution of the topic. $\alpha_k$ and $\beta_t$ are maximum likelihood estimators: the probability of the whole corpus is selected as the optimization objective function, and the values of $\alpha_k$ and $\beta_t$ are obtained by maximizing this objective function.
The sampling probability in the formula corresponds to P(topic | document) · P(word | topic), i.e. the path probability of document → topic → term; since the K topics are selected manually, the Gibbs sampling formula samples among these K paths.
After all threads finish sampling, the corresponding document-topic matrix and topic-word matrix are obtained; the global parameters are calculated by averaging all local parameters and distributed to each parallel LDA instance as the initial sampling parameters of the next iteration, after which sampling is performed again. This process is repeated until the parameters converge.
In order to ensure that the topics corresponding to each data set are independent of each other, this patent adopts a classification strategy based on time elements. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each iteration of sampling, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This loss is mainly caused by the randomness of the initial blocking algorithm: the whole data set is divided evenly into p parts without considering the correlation among the documents in each part, so that the topics within each data set tend toward the average.
The text similarity measurement method is based on the TF-IDF algorithm. TF-IDF is a statistical method used to evaluate the importance of a word to one of the documents in a document set or corpus. First the TF-IDF vector of each document is calculated; then a time scale is attached to each document according to its publication time, and the similarity of two documents is computed on the basis of the time scale as

$$s_{a,b} = \frac{1}{|t_a - t_b| + 1} \cdot \frac{W_a \cdot W_b}{\|W_a\| \, \|W_b\|}$$

wherein $t_a$ and $t_b$ represent the time scales of document a and document b, the addition of one to the denominator of the left factor prevents the denominator from being 0, and $W_a \cdot W_b$ is the vector product of the TF-IDF word vectors of document a and document b.
The effects and benefits of this patent:
1. the bilingual LDA topic model can automatically screen out articles with a certain specific topic without translation;
2. the method is used as a preprocessing step for news topic recognition and screening before machine translation, achieving the purposes of reducing the noise of irrelevant reports, saving manpower and material resources and improving efficiency;
3. when data is processed, parallelization is realized by means of a parallel computing framework, so that screening can be carried out quickly and efficiently;
4. on the basis of a parallel LDA algorithm, the method can improve the precision of a final result by partitioning data according to a classification strategy of time elements.
Drawings
FIG. 1 Description of the foreign-language article screening method
FIG. 2 Processing of the Chinese articles of the topic news
FIG. 3 Dictionary-based word segmentation process
FIG. 4 Graphical model of the bilingual LDA model
FIG. 5 Flow chart of the parallel LDA algorithm
FIG. 6 Probability computation paths
FIG. 7 Algorithm parameter list
FIG. 8 Initialization algorithm pseudo code
FIG. 9 Partitioning pseudo code
FIG. 10 Gibbs sampling pseudo code
FIG. 11 F-language LDA topic model
The technical scheme of the invention is described in detail in the following with reference to the accompanying drawings of the invention.
1. FIG. 2 shows the process of processing the Chinese articles of the topic news. First, a target topic T is selected, and a number of Chinese articles about topic T are searched for and screened from Chinese portal websites as the Chinese corpus; the Chinese corpus is then preprocessed by word segmentation and stop-word removal, which comprises the following two steps:
(1) Word segmentation. Since this patent is based on a corpus, it is sufficient to adopt a dictionary-based word segmentation method using the corpus. The character strings of the Chinese news articles are compared with the words in the word segmentation dictionary according to the strategy specified by the corpus; if a character string to be segmented is found in the dictionary, the match is successful. The word segmentation process is shown in FIG. 3.
(2) Stop-word removal. Stop words are words that appear frequently but carry little textual information.
2. A bilingual LDA model of topic T is generated, on the basis of the LDA model, from the topic-T Chinese corpus and the Chinese general corpus. This patent assumes that the same topic in different languages contains similar semantic information, so that texts in different languages can be represented in the same generalized topic space. After the training data of the labeled language is processed, i.e. after the topic models of the Chinese news articles are generated, they can be mapped into the generalized topic space; the generalized topic classes are used to generate the topic models of foreign-language news articles of unknown topic, which are compared with the topic models in the generalized topic space to obtain the topic result. The steps are as follows:
FIG. 4 is the graphical model representation of the bilingual LDA topic model, which illustrates the process of generating a document pair with the bilingual LDA topic model. C represents Chinese and F represents the foreign language; $\theta_m$ represents the topic probability distribution of the mth pair of bilingual parallel documents; $\varphi_k^C$ and $\varphi_k^F$ respectively represent the vocabulary distribution probability of topic $z_k$ in Chinese and in the foreign language; $z^C$ and $z^F$ respectively represent the implicit topics of the nth terms of the source language and target language of the mth pair of bilingual parallel documents; $\omega^C$ and $\omega^F$ respectively represent the nth terms of the Chinese and foreign-language documents of the mth pair; M represents the total number of document pairs; $N_m^C$ and $N_m^F$ respectively represent the total numbers of terms of the Chinese and foreign-language documents of the mth pair; $\theta_m$ obeys a Dirichlet distribution, and $\alpha$ is the Dirichlet prior parameter of the multinomial distribution of topics under each document; $\varphi^C$ and $\varphi^F$ obey Dirichlet distributions, and $\beta^C$ and $\beta^F$ are their prior parameters, used to generate terms.
$\alpha$ reflects the relative relation between the implicit topics in the document set, namely the "document-topic" probability distribution, and $\beta$ reflects the "topic-feature word" probability distribution. The probability of the whole corpus is selected as the optimization objective function, and $\alpha$, $\beta^C$ and $\beta^F$ are obtained by maximizing this objective function, which yields the LDA model. Assuming that there are M documents in the corpus, all terms and their corresponding topics are represented as

$$\mathbf{w} = (\omega_1, \ldots, \omega_M), \qquad \mathbf{z} = (z_1, \ldots, z_M)$$

wherein $\omega_m$ represents the words in the mth document, $z_m$ denotes the topic numbers corresponding to these words, and $n_{m,k}$ indicates the number of words generated by the kth topic in the mth document.
2.1 Generation process of the bilingual LDA model. For each topic k:
(1) sample the word probability distribution of the generalized Chinese corpus, $\varphi_k^C \sim Dirichlet(\beta^C)$, and the word probability distribution of the generalized corpus of the foreign language F, $\varphi_k^F \sim Dirichlet(\beta^F)$;
(2) for the mth pair of Chinese and F-language documents in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim Dirichlet(\alpha)$;
① for each of the $N_m^C$ terms of the Chinese document, select an implicit topic $z^C \sim Multinomial(\theta_m)$ and generate a term $\omega^C \sim Multinomial(\varphi_{z^C}^C)$;
② for each of the $N_m^F$ terms of the foreign-language document, select an implicit topic $z^F \sim Multinomial(\theta_m)$ and generate a term $\omega^F \sim Multinomial(\varphi_{z^F}^F)$.
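The generation process above can be illustrated with a small sampling sketch; the vocabulary sizes, document counts and hyperparameter values below are arbitrary illustrative assumptions.

# Sketch of the bilingual LDA generation process for one corpus of M parallel document pairs.
# All sizes and hyperparameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, M = 4, 3                  # number of topics, number of document pairs
V_C, V_F = 50, 60            # vocabulary sizes of Chinese and of the foreign language F
N_C, N_F = 20, 25            # words per Chinese / F-language document (fixed here for simplicity)
alpha = np.full(K, 0.1)      # prior of the document-topic distribution theta_m
beta_C = np.full(V_C, 0.01)  # prior of the Chinese topic-word distributions
beta_F = np.full(V_F, 0.01)  # prior of the F-language topic-word distributions

# (1) For each topic k, sample the word distributions of the two generalized corpora.
phi_C = rng.dirichlet(beta_C, size=K)   # K x V_C
phi_F = rng.dirichlet(beta_F, size=K)   # K x V_F

docs = []
for m in range(M):
    # (2) One topic distribution theta_m is shared by the m-th Chinese/F-language document pair.
    theta_m = rng.dirichlet(alpha)
    # ① Chinese terms: draw a latent topic, then a term from that topic's Chinese word distribution.
    z_C = rng.choice(K, size=N_C, p=theta_m)
    w_C = np.array([rng.choice(V_C, p=phi_C[k]) for k in z_C])
    # ② F-language terms: the same shared theta_m, but the F-language word distributions.
    z_F = rng.choice(K, size=N_F, p=theta_m)
    w_F = np.array([rng.choice(V_F, p=phi_F[k]) for k in z_F])
    docs.append((w_C, w_F))

print("first document pair:", docs[0][0][:5], docs[0][1][:5])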
From this generation process, the generation probability of the topics in the whole corpus and the generation probability of the terms in the whole corpus are obtained, and combining the two formulas gives the joint probability of the whole corpus.
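In the standard collapsed derivation of LDA, written with the Dirichlet normalizing constant $\Delta(\cdot)$ and the count vectors $\mathbf{n}_m = (n_{m,1}, \ldots, n_{m,K})$ and $\mathbf{n}_k = (n_{k,1}, \ldots, n_{k,V})$, these quantities can be sketched as follows; the exact expressions used in the patent may differ in notation:

$$p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)}, \qquad p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)}$$

$$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)} \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)}$$

In the bilingual model the term factor $p(\mathbf{w} \mid \mathbf{z}, \beta)$ appears once for the Chinese side with $\beta^C$ and once for the F-language side with $\beta^F$, since the two sides share only $\theta_m$.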
2.2 Gibbs sampling, an MCMC method, is a common way to derive an LDA topic model. This patent uses Gibbs sampling to sample the above distribution; however, considering the case of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is realized here by means of a parallel computing framework. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to achieve parallelization. The specific implementation process is as follows.
Suppose a news corpus contains m documents and the number of threads is p. The algorithm first initializes the sampling parameters; the training corpus is then divided equally into p data sets, with highly similar samples assigned to the same data set (the similarity measurement method is described in detail later). The distribution is then sampled with the Gibbs sampling algorithm, each thread performing the Gibbs sampling process in parallel on its assigned data set. Since the words w are observed known data and only the topic z in this distribution is a latent variable, only the distribution P(z | w) needs to be sampled. By means of parameter estimation of the Dirichlet posterior distribution under the Bayesian framework, and assuming the observed word $\omega_i = t$, the parameters are estimated. Thus, the Gibbs sampling formula of the LDA model is obtained:
wherein $\hat{\theta}_{m,k}$ is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of the kth topic of the mth pair of bilingual parallel documents, and $\alpha_k$ is its prior parameter, representing the probability distribution of the kth topic; $z_{m,n} = k$ denotes that the nth word of the mth document belongs to topic k; $\hat{\varphi}_{k,t}$ is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of term t under topic k, and $\beta_t$ is its prior, representing the word distribution of the topic. $\alpha_k$ and $\beta_t$ are maximum likelihood estimators: the probability of the whole corpus is selected as the optimization objective function, and the values of $\alpha_k$ and $\beta_t$ are obtained by maximizing this objective function.
After all threads finish sampling, the corresponding document-topic matrix and topic-word matrix are obtained; the global parameters are calculated by averaging all local parameters, distributed to each parallel LDA instance as the initial sampling parameters of the next iteration, and sampling is performed again. This process is repeated until the parameters converge; the flow chart of the whole process is shown in FIG. 5.
2.3 Classification strategy based on time elements. On the basis of the parallel LDA algorithm, this patent makes some improvements to the data blocking part in order to improve the precision of the final result. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each iteration of sampling, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This loss is mainly caused by the randomness of the initial blocking algorithm: the whole data set is divided evenly into p parts without considering the correlation among the documents in each part, so that the topics within each data set tend toward the average.
Observation of the parallel LDA algorithm and comparison with the traditional serial LDA algorithm show that the topic-word matrix of the parallel algorithm cannot update the Gibbs sampling parameters in time, so a certain deviation appears in the calculation of the document-topic probabilities. The algorithm therefore makes the topics corresponding to each data set independent of, and uninfluenced by, the other data sets, which reduces this deviation.
For text similarity calculation, this patent introduces a classification strategy based on time elements on top of the TF-IDF algorithm. TF-IDF is a statistical method for evaluating the importance of a word to one of the documents in a document set or corpus: the importance of a word is proportional to the number of times it occurs in the document, but inversely proportional to its frequency of occurrence in the corpus as a whole.
TF-IDF consists of two parts: term frequency (TF) and inverse document frequency (IDF). TF is the number of times a given word appears in a document; this value is normalized by the number of words in the document to prevent it from being biased toward long texts.
$tf_{i,j}$ represents the word frequency (TF) value of the ith word in the jth document of the corpus:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

wherein $n_{i,j}$ represents the number of occurrences of the word in document j, and the denominator $\sum_{k} n_{k,j}$ represents the total number of occurrences of all words in document j; the formula therefore expresses the proportion of document j occupied by word i.
The inverse document frequency $idf_i$ measures the importance of a word to the whole corpus:

$$idf_i = \log \frac{D}{D(w_i) + 1}$$

wherein D represents the total number of documents in the corpus and $D(w_i)$ represents the number of documents containing word i (its minimum value is 1); one is added to the denominator to prevent it from being 0. The formula therefore expresses how rare word i is across the whole corpus.
With TF and IDF, the TF-IDF value is their product:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$
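A minimal sketch of this TF-IDF computation, following the formulas above; the toy documents are illustrative assumptions.

# Minimal sketch of the TF-IDF computation described above; the toy documents are assumptions.
import math
from collections import Counter

docs = [["quake", "rescue", "rescue", "victims"],
        ["rescue", "supplies", "rebuild"],
        ["market", "policy", "reform"]]

D = len(docs)
doc_freq = Counter()                      # D(w_i): number of documents containing word i
for doc in docs:
    doc_freq.update(set(doc))

def tf_idf(doc):
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j};  idf_i = log(D / (D(w_i) + 1))
    counts = Counter(doc)
    total = sum(counts.values())
    return {w: (n / total) * math.log(D / (doc_freq[w] + 1)) for w, n in counts.items()}

vectors = [tf_idf(doc) for doc in docs]
print(vectors[0])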
the document similarity is calculated according to TF-IDF values, the document similarity is expressed by using cosine values of vectors, the document is expressed as a vector W formed by TF-IDF forming all words of the document, in order to embody time elements, the life cycle of a topic discussed by news is assumed to be one month, then a time scale t is added to the news according to the time information of each document in a corpus, for example, the time scale t of the ith news published in the earliest one monthiThe similarity is weighted according to the time attribute, news documents published in similar life cycles are higher in similarity, and conversely, the longer the publication time interval of the two documents is, the lower the similarity is, so that the similarity s of the document a and the document b isa,bExpressed as:
wherein, t
aAnd t
bTime scales representing documents a and b, left formulaThe addition of one to the denominator also prevents the denominator from being 0,
representing the vector product of the word TF-IDF vectors for document a and document b.
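A minimal sketch of this time-weighted similarity; the TF-IDF vectors (assumed to be aligned to a common word list), the time scales and the exact placement of the time factor are illustrative assumptions.

# Minimal sketch of the time-weighted document similarity s_{a,b} described above.
# The TF-IDF vectors and time scales below are illustrative assumptions.
import numpy as np

def similarity(w_a, w_b, t_a, t_b):
    # s_{a,b} = cos(W_a, W_b) / (|t_a - t_b| + 1): cosine of the TF-IDF vectors,
    # down-weighted as the publication-time interval between the two documents grows.
    cos = float(np.dot(w_a, w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))
    return cos / (abs(t_a - t_b) + 1)

w_a = np.array([0.12, 0.00, 0.31, 0.07])   # TF-IDF vector of document a over a common word list
w_b = np.array([0.10, 0.05, 0.28, 0.00])   # TF-IDF vector of document b
print(similarity(w_a, w_b, t_a=1, t_b=1))  # same life cycle (month): similarity not discounted
print(similarity(w_a, w_b, t_a=1, t_b=4))  # three months apart: similarity reduced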
After the similarity matrix among the documents has been calculated in this way, the whole corpus is divided into p parts according to the similarity among the documents, with documents of higher similarity assigned to the same data set as far as possible.
2.4 Implementation of the parallel LDA algorithm. On the basis of the partitioned data sets described above, the implementation process of the parallel LDA algorithm and the corresponding pseudo code are given next. The algorithm implementation comprises three parts: parameter initialization, data set partitioning, and Gibbs sampling.
Parameter name: parameter description
X: the corpus
$X_i$: a partitioned small data set
P: the number of data blocks
K: the number of topics
M: the number of documents
V: the number of words in the word list
$N_m$: the number of words in the mth document
N: the number of words in all documents
$n_m$: count array, the total number of words contained in the mth document of X
$n_k$: count array, the total number of words contained in the kth topic of X
$n_{m,k}$: count matrix, the number of words belonging to topic k in the mth document of X
$n_{k,t}$: count matrix, the number of words among all words of X belonging to topic k
$n_{k,t}^{(i)}$: count matrix, the number of words among all words of $X_i$ belonging to topic k
$n_{m,k}^{(i)}$: count matrix, the number of words belonging to topic k in the mth document of $X_i$
$z_{m,n}$: the topic of the nth word in the mth document
$w_{m,n}$: the nth word in the mth document
$t_a$, $t_b$: the time scales of documents a and b
Sim(i, j): the similarity of the ith document and the jth document
The table above (FIG. 7) describes the parameters of the parallel LDA algorithm; each part of the algorithm is described in detail below.
The first part is the initialization of the parameters: a topic is randomly assigned to each word $w_{m,n}$ in the documents, and the count matrices $n_{m,k}$ and $n_{k,t}$ and the count arrays $n_m$ and $n_k$ are then updated. The pseudo code of this step is shown in FIG. 8.
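A minimal Python sketch of this initialization step (the pseudo code itself is in FIG. 8); the toy corpus of word ids and the sizes below are illustrative assumptions.

# Sketch of the initialization step: each word w_{m,n} gets a random topic, and the
# count matrices n_{m,k}, n_{k,t} and count arrays n_m, n_k are updated accordingly.
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 8                                        # number of topics, vocabulary size
docs = [[0, 2, 2, 5], [1, 3, 5, 7, 7], [4, 6, 0]]  # M documents, each a list of word ids
M = len(docs)

n_mk = np.zeros((M, K), dtype=int)   # words of topic k in document m
n_kt = np.zeros((K, V), dtype=int)   # words of term t assigned to topic k
n_m = np.zeros(M, dtype=int)         # total words in document m
n_k = np.zeros(K, dtype=int)         # total words assigned to topic k
z = []                               # topic assignment z_{m,n} of every word

for m, doc in enumerate(docs):
    z_m = []
    for w in doc:
        k = int(rng.integers(K))     # randomly assign a topic to word w_{m,n}
        z_m.append(k)
        n_mk[m, k] += 1
        n_kt[k, w] += 1
        n_m[m] += 1
        n_k[k] += 1
    z.append(z_m)

print(n_mk, n_k, sep="\n")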
The second part is data set partitioning. The partitioning principle has been described in detail in section 2.3; its pseudo code is shown in FIG. 9.
Finally, the Gibbs sampling process is divided into two parts: sampling and updating. Each data set $X_i$ in the partition is sampled. In this process, the count matrices among the sampling parameters, $n_{m,k}$ and $n_{k,t}$, are initialized first; to avoid access conflicts between different data sets, these matrices are stored separately under each local thread, giving the local count matrices $n_{m,k}^{(i)}$ and $n_{k,t}^{(i)}$, so that only the local count matrices need to be updated during sampling. When one iteration is completed, all threads update the contents of their count matrices into the global count matrix $n_{k,t}$, and this matrix is used as the initial count matrix of the next iteration. The iteration process is repeated until Gibbs sampling over the whole data set converges; the iteration then ends, and the count matrices $n_{m,k}$ and $n_{k,t}$ and the count arrays $n_m$ and $n_k$ are used to calculate the word probability distribution under each topic and the topic probability distribution of each document. The pseudo code of the specific implementation is shown in FIG. 10.
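A minimal sketch of one training run of this parallel sampling scheme is given below. The blocks are processed in an ordinary Python loop; in the patent they would be distributed, for example as tasks of the Spark framework. The toy corpus, sizes and hyperparameter values are illustrative assumptions, and the merge step sums the per-block count changes (a standard AD-LDA-style merge) rather than literally averaging the local parameters, so that the merged counts stay consistent with the topic assignments; this is an implementation assumption, not the patent's exact procedure.

# Sketch of the parallel Gibbs sampling step: each block samples on local copies of the
# topic-word counts; the local changes are then merged into the global counts used to
# start the next iteration. Toy data; the merge differs from the patent's averaging (see above).
import numpy as np

rng = np.random.default_rng(0)
K, V, P = 3, 8, 2                    # topics, vocabulary size, number of blocks
alpha, beta = 0.1, 0.01
docs = [[0, 2, 2, 5], [1, 3, 5, 7, 7], [4, 6, 0], [2, 4, 4, 6]]
M = len(docs)

# Initialization (random topics and global counts), as in the sketch under FIG. 8.
n_mk = np.zeros((M, K)); n_kt = np.zeros((K, V)); n_k = np.zeros(K)
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
for m, doc in enumerate(docs):
    for w, k in zip(doc, z[m]):
        n_mk[m, k] += 1; n_kt[k, w] += 1; n_k[k] += 1

blocks = np.array_split(np.arange(M), P)   # split by document id here; the patent groups
                                           # documents by the time-weighted similarity of 2.3
for _ in range(20):
    local_kt = [n_kt.copy() for _ in range(P)]   # local copies of the topic-word counts
    local_k = [n_k.copy() for _ in range(P)]
    for i, block in enumerate(blocks):           # each block = one thread / one distributed task
        for m in block:
            for n, w in enumerate(docs[m]):
                k_old = z[m][n]                  # remove the word's current assignment
                n_mk[m, k_old] -= 1; local_kt[i][k_old, w] -= 1; local_k[i][k_old] -= 1
                prob = (n_mk[m] + alpha) * (local_kt[i][:, w] + beta) / (local_k[i] + V * beta)
                k_new = int(rng.choice(K, p=prob / prob.sum()))   # collapsed Gibbs draw
                z[m][n] = k_new
                n_mk[m, k_new] += 1; local_kt[i][k_new, w] += 1; local_k[i][k_new] += 1
    n_kt = n_kt + sum(kt - n_kt for kt in local_kt)   # merge the per-block changes
    n_k = n_k + sum(k - n_k for k in local_k)

theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)   # document-topic matrix
phi = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)       # topic-word matrix
print(np.round(theta, 2))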
2.5 Model training. The LDA model is trained on the corpus according to the Gibbs sampling formula obtained above. The training process obtains (z, w) samples from the corpus through Gibbs sampling, and all parameters of the model can then be estimated from the finally sampled values. The training process is as follows:
(1) random initialization: a topic number z is randomly assigned to each word w in each news document of the Chinese corpus;
(2) the corpus is rescanned: the topic of each word w is resampled according to the Gibbs sampling formula, and the corresponding counts in the corpus are updated;
(3) the resampling of the corpus is repeated until Gibbs sampling converges;
(4) the topic-term co-occurrence frequency matrix of the corpus is counted; this matrix is the LDA model.
3. FIG. 6 shows the processing of an article of unknown topic in the F language. The F-language LDA topic model of the article is obtained with the same LDA topic model generation method.
4. Similarity comparison.
The similarity of the topic-T Chinese LDA model and the LDA topic model of the F-language article of unknown topic is compared with the KL distance method:

$$D_{KL}(p, q) = \sum_{j} p_j \log \frac{p_j}{q_j}$$

For all j, when $p_j = q_j$, $D_{KL}(p, q) = 0$. Considering the asymmetry of the KL distance formula, there is a symmetrised KL distance formula:

$$D_{\lambda}(p, q) = \lambda D_{KL}(p, \lambda p + (1 - \lambda) q) + (1 - \lambda) D_{KL}(q, \lambda p + (1 - \lambda) q)$$

When $\lambda = 0.5$, the KL distance formula is converted into the JS distance formula:

$$D_{JS}(p, q) = \frac{1}{2} D_{KL}\!\left(p, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q, \frac{p + q}{2}\right)$$

The JS distance formula is selected as the similarity measurement standard.
5. Evaluation criterion. This patent adopts the F measure as the evaluation criterion. The F measure is the balanced combination of the precision and recall indicators used in information retrieval, expressed with the following formulas:

$$P = N_a / N_b$$
$$R = N_a / N_c$$

wherein $N_a$ indicates the total number of correctly identified individuals, $N_b$ the total number of identified individuals, and $N_c$ the total number of individuals present in the test set; P represents the precision and R the recall.
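A minimal sketch of the evaluation computation; the counts are illustrative assumptions, and the F value is given in the standard balanced form F = 2PR / (P + R), since the patent's own F expression is not reproduced above.

# Sketch of the evaluation criterion: precision P, recall R and the F measure.
def f_measure(n_a, n_b, n_c):
    # P = N_a / N_b, R = N_a / N_c, and the standard balanced F measure F = 2PR / (P + R).
    p = n_a / n_b
    r = n_a / n_c
    return p, r, 2 * p * r / (p + r)

print(f_measure(n_a=80, n_b=100, n_c=120))   # e.g. 80 correct out of 100 identified, 120 in the test set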