CN104156350B - Text semantic meaning extraction method based on thin division MapReduce - Google Patents
- Publication number: CN104156350B (application number CN201410379847.5A)
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The present invention relates to a text semantic extraction method based on thin-division MapReduce. The method includes: dually partitioning the text set to be processed along the document dimension and the vocabulary dimension, so that each partition holds part of the content of part of the texts; requesting a number of Mappers and training each partition of the text set separately with the SparseLDA algorithm for the LDA topic model, obtaining local parameters, assigning different labels to different parameters and recording the Reducer corresponding to each; requesting a number of Reducers, Reducers of different types merging the local parameters with different labels into global parameters that are written to files; and repeating this Mapper-and-Reducer process until a convergence condition is reached, yielding the final trained model, which is used for semantic interpretation and representation of new texts.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a text semantic extraction method based on thin-division MapReduce.
Background art
Semantic understanding of text is currently a popular research topic. Digital information on the internet is growing exponentially, spanning web pages, social-network news, books, pictures, audio, video, microblogs and scientific papers, and information presented in document form is growing especially fast. How to effectively organize, manage and summarize these texts and mine the knowledge hidden in them is a major challenge facing computer science today. In addition, search-related network applications all need an efficient semantic-understanding module that captures the user's intent so as to serve the user better. For example, Baidu's search engine must find the texts most relevant to the user's query, and Taobao's search must return the products that best match the user.
Topic models are a class of unsupervised learning algorithms: they need no manual labeling and thus save human effort. The most mature topic model at present is the latent Dirichlet allocation (LDA) algorithm, which assumes that a document is a probability distribution over multiple topics and that a topic is a probability distribution over the vocabulary. The LDA algorithm learns a topic model from a data set in order to predict the topic distribution of new documents. As documents multiply, the topics they contain also multiply, and the vocabulary keeps growing as well. To better interpret the topics contained in the data, we need a stable, practical processing method capable of handling high-dimensional big data.
Parallelization is a direct way to handle high-dimensional big data, but existing parallel LDA algorithms lack stability and scalability and cannot achieve high speed-up ratios with more processors. We choose MapReduce as the parallel foundation, analyze its scalability bottleneck and propose an improved method that strengthens the scalability and practicality of the algorithm.
In view of the above defects, the inventors have actively pursued research and innovation to create an efficient, semantically compressed parallel storage and processing method for text big data with greater industrial value.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a text semantic extraction method based on thin-division MapReduce that is highly scalable and can handle big-data, high-dimensional text sets.
The text semantic extraction method based on thin-division MapReduce of the present invention includes:
partitioning the text set to be processed along two dimensions, the document dimension and the word dimension;
processing the partitioned documents and words through repeated MapReduce passes until a convergence condition is reached, obtaining a trained model;
performing semantic interpretation and representation of texts based on the trained model.
Specifically, the method includes:
partitioning the text set to be processed along two dimensions, the document dimension and the word dimension;
performing the Map phase on the partitioned documents and words, training the data with a predetermined LDA topic model to obtain a number of local parameters, and assigning different labels to different local parameters;
recording the Reducer corresponding to each differently labeled local parameter, and performing the Reduce phase on the local parameters to obtain global parameters;
repeating the above process until a convergence condition is reached, obtaining the trained model;
performing semantic interpretation and representation of texts based on the trained model.
Further, the local parameters comprise four kinds: the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set;
the Reducers corresponding to the four parameters are Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer respectively.
Further, each kind of Reducer sums its corresponding incoming data and writes the result to a file in a predetermined format.
Further, different local parameters correspond to different Reducers.
By the above scheme, the present invention has at least the following advantages:
In the text semantic extraction method based on thin-division MapReduce of the present invention, the memory used during execution can be as little as 1/M of that of existing algorithms, where M can be set by the user. The low memory consumption means the method can build larger topic models, scaling up in either the number of texts or the number of topics. As for speed, existing MapReduce-based LDA models are all based on variational Bayes, whereas the present invention uses SparseLDA, a fast and accurate approximate-inference algorithm for LDA, so it achieves a clear speed-up without loss of precision.
Brief description of the drawings
Fig. 1 is a schematic diagram of the text semantic extraction method based on thin-division MapReduce of the present invention;
Fig. 2 is a diagram of the specific 2*3 text-partitioning principle of the method;
Fig. 3 is a diagram of experimental comparison results of the method;
Fig. 4 is a further diagram of experimental comparison results of the method;
Fig. 5 is a diagram verifying the scalability of the method.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the invention but do not limit its scope.
(1) The LDA model:
The LDA model is a three-layer Bayesian model. The size of the model's input data set is denoted D*W, where D is the total number of documents and W is the vocabulary size. The LDA model factorizes the D*W matrix into a D*K matrix and a K*W matrix, denoted θ_{D×K}, the document-topic distribution, and φ_{K×W}, the topic-word distribution; the number of topics K can be set. Several algorithms exist for inferring the LDA model; the most practical and widely used is Gibbs sampling (GS). The present invention uses SparseLDA, a speed-optimized GS algorithm. The main idea of GS is to compute, for each word w of every document d, a distribution of size K, then sample a topic k from it and assign it to the corresponding entries of θ_{D×K} and φ_{K×W}.
SparseLDA rewrites the probability formula (2) that the original GS uses to infer the LDA model as formula (1), removing repeated computation steps and accelerating model training:

$$P(z=t\mid w)\;\propto\;\frac{\alpha_t\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{n_{t\mid d}\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{(\alpha_t+n_{t\mid d})\,n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(1)$$

$$P(z=t\mid w)\;\propto\;(\alpha_t+n_{t\mid d})\,\frac{\beta+n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(2)$$

where α_t and β are Dirichlet hyperparameters, V is the vocabulary size, n_{t|d} is the count of topic t in document d, n_{w|t} is the count of word w under topic t, and n_{·|t} is the total count of topic t.
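To make the decomposition in formula (1) concrete, the following Python sketch shows how a SparseLDA-style sampler draws a topic by splitting the sampling mass into a smoothing bucket s, a document bucket r and a word bucket q, iterating only over nonzero counts for r and q. The function name and arguments are illustrative assumptions, and the buckets are recomputed on every call for clarity (a production sampler caches s and r incrementally); this is a sketch of the technique, not the patent's implementation.

```python
import random

def sparse_lda_sample(alpha, beta, V, n_topic, n_dt, n_wt):
    """Draw a topic for one word w of one document d, SparseLDA-style.

    alpha   : list of K hyperparameters alpha_t
    beta    : scalar hyperparameter
    V       : vocabulary size
    n_topic : n_topic[t] = n_{.|t}, total count of topic t
    n_dt    : sparse dict {t: n_{t|d}} for document d
    n_wt    : sparse dict {t: n_{w|t}} for word w
    """
    K = len(alpha)
    denom = [beta * V + n_topic[t] for t in range(K)]

    # s-bucket: smoothing mass (dense, but cacheable across words)
    s = sum(alpha[t] * beta / denom[t] for t in range(K))
    # r-bucket: document mass, only topics with n_{t|d} > 0
    r = sum(n * beta / denom[t] for t, n in n_dt.items())
    # q-bucket: word mass, only topics with n_{w|t} > 0
    q = sum((alpha[t] + n_dt.get(t, 0)) * n / denom[t] for t, n in n_wt.items())

    u = random.uniform(0.0, s + r + q)
    if u < q:  # most draws land here, touching only the word's topics
        for t, n in n_wt.items():
            u -= (alpha[t] + n_dt.get(t, 0)) * n / denom[t]
            if u <= 0.0:
                return t
    elif u < q + r:
        u -= q
        for t, n in n_dt.items():
            u -= n * beta / denom[t]
            if u <= 0.0:
                return t
    else:
        u -= q + r
        for t in range(K):
            u -= alpha[t] * beta / denom[t]
            if u <= 0.0:
                return t
    return K - 1  # numerical safety fallback
```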
(2) Explanation of the thin-division MapReduce method used in the present invention:
Tasks must be distributed to the Map processes, so the data must be partitioned. Prior-art methods partition only the documents. The present invention partitions both the documents and the vocabulary: as shown in Fig. 1, the data set is divided into N*M blocks. Ideally there are N*M Map processes, but equipment limits may leave fewer than N*M Maps. The Maps in the same row deliver their output to one Doc-Reducer, so there are N Doc-Reducers in total; the Maps in the same column deliver their output to one Wordstats-Reducer, so there are M Wordstats-Reducers in total. The Likelihood-Reducer in Fig. 1 computes the maximum-likelihood value over all Maps, which can serve as the model's convergence condition and can also evaluate model quality. Fig. 2 illustrates the 2*3 text-partitioning case and the specific MapReduce partitioning process; a minimal partitioning sketch follows.
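A minimal sketch of the two-dimensional division. The round-robin assignment of documents to rows and the equal-width vocabulary ranges for columns are illustrative choices, since the text does not fix a particular block-assignment rule:

```python
def thin_division(docs, vocab_size, N, M):
    """Split a corpus into N*M blocks along both dimensions.

    docs : list of documents, each a list of word ids in [0, vocab_size)
    Returns blocks[n][m] = list of (doc_id, [word ids]) pairs.
    """
    blocks = [[[] for _ in range(M)] for _ in range(N)]
    for doc_id, words in enumerate(docs):
        n = doc_id % N  # document-dimension block (row)
        per_col = [[] for _ in range(M)]
        for w in words:
            per_col[w * M // vocab_size].append(w)  # word-dimension block (column)
        for m in range(M):
            if per_col[m]:
                blocks[n][m].append((doc_id, per_col[m]))
    return blocks
```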
(3) Explanation of the MapReduce implementation:
The main steps of Map and Reduce are as follows. In the Map stage, any Map(n, m) among the N*M Maps first loads the relevant parameters of the current iteration t from disk and checks whether this is the first iteration. If it is, φ and θ are randomly initialized; otherwise φ and θ are updated with the SparseLDA algorithm. Finally the corresponding φ is sent to the corresponding Wordstats-Reducer and the corresponding θ to the corresponding Doc-Reducer. In the Reduce stage, Reducers of different types merge different parameters. One Map pass is sketched below.
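A simplified sketch of one Map(n, m) pass. The file names, pickle serialization and labeled output pairs are assumptions for illustration, and the resampling step (which would call the SparseLDA sampler sketched earlier) is elided:

```python
import os
import pickle
import random

def map_block(n, m, block, K, iteration, workdir):
    """One Map(n, m) pass over block (n, m) of the divided text set."""
    theta_path = os.path.join(workdir, f"theta_row{n}.pkl")  # assumed layout
    phi_path = os.path.join(workdir, f"phi_col{m}.pkl")

    if iteration == 0:
        # First iteration: random topic assignments initialize the counts.
        theta = {doc_id: [0] * K for doc_id, _ in block}
        phi = {}
        for doc_id, words in block:
            for w in words:
                t = random.randrange(K)
                theta[doc_id][t] += 1
                phi.setdefault(w, [0] * K)[t] += 1
    else:
        # Later iterations: load the current parameters from disk, then
        # resample every word's topic with SparseLDA and update the counts.
        with open(theta_path, "rb") as f:
            theta = pickle.load(f)
        with open(phi_path, "rb") as f:
            phi = pickle.load(f)
        # ... per-word resampling with sparse_lda_sample(...) goes here ...

    # Emit local parameters, labeled with their destination Reducer.
    yield ("Doc-Reducer", n), theta
    yield ("Wordstats-Reducer", m), phi
```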
The document-word D*W matrix is highly sparse, so for each text every word occurring in the document is recorded as a two-tuple of its vocabulary index and its count.
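One possible encoding of this two-tuple representation, as a minimal sketch:

```python
from collections import Counter

def to_sparse(doc_words):
    """Encode a document as sorted (word_index, count) pairs."""
    return sorted(Counter(doc_words).items())

# e.g. to_sparse([3, 7, 3, 42]) -> [(3, 2), (7, 1), (42, 1)]
```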
The object of the present invention is to propose a semantic-representation topic model based on thin-division MapReduce that is highly scalable and can handle big-data, high-dimensional text; the algorithm mainly addresses big-data processing capacity. Existing MapReduce-based topic models consume too much memory and time and therefore cannot scale to higher data dimensions. The present invention proposes thin-division MapReduce, which solves the memory-consumption problem and improves the scalability of the algorithm.
The specific steps of the text semantic extraction method based on thin-division MapReduce of this embodiment are as follows. The D*W matrix of the text set to be processed is dually partitioned along the document dimension D and the vocabulary dimension W; each partition holds part of the content of part of the texts, giving N*M blocks in total. A number of Mappers are requested, and each partition of the text set is trained separately with the SparseLDA algorithm for the LDA topic model, yielding local parameters; different labels are assigned to different parameters, and the Reducer corresponding to each is recorded. A number of Reducers are requested; Reducers of different types merge the local parameters with different labels into global parameters, which are written to files. This Mapper-and-Reducer process is repeated until a convergence condition is reached, yielding the final trained model, which is used for semantic interpretation and representation of new texts. This outer loop is sketched below.
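A sketch of the outer training loop, assuming the map_block helper above and an assumed reduce_all helper that stands in for the four Reducers and returns the merged log-likelihood:

```python
def train(blocks, K, N, M, max_iters, tol, workdir):
    """Repeat Map and Reduce passes until the log-likelihood converges."""
    prev_ll = float("-inf")
    for it in range(max_iters):
        emitted = []
        for n in range(N):          # ideally all N*M Maps run in parallel
            for m in range(M):
                emitted.extend(map_block(n, m, blocks[n][m], K, it, workdir))
        # reduce_all (assumed) merges locals into globals on disk and
        # returns the log-likelihood gathered by the Likelihood-Reducer.
        ll = reduce_all(emitted, workdir)
        if abs(ll - prev_ll) < tol:  # convergence condition
            break
        prev_ll = ll
```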
The parameters each Mapper trains include the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set. The four parameters correspond to four kinds of Reducer: Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer respectively. Each kind of Reducer sums its corresponding incoming data and writes the output to a file in a fixed format, ready for the next MapReduce round; the merge itself reduces to a keyed summation, as sketched below.
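A sketch of the keyed summation each Reducer kind performs on its own label; the key layout is an illustrative assumption:

```python
from collections import defaultdict

def reduce_sum(pairs):
    """Sum local values that share a key, e.g. per-(doc, topic) counts
    arriving at one Doc-Reducer or per-(word, topic) counts arriving
    at one Wordstats-Reducer."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# e.g. reduce_sum([(("d0", 3), 2.0), (("d0", 3), 1.0)]) -> {("d0", 3): 3.0}
```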
In summary, based on thin-division MapReduce, the texts and the vocabulary are partitioned in two dimensions, and several kinds of Reducer process the different parameters.
Comparative experiments between the method of the invention and the prior-art Mr.LDA algorithm:
The data set used in the tests of the present invention consists of 10 million query records from Tencent's Soso search engine; each query contains 5 words on average, amounting to about 216 MB after conversion to LDA model input. We compare the algorithm of the invention with the existing Mr.LDA algorithm, an LDA algorithm based on MapReduce, with the number of topics K set to 200, because under this data set and its memory limit Mr.LDA can only run 200 topics. Figs. 3 and 4 show that the invention beats Mr.LDA in both speed and precision. The speed gap widens as the number of documents grows. The precision difference comes from the different algorithms: Mr.LDA is based on variational Bayes, whereas we chose exact Gibbs sampling.
The present invention can build topic models with far more topics. Fig. 5 gives the time each iteration takes under larger topic numbers {100, 1000, 10000, 50000}; when the number of topics reaches 50000, each iteration takes nearly 750 seconds.
The above is only a preferred embodiment of the present invention and does not limit the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.
Claims (3)
- 1. A text semantic extraction method based on thin-division MapReduce, characterized in that the method specifically includes:
dually partitioning the D*W matrix of the text set to be processed along the document dimension D and the vocabulary dimension W, each partition holding part of the content of part of the texts, giving N*M blocks in total;
requesting a number of Mappers, training each partition of the text set separately with the SparseLDA algorithm for the LDA topic model to obtain local parameters, assigning different labels to different parameters, and recording the Reducer corresponding to each, wherein the LDA topic model factorizes the D*W matrix into a D*K matrix and a K*W matrix, denoted θ_{D×K}, the document-topic distribution, and φ_{K×W}, the topic-word distribution, the number of topics K being settable; the SparseLDA algorithm is a speed-optimized Gibbs sampling algorithm; the main idea of Gibbs sampling is to compute, for each word w of every document d, a distribution of size K and then sample a topic k from it to assign to the corresponding θ_{D×K} and φ_{K×W}; SparseLDA rewrites the probability formula (2) used by the original Gibbs sampling to infer the LDA model as formula (1):

$$P(z=t\mid w)\;\propto\;\frac{\alpha_t\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{n_{t\mid d}\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{(\alpha_t+n_{t\mid d})\,n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(1)$$

$$P(z=t\mid w)\;\propto\;(\alpha_t+n_{t\mid d})\,\frac{\beta+n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(2)$$

requesting a number of Reducers, Reducers of different types merging the local parameters with different labels to obtain global parameters, which are written to files;
repeating the Mapper and Reducer processes until a convergence condition is reached, obtaining the final trained model for semantic interpretation and representation of new texts;
wherein the local parameters include four kinds: the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set, and the Reducers corresponding to the four parameters are Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer respectively;
and wherein, in the Map stage, any Map(n, m) among the N*M Maps first loads the relevant parameters of the current iteration from disk and checks whether this is the first iteration; if so, it randomly initializes φ_{K×W} and θ_{D×K}, otherwise it updates φ_{K×W} and θ_{D×K} with the SparseLDA algorithm, and finally sends the corresponding φ_{K×W} to the corresponding Wordstats-Reducer and the corresponding θ_{D×K} to the corresponding Doc-Reducer; in the Reduce stage, Reducers of different types merge different parameters.
- 2. The text semantic extraction method based on thin-division MapReduce according to claim 1, characterized in that each kind of Reducer sums its corresponding incoming data and writes the output to a file in a predetermined format.
- 3. The text semantic extraction method based on thin-division MapReduce according to claim 1, characterized in that different local parameters correspond to different Reducers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410379847.5A CN104156350B (en) | 2014-08-04 | 2014-08-04 | Text semantic meaning extraction method based on thin division MapReduce |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104156350A CN104156350A (en) | 2014-11-19 |
CN104156350B true CN104156350B (en) | 2018-03-06 |
Family
ID=51881855
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410379847.5A (Expired - Fee Related) | Text semantic meaning extraction method based on thin division MapReduce | 2014-08-04 | 2014-08-04
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104156350B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN104574965B (en) * | 2015-01-11 | 2017-01-04 | Hangzhou Dianzi University | Method for partitioning urban traffic hot-spot regions based on massive traffic-flow data
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | Neusoft Group Co., Ltd. | Theme-vocabulary distribution establishing method and system based on document segmenting
Non-Patent Citations (2)
Title |
---|
An online topic evolution mining model based on LDA; Cui Kai; Computer Science; 2010-11-30; vol. 37, no. 11; pp. 156-160 *
Research on a parallel LDA topic model construction method; Wang Xuren; Transactions of Beijing Institute of Technology; 2013-06-30; vol. 33, no. 6; pp. 590-593 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20180306