CN104156350B - Text semantic meaning extraction method based on thin division MapReduce - Google Patents
- Publication number: CN104156350B (application number CN201410379847.5A)
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The present invention relates to a text semantic extraction method based on thin-division MapReduce. The method includes: dually partitioning the text set to be processed along the document dimension and the vocabulary dimension, so that each partition holds part of the content of part of the texts; requesting a number of Mappers and training each partition of the text set separately with the SparseLDA algorithm for the LDA topic model, obtaining local parameters, assigning different labels to different parameters and recording the Reducer corresponding to each; requesting a number of Reducers, Reducers of different types merging the local parameters with different labels into global parameters that are written to files; and repeating this Mapper-and-Reducer process until a convergence condition is reached, yielding the final trained model, which is used for semantic interpretation and representation of new texts.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a text semantic extraction method based on thin-division MapReduce.
Background art
Semantic understanding of text is currently a popular research topic. Digital information on the internet is growing exponentially, spanning web pages, social-network news, books, pictures, audio, video, microblogs and scientific papers, and information presented in document form is growing especially fast. How to effectively organize, manage and summarize these texts and mine the knowledge hidden in them is a major challenge facing computer science today. In addition, search-related network applications all need an efficient semantic-understanding module that captures the user's intent so as to serve the user better. For example, Baidu's search engine must find the texts most relevant to the user's query, and Taobao's search must return the products that best match the user.
Topic models are a class of unsupervised learning algorithms: they need no manual labeling and thus save human effort. The most mature topic model at present is the latent Dirichlet allocation (LDA) algorithm, which assumes that a document is a probability distribution over multiple topics and that a topic is a probability distribution over the vocabulary. The LDA algorithm learns a topic model from a data set in order to predict the topic distribution of new documents. As documents multiply, the topics they contain also multiply, and the vocabulary keeps growing as well. To better interpret the topics contained in the data, we need a stable, practical processing method capable of handling high-dimensional big data.
Parallelization is a direct way to handle high-dimensional big data, but existing parallel LDA algorithms lack stability and scalability and cannot achieve high speed-up ratios with more processors. We choose MapReduce as the parallel foundation, analyze its scalability bottleneck and propose an improved method that strengthens the scalability and practicality of the algorithm.
In view of the above defects, the inventors have actively pursued research and innovation to create an efficient, semantically compressed parallel storage and processing method for text big data with greater industrial value.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a text semantic extraction method based on thin-division MapReduce that is highly scalable and can handle big-data, high-dimensional text sets.
The text semantic extraction method based on thin-division MapReduce of the present invention includes:
partitioning the text set to be processed along two dimensions, the document dimension and the word dimension;
processing the partitioned documents and words through repeated MapReduce passes until a convergence condition is reached, obtaining a trained model;
performing semantic interpretation and representation of texts based on the trained model.
Specifically, the method includes:
partitioning the text set to be processed along two dimensions, the document dimension and the word dimension;
performing the Map phase on the partitioned documents and words, training the data with a predetermined LDA topic model to obtain a number of local parameters, and assigning different labels to different local parameters;
recording the Reducer corresponding to each differently labeled local parameter, and performing the Reduce phase on the local parameters to obtain global parameters;
repeating the above process until a convergence condition is reached, obtaining the trained model;
performing semantic interpretation and representation of texts based on the trained model.
Further, the local parameters comprise four kinds: the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set;
the Reducers corresponding to the four parameters are Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer respectively.
Further, each kind of Reducer sums its corresponding incoming data and writes the result to a file in a predetermined format.
Further, different local parameters correspond to different Reducers.
By the above scheme, the present invention has at least the following advantages:
In the text semantic extraction method based on thin-division MapReduce of the present invention, the memory used during execution can be as little as 1/M of that of existing algorithms, where M can be set by the user. The low memory consumption means the method can build larger topic models, scaling up in either the number of texts or the number of topics. As for speed, existing MapReduce-based LDA models are all based on variational Bayes, whereas the present invention uses SparseLDA, a fast and accurate approximate-inference algorithm for LDA, so it achieves a clear speed-up without loss of precision.
Brief description of the drawings
Fig. 1 is a schematic diagram of the text semantic extraction method based on thin-division MapReduce of the present invention;
Fig. 2 is a diagram of the specific 2*3 text-partitioning principle of the method;
Fig. 3 is a diagram of experimental comparison results of the method;
Fig. 4 is a further diagram of experimental comparison results of the method;
Fig. 5 is a diagram verifying the scalability of the method.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the invention but do not limit its scope.
(1) The LDA model:
The LDA model is a three-layer Bayesian model. The size of the model's input data set is denoted D*W, where D is the total number of documents and W is the vocabulary size. The LDA model factorizes the D*W matrix into a D*K matrix and a K*W matrix, denoted θ_{D×K}, the document-topic distribution, and φ_{K×W}, the topic-word distribution; the number of topics K can be set. Several algorithms exist for inferring the LDA model; the most practical and widely used is Gibbs sampling (GS). The present invention uses SparseLDA, a speed-optimized GS algorithm. The main idea of GS is to compute, for each word w of every document d, a distribution of size K, then sample a topic k from it and assign it to the corresponding entries of θ_{D×K} and φ_{K×W}.
SparseLDA rewrites the probability formula (2) that the original GS uses to infer the LDA model as formula (1), removing repeated computation steps and accelerating model training:

$$P(z=t\mid w)\;\propto\;\frac{\alpha_t\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{n_{t\mid d}\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{(\alpha_t+n_{t\mid d})\,n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(1)$$

$$P(z=t\mid w)\;\propto\;(\alpha_t+n_{t\mid d})\,\frac{\beta+n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(2)$$

where α_t and β are Dirichlet hyperparameters, V is the vocabulary size, n_{t|d} is the count of topic t in document d, n_{w|t} is the count of word w under topic t, and n_{·|t} is the total count of topic t.
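To make the decomposition in formula (1) concrete, the following Python sketch shows how a SparseLDA-style sampler draws a topic by splitting the sampling mass into a smoothing bucket s, a document bucket r and a word bucket q, iterating only over nonzero counts for r and q. The function name and arguments are illustrative assumptions, and the buckets are recomputed on every call for clarity (a production sampler caches s and r incrementally); this is a sketch of the technique, not the patent's implementation.

```python
import random

def sparse_lda_sample(alpha, beta, V, n_topic, n_dt, n_wt):
    """Draw a topic for one word w of one document d, SparseLDA-style.

    alpha   : list of K hyperparameters alpha_t
    beta    : scalar hyperparameter
    V       : vocabulary size
    n_topic : n_topic[t] = n_{.|t}, total count of topic t
    n_dt    : sparse dict {t: n_{t|d}} for document d
    n_wt    : sparse dict {t: n_{w|t}} for word w
    """
    K = len(alpha)
    denom = [beta * V + n_topic[t] for t in range(K)]

    # s-bucket: smoothing mass (dense, but cacheable across words)
    s = sum(alpha[t] * beta / denom[t] for t in range(K))
    # r-bucket: document mass, only topics with n_{t|d} > 0
    r = sum(n * beta / denom[t] for t, n in n_dt.items())
    # q-bucket: word mass, only topics with n_{w|t} > 0
    q = sum((alpha[t] + n_dt.get(t, 0)) * n / denom[t] for t, n in n_wt.items())

    u = random.uniform(0.0, s + r + q)
    if u < q:  # most draws land here, touching only the word's topics
        for t, n in n_wt.items():
            u -= (alpha[t] + n_dt.get(t, 0)) * n / denom[t]
            if u <= 0.0:
                return t
    elif u < q + r:
        u -= q
        for t, n in n_dt.items():
            u -= n * beta / denom[t]
            if u <= 0.0:
                return t
    else:
        u -= q + r
        for t in range(K):
            u -= alpha[t] * beta / denom[t]
            if u <= 0.0:
                return t
    return K - 1  # numerical safety fallback
```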
(2) Explanation of the thin-division MapReduce method used in the present invention:
Tasks must be distributed to the Map processes, so the data must be partitioned. Prior-art methods partition only the documents. The present invention partitions both the documents and the vocabulary: as shown in Fig. 1, the data set is divided into N*M blocks. Ideally there are N*M Map processes, but equipment limits may leave fewer than N*M Maps. The Maps in the same row deliver their output to one Doc-Reducer, so there are N Doc-Reducers in total; the Maps in the same column deliver their output to one Wordstats-Reducer, so there are M Wordstats-Reducers in total. The Likelihood-Reducer in Fig. 1 computes the maximum-likelihood value over all Maps, which can serve as the model's convergence condition and can also evaluate model quality. Fig. 2 illustrates the 2*3 text-partitioning case and the specific MapReduce partitioning process; a minimal partitioning sketch follows.
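A minimal sketch of the two-dimensional division. The round-robin assignment of documents to rows and the equal-width vocabulary ranges for columns are illustrative choices, since the text does not fix a particular block-assignment rule:

```python
def thin_division(docs, vocab_size, N, M):
    """Split a corpus into N*M blocks along both dimensions.

    docs : list of documents, each a list of word ids in [0, vocab_size)
    Returns blocks[n][m] = list of (doc_id, [word ids]) pairs.
    """
    blocks = [[[] for _ in range(M)] for _ in range(N)]
    for doc_id, words in enumerate(docs):
        n = doc_id % N  # document-dimension block (row)
        per_col = [[] for _ in range(M)]
        for w in words:
            per_col[w * M // vocab_size].append(w)  # word-dimension block (column)
        for m in range(M):
            if per_col[m]:
                blocks[n][m].append((doc_id, per_col[m]))
    return blocks
```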
(3) Explanation of the MapReduce implementation:
The main steps of Map and Reduce are as follows. In the Map stage, any Map(n, m) among the N*M Maps first loads the relevant parameters of the current iteration t from disk and checks whether this is the first iteration. If it is, φ and θ are randomly initialized; otherwise φ and θ are updated with the SparseLDA algorithm. Finally the corresponding φ is sent to the corresponding Wordstats-Reducer and the corresponding θ to the corresponding Doc-Reducer. In the Reduce stage, Reducers of different types merge different parameters. One Map pass is sketched below.
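A simplified sketch of one Map(n, m) pass. The file names, pickle serialization and labeled output pairs are assumptions for illustration, and the resampling step (which would call the SparseLDA sampler sketched earlier) is elided:

```python
import os
import pickle
import random

def map_block(n, m, block, K, iteration, workdir):
    """One Map(n, m) pass over block (n, m) of the divided text set."""
    theta_path = os.path.join(workdir, f"theta_row{n}.pkl")  # assumed layout
    phi_path = os.path.join(workdir, f"phi_col{m}.pkl")

    if iteration == 0:
        # First iteration: random topic assignments initialize the counts.
        theta = {doc_id: [0] * K for doc_id, _ in block}
        phi = {}
        for doc_id, words in block:
            for w in words:
                t = random.randrange(K)
                theta[doc_id][t] += 1
                phi.setdefault(w, [0] * K)[t] += 1
    else:
        # Later iterations: load the current parameters from disk, then
        # resample every word's topic with SparseLDA and update the counts.
        with open(theta_path, "rb") as f:
            theta = pickle.load(f)
        with open(phi_path, "rb") as f:
            phi = pickle.load(f)
        # ... per-word resampling with sparse_lda_sample(...) goes here ...

    # Emit local parameters, labeled with their destination Reducer.
    yield ("Doc-Reducer", n), theta
    yield ("Wordstats-Reducer", m), phi
```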
The document-word D*W matrix is highly sparse, so for each text every word occurring in the document is recorded as a two-tuple of its vocabulary index and its count.
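One possible encoding of this two-tuple representation, as a minimal sketch:

```python
from collections import Counter

def to_sparse(doc_words):
    """Encode a document as sorted (word_index, count) pairs."""
    return sorted(Counter(doc_words).items())

# e.g. to_sparse([3, 7, 3, 42]) -> [(3, 2), (7, 1), (42, 1)]
```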
The object of the present invention is to propose a semantic-representation topic model based on thin-division MapReduce that is highly scalable and can handle big-data, high-dimensional text; the algorithm mainly addresses big-data processing capacity. Existing MapReduce-based topic models consume too much memory and time and therefore cannot scale to higher data dimensions. The present invention proposes thin-division MapReduce, which solves the memory-consumption problem and improves the scalability of the algorithm.
The specific steps of the text semantic extraction method based on thin-division MapReduce of this embodiment are as follows. The D*W matrix of the text set to be processed is dually partitioned along the document dimension D and the vocabulary dimension W; each partition holds part of the content of part of the texts, giving N*M blocks in total. A number of Mappers are requested, and each partition of the text set is trained separately with the SparseLDA algorithm for the LDA topic model, yielding local parameters; different labels are assigned to different parameters, and the Reducer corresponding to each is recorded. A number of Reducers are requested; Reducers of different types merge the local parameters with different labels into global parameters, which are written to files. This Mapper-and-Reducer process is repeated until a convergence condition is reached, yielding the final trained model, which is used for semantic interpretation and representation of new texts. This outer loop is sketched below.
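A sketch of the outer training loop, assuming the map_block helper above and an assumed reduce_all helper that stands in for the four Reducers and returns the merged log-likelihood:

```python
def train(blocks, K, N, M, max_iters, tol, workdir):
    """Repeat Map and Reduce passes until the log-likelihood converges."""
    prev_ll = float("-inf")
    for it in range(max_iters):
        emitted = []
        for n in range(N):          # ideally all N*M Maps run in parallel
            for m in range(M):
                emitted.extend(map_block(n, m, blocks[n][m], K, it, workdir))
        # reduce_all (assumed) merges locals into globals on disk and
        # returns the log-likelihood gathered by the Likelihood-Reducer.
        ll = reduce_all(emitted, workdir)
        if abs(ll - prev_ll) < tol:  # convergence condition
            break
        prev_ll = ll
```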
The parameters each Mapper trains include the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set. The four parameters correspond to four kinds of Reducer: Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer respectively. Each kind of Reducer sums its corresponding incoming data and writes the output to a file in a fixed format, ready for the next MapReduce round; the merge itself reduces to a keyed summation, as sketched below.
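A sketch of the keyed summation each Reducer kind performs on its own label; the key layout is an illustrative assumption:

```python
from collections import defaultdict

def reduce_sum(pairs):
    """Sum local values that share a key, e.g. per-(doc, topic) counts
    arriving at one Doc-Reducer or per-(word, topic) counts arriving
    at one Wordstats-Reducer."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# e.g. reduce_sum([(("d0", 3), 2.0), (("d0", 3), 1.0)]) -> {("d0", 3): 3.0}
```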
In summary, based on thin-division MapReduce, the texts and the vocabulary are partitioned in two dimensions, and several kinds of Reducer process the different parameters.
Comparative experiments between the method of the invention and the prior-art Mr.LDA algorithm:
The data set used in the tests of the present invention consists of 10 million query records from Tencent's Soso search engine; each query contains 5 words on average, amounting to about 216 MB after conversion to LDA model input. We compare the algorithm of the invention with the existing Mr.LDA algorithm, an LDA algorithm based on MapReduce, with the number of topics K set to 200, because under this data set and its memory limit Mr.LDA can only run 200 topics. Figs. 3 and 4 show that the invention beats Mr.LDA in both speed and precision. The speed gap widens as the number of documents grows. The precision difference comes from the different algorithms: Mr.LDA is based on variational Bayes, whereas we chose exact Gibbs sampling.
The present invention can build topic models with far more topics. Fig. 5 gives the time each iteration takes under larger topic numbers {100, 1000, 10000, 50000}; when the number of topics reaches 50000, each iteration takes nearly 750 seconds.
The above is only a preferred embodiment of the present invention and does not limit the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.
Claims (3)
- 1. A text semantic extraction method based on thin-division MapReduce, characterized in that the method specifically includes:
dually partitioning the D*W matrix of the text set to be processed along the document dimension D and the vocabulary dimension W, each partition holding part of the content of part of the texts, giving N*M blocks in total;
requesting a number of Mappers, training each partition of the text set separately with the SparseLDA algorithm for the LDA topic model to obtain local parameters, assigning different labels to different parameters, and recording the Reducer corresponding to each, wherein the LDA topic model factorizes the D*W matrix into a D*K matrix and a K*W matrix, denoted θ_{D×K}, the document-topic distribution, and φ_{K×W}, the topic-word distribution, the number of topics K being settable; the SparseLDA algorithm is a speed-optimized Gibbs sampling algorithm; the main idea of Gibbs sampling is to compute, for each word w of every document d, a distribution of size K and then sample a topic k from it to assign to the corresponding θ_{D×K} and φ_{K×W}; SparseLDA rewrites the probability formula (2) used by the original Gibbs sampling to infer the LDA model as formula (1):

$$P(z=t\mid w)\;\propto\;\frac{\alpha_t\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{n_{t\mid d}\,\beta}{\beta V+n_{\cdot\mid t}}\;+\;\frac{(\alpha_t+n_{t\mid d})\,n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(1)$$

$$P(z=t\mid w)\;\propto\;(\alpha_t+n_{t\mid d})\,\frac{\beta+n_{w\mid t}}{\beta V+n_{\cdot\mid t}}\qquad(2)$$

requesting a number of Reducers, Reducers of different types merging the local parameters with different labels to obtain global parameters, which are written to files;
repeating the Mapper and Reducer processes until a convergence condition is reached, obtaining the final trained model for semantic interpretation and representation of new texts;
wherein the local parameters include four kinds: the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set, and the Reducers corresponding to the four parameters are Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer respectively;
and wherein, in the Map stage, any Map(n, m) among the N*M Maps first loads the relevant parameters of the current iteration from disk and checks whether this is the first iteration; if so, it randomly initializes φ_{K×W} and θ_{D×K}, otherwise it updates φ_{K×W} and θ_{D×K} with the SparseLDA algorithm, and finally sends the corresponding φ_{K×W} to the corresponding Wordstats-Reducer and the corresponding θ_{D×K} to the corresponding Doc-Reducer; in the Reduce stage, Reducers of different types merge different parameters.
- 2. The text semantic extraction method based on thin-division MapReduce according to claim 1, characterized in that each kind of Reducer sums its corresponding incoming data and writes the output to a file in a predetermined format.
- 3. The text semantic extraction method based on thin-division MapReduce according to claim 1, characterized in that different local parameters correspond to different Reducers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410379847.5A CN104156350B (en) | 2014-08-04 | 2014-08-04 | Text semantic meaning extraction method based on thin division MapReduce |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104156350A CN104156350A (en) | 2014-11-19 |
CN104156350B true CN104156350B (en) | 2018-03-06 |
Family
ID=51881855
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410379847.5A (Expired - Fee Related) | Text semantic meaning extraction method based on thin division MapReduce | 2014-08-04 | 2014-08-04
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104156350B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN104574965B (en) * | 2015-01-11 | 2017-01-04 | Hangzhou Dianzi University | Method for partitioning urban traffic hot-spot regions based on massive traffic-flow data
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | Neusoft Group Co., Ltd. | Theme-vocabulary distribution establishing method and system based on document segmenting
Non-Patent Citations (2)
Title |
---|
An online topic evolution mining model based on LDA; Cui Kai; Computer Science; 2010-11-30; vol. 37, no. 11; pp. 156-160 *
Research on a parallel LDA topic model construction method; Wang Xuren; Transactions of Beijing Institute of Technology; 2013-06-30; vol. 33, no. 6; pp. 590-593 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20180306