
CN104156350B - Text semantic meaning extraction method based on thin division MapReduce - Google Patents

Text semantic meaning extraction method based on thin division MapReduce

Info

Publication number
CN104156350B
CN104156350B, CN201410379847A
Authority
CN
China
Prior art keywords
reducer
text
division
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410379847.5A
Other languages
Chinese (zh)
Other versions
CN104156350A (en)
Inventor
曾嘉
高阳
严建峰
刘晓升
杨璐
刘志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201410379847.5A
Publication of CN104156350A
Application granted
Publication of CN104156350B
Expired - Fee Related
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a text semantic meaning extraction method based on thin division MapReduce. The method includes: doubly partitioning the pending text set according to the document dimension and the word-list dimension, so that each partition holds part of the content of part of the texts; requesting a number of Mappers and training each partition of the text set with the SparseLDA algorithm of the LDA topic model, obtaining local parameters, giving different labels to different parameters, and recording the Reducer corresponding to each; requesting a number of Reducers, where Reducers of different types merge the local parameters with different labels into global parameters, which are output to files; and repeating this Mapper-and-Reducer process until a convergence condition is reached, obtaining the final trained model, which is used for semantic interpretation and representation of new texts.

Description

Text semantic meaning extraction method based on thin division MapReduce
Technical field
The present invention relates to the field of machine learning, and more particularly to a text semantic extraction method based on thin division MapReduce.
Background art
Semantic understanding of text is a popular research topic at present. Digital information on the Internet is growing exponentially, including web pages, social network news, books, pictures, audio, video, microblogs and scientific papers, and information presented in document form is growing especially rapidly. How to effectively organize, manage and summarize these text messages and mine the knowledge hidden in them is a major challenge facing computer science today. In addition, network applications related to search all require an efficient semantic understanding module to capture the user's main intent and thereby serve the user better. For example, Baidu's search engine needs to find the text most relevant to the user's query, and Taobao search needs to return the products that best match the user.
Topic Models are a class of unsupervised learning algorithms; they require no manual annotation, saving human resources. The most mature topic model at present is the Latent Dirichlet Allocation (LDA) algorithm, which assumes that a document is a probability distribution over multiple topics and that a topic is a probability distribution over the word list. The LDA algorithm learns a topic model from a data set in order to predict the topic distribution of new documents. As documents multiply, the topics they contain keep increasing, and the size of the word list keeps growing as well. To better explain the topics contained in the data, we need a stable, practical processing method that can handle high-dimensional big data.
Parallelization is a direct way of handling high-dimensional big data, but existing parallel LDA algorithms lack stability and scalability and cannot achieve a high speed-up ratio when more processors are used. We select MapReduce as the parallel foundation, analyze its scalability bottleneck, and propose an improved method that strengthens the scalability and practicality of the algorithm.
In view of the above defects, the designers have been actively engaged in research and innovation to create a highly efficient, semantically compressed parallel storage method for text big data and make it more valuable to industry.
Summary of the invention
In order to solve the above technical problems, an object of the present invention is to provide a text semantic meaning extraction method based on thin division MapReduce that is highly scalable and can handle big-data, high-dimensional text sets.
The text semantic meaning extraction method of the present invention based on thin division MapReduce includes:
Partitioning the pending text set along two dimensions, the document dimension and the word dimension;
Processing the partitioned documents and words repeatedly through MapReduce until a convergence condition is reached, thereby obtaining a trained model;
Performing semantic interpretation and representation of texts based on the trained model.
Specifically, the method includes:
Partitioning the pending text set along two dimensions, the document dimension and the word dimension;
Performing Map-phase processing on the partitioned documents and words, training the data based on a predetermined LDA topic model to obtain several local parameters, and giving different labels to different local parameters;
Recording the Reducer corresponding to each differently labelled local parameter, and performing Reduce processing on the local parameters to obtain global parameters;
Repeating the above process until a convergence condition is reached, obtaining the trained model;
Performing semantic interpretation and representation of texts based on the trained model.
Further, the local parameters include four kinds of parameters: the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set;
The Reducers corresponding to the four kinds of parameters are Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer, respectively.
Further, each kind of Reducer collects and sums its corresponding data and outputs the result to a file in a predetermined format.
Further, the Reducers corresponding to different local parameters are different.
By means of the above scheme, the present invention has at least the following advantages:
In the implementation of the text semantic meaning extraction method of the present invention based on thin division MapReduce, memory consumption can be reduced to 1/M of that of existing algorithms, where M can be set by the user. Low memory consumption means the method can build larger-scale topic models, whether the scale lies in the texts or in the topics. In terms of speed, the existing MapReduce-based LDA models are all based on variational Bayes, whereas the present invention uses SparseLDA, a fast, high-accuracy approximate inference algorithm for LDA, so there is an obvious acceleration in speed with no loss of precision.
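To make the 1/M figure concrete, the following is illustrative back-of-the-envelope arithmetic only (a minimal Python sketch; all sizes are assumed for the example and are not taken from the patent). The topic-word matrix φ dominates memory, and under an M-way split of the word list each process only ever holds W/M of its columns:

    # Illustrative arithmetic for the 1/M memory claim; sizes are assumed.
    K, W, M = 50_000, 100_000, 10        # topics, word-list size, word-list splits
    bytes_per_count = 4                  # one 32-bit count per (topic, word) cell
    full_phi = K * W * bytes_per_count   # dense phi held on a single machine
    per_split = full_phi // M            # phi slice held by one column split
    print(f"full phi: {full_phi / 2**30:.1f} GiB, per split: {per_split / 2**30:.1f} GiB")
    # full phi: 18.6 GiB, per split: 1.9 GiB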
Brief description of the drawings
Fig. 1 is a schematic diagram of the text semantic meaning extraction method of the present invention based on thin division MapReduce;
Fig. 2 is a schematic diagram of a concrete 2*3 text division in the text semantic meaning extraction method of the present invention based on thin division MapReduce;
Fig. 3 is a diagram of experimental comparison results for the text semantic meaning extraction method of the present invention based on thin division MapReduce;
Fig. 4 is a further diagram of experimental comparison results for the text semantic meaning extraction method of the present invention based on thin division MapReduce;
Fig. 5 is a scalability verification diagram of the text semantic meaning extraction method of the present invention based on thin division MapReduce.
Embodiment
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples serve to illustrate the present invention but do not limit its scope.
(1) LDA models:
LDA models are three-layer Bayesian models. The size of the model's input data set is denoted D*W, where D is the total number of documents and W is the size of the word list. The LDA model turns the D*W matrix into a D*K matrix and a K*W matrix, denoted θ_{D×K} (the document-topic distribution) and φ_{K×W} (the topic-word distribution) respectively, where the number of topics K can be set. Several algorithms exist for inferring the LDA process; the most practical and commonly used is Gibbs Sampling (GS), and the present invention uses SparseLDA, a speed-optimized GS algorithm. The main idea of GS is to compute, for each word w of every document d, a distribution of size K, and then select a topic k from it to assign to the corresponding entries of θ_{D×K} and φ_{K×W}.
SparseLDA turns the probability formula (2) used by the original GS to infer the LDA model into formula (1), thereby removing some repeated calculation steps and accelerating the training of the model.
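For reference, formulas (1) and (2), reproduced here from claim 1 in LaTeX notation:

    P(z=t \mid w) \propto \frac{\alpha_t \beta}{\beta V + n_{\cdot|t}} + \frac{n_{t|d} \beta}{\beta V + n_{\cdot|t}} + \frac{(\alpha_t + n_{t|d}) n_{w|t}}{\beta V + n_{\cdot|t}}    (1)

    P(z=t \mid w) \propto (\alpha_t + n_{t|d}) \frac{\beta + n_{w|t}}{\beta V + n_{\cdot|t}}    (2)

where n_{t|d} is the number of words in document d currently assigned to topic t, n_{w|t} the number of times word w is assigned to topic t, n_{·|t} the total number of assignments to topic t, V the word-list size, and α_t, β the Dirichlet hyperparameters. In the usual reading of SparseLDA, only the third term of (1) depends jointly on the word and the document, and it is non-zero only for topics under which w actually occurs; the first two terms can be cached across words, which is where the saved computation comes from.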
(2) Explanation of the thin division MapReduce method employed by the present invention:
Tasks must be divided in order to distribute them to the Map processes. Prior-art division methods divide only the documents, whereas the present invention divides both the documents and the word list. As shown in Fig. 1, the data set is divided into N*M blocks; ideally there are N*M Map processes, though due to equipment limitations there may be fewer. The outputs of the Maps in the same row are delivered to one Doc-Reducer, for N Doc-Reducers in total; the outputs of the Maps in the same column are delivered to one Wordstats-Reducer, for M Wordstats-Reducers in total. The Likelihood-Reducer in Fig. 1 computes the likelihood value over all Maps, which can serve as the convergence condition of the model and can also be used to evaluate model quality. Fig. 2 gives a 2*3 text division, illustrating the concrete partition process of MapReduce.
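The division and routing just described can be sketched in a few lines of Python (a minimal sketch; the toy sizes N, M, D, W and the printed routing are illustrative only, not from the patent):

    # Sketch of the N*M thin division: documents are cut into N row
    # ranges and the word list into M column ranges; block (n, m) is
    # their intersection. Row n feeds Doc-Reducer n, column m feeds
    # Wordstats-Reducer m.
    N, M = 2, 3            # row splits (documents), column splits (word list)
    D, W = 10, 12          # toy corpus: 10 documents, 12-word word list

    blocks = {}
    for n in range(N):
        for m in range(M):
            docs  = list(range(n * D // N, (n + 1) * D // N))
            words = list(range(m * W // M, (m + 1) * W // M))
            blocks[(n, m)] = (docs, words)

    for (n, m) in sorted(blocks):
        print(f"Map({n},{m}) -> Doc-Reducer {n}, Wordstats-Reducer {m}")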
(3) Explanation of the MapReduce implementation:
The key steps of Map and Reduce are as follows. In the Map stage, any Map (n, m) among the N*M first loads the relevant parameters of the current iteration t from disk and judges whether this is the first iteration. If it is the first iteration, φ and θ are initialized at random; otherwise φ and θ are updated using the SparseLDA algorithm. Finally, the corresponding part of φ is sent to the corresponding Wordstats-Reducer and the corresponding part of θ is sent to the corresponding Doc-Reducer. In the Reduce stage, Reducers of different types merge different parameters.
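The following self-contained Python sketch shows a SparseLDA-style draw as it might run inside one Map. It implements the three additive buckets of formula (1) with a symmetric α and omits the incremental bucket caching of the full SparseLDA algorithm, so it illustrates the idea rather than reproducing the patent's implementation; all variable names are ours:

    import random

    def sample_topic(w, d, n_td, n_wt, n_t, alpha, beta, V):
        """Draw a topic for word w in document d from the three buckets
        of formula (1): smoothing, document-topic, and topic-word."""
        K = len(n_t)
        denom = [beta * V + n_t[t] for t in range(K)]
        s = [alpha * beta / denom[t] for t in range(K)]        # dense smoothing bucket
        r = {t: c * beta / denom[t]                            # sparse: topics in doc d
             for t, c in n_td[d].items()}
        q = {t: (alpha + n_td[d].get(t, 0)) * c / denom[t]     # sparse: topics of word w
             for t, c in n_wt[w].items()}
        S, R, Q = sum(s), sum(r.values()), sum(q.values())
        u = random.uniform(0.0, S + R + Q)
        if u < S:                          # smoothing bucket (rarely hit)
            for t in range(K):
                u -= s[t]
                if u <= 0.0:
                    return t
        elif u < S + R:                    # document bucket
            u -= S
            for t, mass in r.items():
                u -= mass
                if u <= 0.0:
                    return t
        else:                              # topic-word bucket
            u -= S + R
            for t, mass in q.items():
                u -= mass
                if u <= 0.0:
                    return t
        return K - 1                       # floating-point safety net

    # Toy state: document 0 has 2 words on topic 1 and 1 word on topic 3;
    # word 5 is currently assigned 4 times to topic 1 and once to topic 2.
    n_td = {0: {1: 2, 3: 1}}
    n_wt = {5: {1: 4, 2: 1}}
    n_t  = [10, 20, 5, 8]                  # global per-topic totals, K = 4
    print(sample_topic(w=5, d=0, n_td=n_td, n_wt=n_wt, n_t=n_t,
                       alpha=0.1, beta=0.01, V=12))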
The document-word D*W matrix is highly sparse, so for each text, every word occurring in it is recorded in the document as a two-tuple of word index and word count.
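A minimal sketch of this sparse per-document representation (the toy text and word list are illustrative):

    from collections import Counter

    # Each document is stored as (word index, count) two-tuples rather
    # than a dense row of length W.
    vocab = {"apple": 0, "banana": 1, "cherry": 2}
    doc_text = ["apple", "banana", "apple", "cherry", "apple"]

    counts = Counter(vocab[w] for w in doc_text)
    sparse_doc = sorted(counts.items())
    print(sparse_doc)   # [(0, 3), (1, 1), (2, 1)]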
The object of the present invention is to propose a highly scalable semantic-representation topic model, based on thin division MapReduce, that can handle high-dimensional big data; the algorithm mainly solves the problem of big-data processing capacity. Existing MapReduce-based topic models consume too much memory and time and therefore cannot scale to higher data dimensions. The present invention proposes thin division MapReduce, which solves the memory consumption problem and improves the scalability of the algorithm.
The specific steps of the text semantic meaning extraction method of this embodiment based on thin division MapReduce are as follows:
The pending text set, a D*W matrix, is doubly partitioned according to the document dimension D and the word-list dimension W; each partition holds part of the content of part of the texts, and there are N*M partitioned blocks in total. A number of Mappers are requested, and each partition of the text set is trained with the SparseLDA algorithm of the LDA topic model, producing local parameters; different parameters are given different labels recording the Reducer they correspond to. A number of Reducers are requested, and Reducers of different types merge the local parameters with different labels into global parameters, which are output to files. This Mapper-and-Reducer process is repeated until the convergence condition is reached, yielding the final trained model, which is used for semantic interpretation and representation of new texts.
Wherein, the parameters trained by each Mapper include the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set. The four kinds of parameters correspond to four kinds of Reducers, namely Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer. Each kind of Reducer collects and sums its corresponding data and outputs the result to a file in a fixed format for the next round of MapReduce.
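A minimal sketch of such a summing Reducer (assumptions, not from the patent: local parameters arrive as (key, partial value) pairs, the output is one tab-separated key-value pair per line, and the file name is a placeholder):

    from collections import defaultdict

    def reduce_sum(labelled_values, out_path):
        """Sum the partial values that share a key and write the totals
        to a file in a fixed format for the next MapReduce round."""
        totals = defaultdict(float)
        for key, value in labelled_values:
            totals[key] += value
        with open(out_path, "w") as f:
            for key in sorted(totals):
                f.write(f"{key}\t{totals[key]}\n")

    # Example: two Mappers contribute partial phi counts for (topic 1, word 5).
    reduce_sum([((1, 5), 2.0), ((1, 5), 3.0), ((0, 7), 1.0)],
               "wordstats_out.txt")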
In this way, based on thin division MapReduce, the texts and the word list are divided along two dimensions, and several kinds of Reducers handle the different parameters.
Comparison test between the method of the invention and the prior-art Mr.LDA algorithm:
The data set used to test the present invention comes from ten million query records of Tencent's Soso search engine; each query contains 5 words on average, amounting to about 216MB after conversion to the LDA input format. We compare the algorithm of the invention with Mr.LDA, an existing MapReduce-based LDA algorithm, with the topic number K set to 200, because under the memory limitation Mr.LDA can only run 200 topics on this data set. As can be seen from Figs. 3 and 4, the present invention beats Mr.LDA in both speed and precision. The gap in speed grows as the number of documents increases; the gap in precision arises because different algorithms are used: Mr.LDA is based on variational Bayes, while we have selected the exact Gibbs sampling.
The present invention can build topic models with larger topic counts. Fig. 5 shows the time required per iteration of the present invention under larger topic numbers {100, 1000, 10000, 50000}; it can be seen that when the number of topics reaches 50000, each iteration takes nearly 750 seconds.
The above is only a preferred embodiment of the present invention and is not intended to limit the invention. It should be noted that those of ordinary skill in the art can make certain improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (3)

  1. A text semantic meaning extraction method based on thin division MapReduce, characterized in that the method specifically includes:
    Doubly partitioning the pending text set, a D*W matrix, according to the document dimension D and the word-list dimension W, each partition being part of the content of part of the texts, with N*M partitioned blocks in total;
    Requesting a number of Mappers, training each partition of the text set with the SparseLDA algorithm of the LDA topic model to obtain local parameters, giving different labels to different parameters, and recording the Reducer corresponding to each, wherein the LDA topic model turns the D*W matrix into a D*K matrix and a K*W matrix, denoted θ_{D×K} (the document-topic distribution) and φ_{K×W} (the topic-word distribution) respectively, the number of topics K being settable; the SparseLDA algorithm is a speed-optimized Gibbs sampling algorithm, the main idea of Gibbs sampling being to compute, for each word w of every document d, a distribution of size K and then select a topic k from it to assign to the corresponding entries of θ_{D×K} and φ_{K×W}; SparseLDA turns the probability formula (2) used by the original Gibbs sampling to infer the LDA model into formula (1):
    P(z=t \mid w) \propto \frac{\alpha_t \beta}{\beta V + n_{\cdot|t}} + \frac{n_{t|d} \beta}{\beta V + n_{\cdot|t}} + \frac{(\alpha_t + n_{t|d}) n_{w|t}}{\beta V + n_{\cdot|t}}    (1)

    P(z=t \mid w) \propto (\alpha_t + n_{t|d}) \frac{\beta + n_{w|t}}{\beta V + n_{\cdot|t}}    (2)
    Requesting a number of Reducers, Reducers of different types merging the local parameters with different labels to obtain global parameters, which are output to files;
    Repeating the Mapper and Reducer processes until the convergence condition is reached, obtaining the final trained model for semantic interpretation and representation of new texts,
    The local parameters including four kinds of parameters: the document-topic distribution θ_{D×K}, the topic-word distribution φ_{K×W}, the overall topic distribution φ_K, and the log-likelihood of the text set; the Reducers corresponding to the four kinds of parameters being Doc-Reducer, Wordstats-Reducer, Globalstats-Reducer and Likelihood-Reducer, respectively,
    Wherein, in the Map stage, for any Map (n, m) among the N*M, the relevant parameters of the current iteration are first loaded from disk and it is judged whether this is the first iteration; if it is the first iteration, φ_{K×W} and θ_{D×K} are initialized at random, otherwise they are updated using the SparseLDA algorithm; finally the corresponding φ_{K×W} is sent to the corresponding Wordstats-Reducer and the corresponding θ_{D×K} is sent to the corresponding Doc-Reducer; in the Reduce stage, Reducers of different types merge different parameters.
  2. The text semantic meaning extraction method based on thin division MapReduce according to claim 1, characterized in that each kind of Reducer collects and sums its corresponding data and outputs the result to a file in a predetermined format.
  3. The text semantic meaning extraction method based on thin division MapReduce according to claim 1, characterized in that the Reducers corresponding to different local parameters are different.
CN201410379847.5A 2014-08-04 2014-08-04 Text semantic meaning extraction method based on thin division MapReduce Expired - Fee Related CN104156350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410379847.5A CN104156350B (en) 2014-08-04 2014-08-04 Text semantic meaning extraction method based on thin division MapReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410379847.5A CN104156350B (en) 2014-08-04 2014-08-04 Text semantic meaning extraction method based on thin division MapReduce

Publications (2)

Publication Number Publication Date
CN104156350A CN104156350A (en) 2014-11-19
CN104156350B true CN104156350B (en) 2018-03-06

Family

ID=51881855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410379847.5A Expired - Fee Related CN104156350B (en) 2014-08-04 2014-08-04 Text semantic meaning extraction method based on thin division MapReduce

Country Status (1)

Country Link
CN (1) CN104156350B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104574965B (en) * 2015-01-11 2017-01-04 杭州电子科技大学 A kind of urban transportation hot spot region based on magnanimity traffic flow data division methods

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678277A (en) * 2013-12-04 2014-03-26 东软集团股份有限公司 Theme-vocabulary distribution establishing method and system based on document segmenting

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678277A (en) * 2013-12-04 2014-03-26 东软集团股份有限公司 Theme-vocabulary distribution establishing method and system based on document segmenting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An online topic evolution mining model based on LDA; Cui Kai; Computer Science; Nov. 2010; Vol. 37, No. 11; pp. 156-160 *
Research on a method for building parallel LDA topic models; Wang Xuren; Journal of Beijing Institute of Technology; Jun. 2013; Vol. 33, No. 6; pp. 590-593 *

Also Published As

Publication number Publication date
CN104156350A (en) 2014-11-19

Similar Documents

Publication Publication Date Title
US11288444B2 (en) Optimization techniques for artificial intelligence
Beskow et al. Its all in a name: detecting and labeling bots by their name
CN104035917B (en) A kind of knowledge mapping management method and system based on semantic space mapping
Argyrou et al. Topic modelling on Instagram hashtags: An alternative way to Automatic Image Annotation?
CN104298776B (en) Search-engine results optimization system based on LDA models
CN106339495A (en) Topic detection method and system based on hierarchical incremental clustering
US10460203B2 (en) Jaccard similarity estimation of weighted samples: scaling and randomized rounding sample selection with circular smearing
CN106294418B (en) Search method and searching system
CN108241613A (en) A kind of method and apparatus for extracting keyword
US20160062979A1 (en) Word classification based on phonetic features
Nicholls et al. Understanding news story chains using information retrieval and network clustering techniques
Foxcroft et al. Name2vec: Personal names embeddings
Patel et al. Sentiment analysis on movie review using deep learning RNN method
US20180121819A1 (en) Jaccard similarity estimation of weighted samples: circular smearing with scaling and randomized rounding sample selection
IT201900005326A1 (en) AUTOMATED SYSTEM AND METHOD FOR EXTRACTION AND PRESENTATION OF QUANTITATIVE INFORMATION THROUGH PREDICTIVE DATA ANALYSIS
CN103218368A (en) Method and device for discovering hot words
Syarif Trending topic prediction by optimizing K-nearest neighbor algorithm
CN106970919B (en) Method and device for discovering new word group
CN104156350B (en) Text semantic meaning extraction method based on thin division MapReduce
CN105550308A (en) Information processing method, retrieval method and electronic device
US9811780B1 (en) Identifying subjective attributes by analysis of curation signals
Maheshwari et al. Twitter Sentiment Analysis in the Crisis Between Russia and Ukraine Using the Bert and LSTM Model
Fischer et al. Timely semantics: a study of a stream-based ranking system for entity relationships
US9792358B2 (en) Generating and using socially-curated brains
CN109947873B (en) Method, device and equipment for constructing scenic spot knowledge map and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180306