CN108804595A - Short text representation method based on word2vec - Google Patents
Short text representation method based on word2vec
- Publication number
- CN108804595A CN108804595A CN201810525103.8A CN201810525103A CN108804595A CN 108804595 A CN108804595 A CN 108804595A CN 201810525103 A CN201810525103 A CN 201810525103A CN 108804595 A CN108804595 A CN 108804595A
- Authority
- CN
- China
- Prior art keywords
- document
- word
- close
- word2vec
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a short text representation method based on word2vec, comprising the following steps. S1: input a training text set that has undergone text preprocessing, set the word2vec method parameters, and train to obtain the word-vector set corresponding to the training text set. S2: for each word in every document, compute, via the cosine distance between word vectors, the set of close words of that word across the entire training text set. S3: compute the cosine distance between the close words and the document for every document. S4: sort by cosine distance in descending order, and select the top n close words and their cosine distances to form the document's n close words and cosine measures. S5: compute the weights, within the document, of the document's words and of the n selected close words, form the new text representation, and output each document's word2vec-improved vector-space representation.
Description
Technical field
The present invention relates to the field of computer science and technology, and more particularly to a short text representation method based on word2vec.
Background technology
In text mining, before a machine can understand sample information, the samples must first pass through a text representation step that converts them into numerical form. With the continuous broadening of natural language processing and the development of computer technology, how best to represent the semantic information of text numerically has always been a vital research question in text processing, because it directly affects mining performance. For short text mining in particular, effective text representation is the hard part of the research: short texts generated on social platforms not only suffer from sparse features, incomplete semantics, polysemy, and synonymy, but also exhibit casual wording, heavy use of neologisms, and sheer volume.
Common text representation models include the Boolean model, probabilistic models, and the vector space model; the most widely used is the vector space model (VSM), proposed in 1958 by Gerard Salton et al. Its basic idea is to represent text as vectors: a subset of feature words is selected from the training set, each feature word becomes one dimension of a vector-space coordinate system, and each text is thereby formalized as a vector in a multi-dimensional space — a point in n-dimensional space — so that the similarity between texts can be measured by the angle or distance between their vectors (Tai Deji, Wang Jun. An improved feature-weighting algorithm for text classification [J]. Computer Engineering, 2010, 36(9): 197-199.). However, the vector space model suffers from a sparse data space and ignores the semantic relations between words, which makes it comparatively weak at representing short text. Some scholars have tried to remedy these defects. Wang B K et al. proposed a strong feature thesaurus (SFT) based on latent Dirichlet allocation and information gain; SFT combines LDA and IG to adjust vocabulary weights and select feature words carrying stronger semantic information (Wang B K, Huang Y F, Yang W X, et al. Short text classification based on strong feature thesaurus [J]. Journal of Zhejiang University - Science C (Computers & Electronics), 2012, 13(9): 649-659.). Yang et al. proposed a semantic expansion method that combines the lexical and semantic features of short text, using Wikipedia as a background knowledge base to obtain the semantic features of vocabulary and recomputing term weights from the combination of the lexical and semantic features (Yang L, Li C, Ding Q, et al. Combining Lexical and Semantic Features for Short Text Classification [J]. Procedia Computer Science, 2013, 22: 78-86.).
In 2013, the Tomas Mikolov team at Google released word2vec, an open-source word-vector generation tool based on deep learning (Mikolov T, Le Q V, Sutskever I. Exploiting similarities among languages for machine translation [J]. arXiv preprint arXiv:1309.4168, 2013; Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space [J]. arXiv preprint arXiv:1301.3781, 2013.). The algorithm can learn high-quality word vectors from a large corpus of real documents in a relatively short time, and the resulting vectors make it easy to compute the semantic similarity between words. Word2vec not only discovers the semantic relations between words; it also offers a new way to address the sparsity of short-text representations in the vector space model.
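The cosine measure between word vectors on which the method relies can be sketched as follows (a minimal illustration with made-up 3-dimensional vectors; real word2vec vectors are, e.g., 200-dimensional):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "word vectors" (assumed values, for illustration only).
vec_policy = [0.8, 0.1, 0.3]
vec_child = [0.7, 0.2, 0.4]
sim = cosine_similarity(vec_policy, vec_child)
print(round(sim, 3))
```

Vectors pointing in similar directions score close to 1, orthogonal vectors score 0; this is the "cosine distance" the method sorts by.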
Summary of the invention
The present invention addresses the sparse data space and missing semantics of the vector space model (VSM) by proposing a short text representation method based on word2vec; clustering results of short text represented with this method allow knowledge topics to be extracted more effectively.
To achieve the above object of the invention, the technical solution adopted is as follows:
A short text representation method based on word2vec, comprising the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec method parameters, and train to obtain the word-vector set corresponding to the training text set;
S2: for each word in every document, compute, via the cosine distance between word vectors, the set of close words of that word across the entire training text set;
S3: compute the cosine distance between the close words and the document for every document;
S4: sort by cosine distance in descending order, and select the top n close words and their cosine distances to form the document's n close words and cosine measures;
S5: compute the weights, within the document, of the document's words and of the n selected close words, form the new text representation, and output each document's word2vec-improved vector-space representation.
Preferably, the preprocessing of the training text set in step S1 comprises:
S1.1: build a user dictionary and perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words using an existing stop-word list, and remove pronouns, prepositions, and locative nouns according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF, or TF-IDF to reduce the feature dimensionality.
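A minimal sketch of the TF-IDF feature scoring of step S1.3 (pure Python; the segmentation and filtering of S1.1-S1.2 are assumed to have already produced token lists, and the token values are hypothetical):

```python
import math
from collections import Counter

def tfidf_scores(docs):
    # docs: list of token lists (already segmented, stop words removed).
    # TF-IDF(t, d) = (count of t in d / len(d)) * log(N / df(t)).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (tf[t] / total) * math.log(n / df[t]) for t in tf})
    return scores

docs = [["policy", "child", "education"],
        ["policy", "income", "pressure"],
        ["education", "medical", "care"]]
scores = tfidf_scores(docs)
# "policy" appears in 2 of 3 docs, so its IDF (and score) is lower
# than that of "child", which appears in only 1 document.
```

Feature selection then keeps only the highest-scoring terms, reducing the dimensionality as the step describes.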
Preferably, the specific calculation of step S3 is as follows:
if several words in a document share the same close word, the cosine distances of that shared close word are summed to form the cosine distance between the close word and the document; otherwise, the original close word and its cosine distance to the word in the document are retained:
s(t, d) = s(t, t1) + s(t, t2) + s(t, t3) + … + s(t, tn)   (1)
where t, t1, t2, t3, …, tn are words in document d, s(t, tn) denotes the cosine measure between word t and word tn in document d, and s(t, d) denotes the cosine measure between word t and document d.
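The summation of formula (1) — accumulating the cosine measures whenever several document words share the same close word — can be sketched as follows (the close-word mapping and its cosine values are hypothetical):

```python
def close_word_document_scores(close_words):
    # close_words: {document word -> {close word: cosine measure}}.
    # When several document words share a close word t, formula (1)
    # sums their cosine measures: s(t, d) = s(t, t1) + ... + s(t, tn).
    s = {}
    for word, neighbours in close_words.items():
        for t, cos in neighbours.items():
            s[t] = s.get(t, 0.0) + cos
    return s

# Hypothetical close words for a document containing "二孩" and "政策".
close_words = {
    "二孩": {"生育": 0.71, "政策放开": 0.64},
    "政策": {"生育": 0.58, "放开": 0.55},
}
s = close_word_document_scores(close_words)
print(round(s["生育"], 2))  # shared close word: 0.71 + 0.58 -> 1.29
```

A close word shared by only one document word simply keeps its original cosine measure, matching the "otherwise" branch of the step.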
Preferably, the specific procedure by which step S5 computes the weights of the document's words and of the n selected close words within the document is as follows:
where W(t, nd) is the weight of word t in document nd after the n close words are added, computed with the feature-weighting method TF-IDF, and s(t, d) denotes the cosine measure between word t and document d.
Compared with the prior art, the beneficial effects of the invention are:
(1) The present invention proposes a short text representation method based on word2vec, in which word2vec is used to find the close words of each word in a text and the computed close words then serve as expansion features of the text in the vector space model. This feature expansion both takes the semantic relations between words into account and alleviates the feature sparsity of the vector space model.
(2) Experimental results show that, in both the text clustering and the text classification stages of the experiments, the word2vec-based short text representation method performs significantly better than the traditional vector space model: the clustering DB_index decreases by 0.704 on average, and classification accuracy increases by 4.614% on average. This shows that the method improves clustering quality at both the technical and application levels and can better extract the knowledge topics in a corpus.
Brief description of the drawings
Fig. 1 shows the process by which the word2vec-based improved vector space model method represents short text.
Fig. 2 is a line chart of DB_index versus feature dimensionality at different cluster counts for text represented by the traditional vector space model.
Fig. 3 is a line chart of DB_index versus feature dimensionality at different cluster counts for text represented by the method of the invention.
Fig. 4 is a bar chart of the clustering DB_index values of text represented by the traditional vector space model and by the method of the invention.
Fig. 5 is a bar chart of the classification accuracy, as a function of feature dimensionality, of text represented by the traditional vector space model and by the method of the invention.
Detailed description of the embodiments
The attached figures are for illustration only and shall not be construed as limiting the patent; the invention is further elaborated below with reference to the drawings and embodiments.
Embodiment 1
The above and other technical features and advantages of the invention are described in more detail below with reference to the drawings, taking a short-text corpus on the universal two-child policy as the example in this embodiment.
The training text set was acquired and preprocessed as follows:
The universal two-child-policy short-text corpus used in the experiments was obtained by crawling Sina Weibo; after the necessary cleaning and filtering of the crawled data, 102,300 usable records remained as the experimental corpus. After Chinese word segmentation and part-of-speech tagging with the Java edition of the NLPIR2016 segmentation system, the Harbin Institute of Technology stop-word list was imported to remove stop words, and words without practical meaning — pronouns, prepositions, locative nouns — were removed according to part of speech. Among the unsupervised feature selection methods currently in common use (TF, IDF, TF-IDF), this example uses TF-IDF to perform feature selection and reduce the feature dimensionality.
(1) Short text representation process
Fig. 1 shows the process of representing short text with the word2vec-based short text representation method; the specific steps are as follows:
S1: Before applying the method, word-vector files must first be generated from the input data under Linux with Google's word2vec open-source tool. The text data obtained after preprocessing the training set serves as the data set for word-vector generation, with the parameters -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1, i.e. the Skip-Gram model with a training window of 5, generating 200-dimensional word vectors.
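The flag string above can be restated, for reference, as keyword arguments of the gensim re-implementation of word2vec (an assumed equivalence for gensim >= 4.0 parameter names; the patent itself uses Google's original C tool, not gensim):

```python
# CLI flags of Google's word2vec tool used in step S1, mapped onto
# gensim Word2Vec keyword arguments (assumed equivalence, gensim >= 4.0).
gensim_kwargs = {
    "sg": 1,             # -cbow 0      -> Skip-Gram model
    "vector_size": 200,  # -size 200    -> 200-dimensional vectors
    "window": 5,         # -window 5    -> training window of 5
    "negative": 0,       # -negative 0  -> no negative sampling
    "hs": 1,             # -hs 1        -> hierarchical softmax
    "sample": 1e-3,      # -sample 1e-3 -> subsampling threshold
    "workers": 12,       # -threads 12
}
# Training would then be (not executed here, gensim assumed installed):
# from gensim.models import Word2Vec
# model = Word2Vec(corpus_of_token_lists, **gensim_kwargs)
print(gensim_kwargs["sg"], gensim_kwargs["vector_size"])
```

The -binary 1 flag only controls the output file format of the C tool and has no gensim counterpart in this mapping.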
S2: Using the word2vec method, the close words of each word in the text across the entire training text set are computed from the word vectors. Table 1 shows words from one text together with their corresponding close words and cosine distance values.
Table 1. Close words and cosine distances of selected words obtained by the word2vec method
S3: The cosine distance between the close words and the document is computed for every document according to formula (1).
S4: The number n of close words per document must be chosen with care: if n is too small, each document has too few close words to consider after feature selection; if n is too large, the computation and running time of the representation step increase sharply. The present invention sets n = 50, i.e. the top 50 close words of a document and their corresponding cosine distances serve as the document's expansion features.
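Selecting the top-n close words of step S4 is a plain sort-and-truncate over the scores from formula (1) (hypothetical scores; the patent uses n = 50, a smaller n is shown here for brevity):

```python
def top_n_close_words(scores, n):
    # scores: {close word: cosine distance to the document}, from formula (1).
    # Sort by cosine distance, descending, and keep the top n.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

scores = {"生育": 1.29, "放开": 0.55, "政策放开": 0.64, "压力": 0.40}
print(top_n_close_words(scores, 2))  # -> [('生育', 1.29), ('政策放开', 0.64)]
```

If a document yields fewer than n close words, the truncation simply returns them all.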
S5: The weights, within the document, of the document's words and of the n selected close words are computed according to formula (2), forming the new text representation; each document's word2vec-improved vector-space representation is output.
(2) Evaluation method
Using K-means clustering, documents represented by the traditional method and by the word2vec-based short text representation method are clustered separately at different feature dimensionalities, computing the DB_index for cluster counts from 5 to 15; the cluster count is determined by seeking the minimum DB_index.
Fig. 2 shows the line chart of DB_index versus feature dimensionality at different cluster counts for text represented by the traditional vector space model; Fig. 3 shows the corresponding line chart for the word2vec-based short text representation method.
As Figs. 2 and 3 show, for both the traditional vector space model and the word2vec-based method, when the cluster count is 13 the within-cluster variance and the between-cluster separation remain relatively stable regardless of feature dimensionality, the category partition is comparatively stable, and DB_index attains a minimum; the cluster count of 13 is therefore selected as optimal.
(1) DB_index
DB_index = (1/k) · Σ_{i=1..k} max_{j≠i} (S_i + S_j) / d_ij
where k is the number of clusters, d_ij is the distance between the centers of clusters i and j, and S_i is the average distance of the samples in cluster i to the cluster center.
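The DB_index used to choose the cluster count — in the standard Davies-Bouldin form consistent with the definitions above, DB = (1/k) · Σ_i max_{j≠i} (S_i + S_j) / d_ij — can be sketched for precomputed cluster statistics (a toy example with assumed values of S_i and center distances, not the experiment's data):

```python
def db_index(S, centers_dist):
    # S[i]: average distance of cluster i's samples to its center.
    # centers_dist[i][j]: distance between the centers of clusters i and j.
    # DB = (1/k) * sum_i max_{j != i} (S_i + S_j) / d_ij
    k = len(S)
    total = 0.0
    for i in range(k):
        total += max((S[i] + S[j]) / centers_dist[i][j]
                     for j in range(k) if j != i)
    return total / k

# Toy statistics for k = 3 clusters (assumed values).
S = [1.0, 2.0, 1.5]
d = [[0.0, 4.0, 5.0],
     [4.0, 0.0, 3.0],
     [5.0, 3.0, 0.0]]
print(round(db_index(S, d), 3))  # -> 1.028
```

Lower DB_index means tighter and better-separated clusters, which is why the minimum over cluster counts 5-15 determines the final cluster number.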
Since the optimal clustering is obtained with 13 clusters, the clustering quality of the two text representation methods is compared at that cluster count. Fig. 4 is a bar chart of the DB_index values of the two document representation methods at different feature dimensionalities with 13 clusters. As Fig. 4 shows, with 13 clusters and feature dimensionalities between 200 and 2000, the word2vec-based short text representation method obtains lower DB_index values than the traditional vector space model. This indicates that representing text with the word2vec-based method expresses the text better, yielding greater within-cluster cohesion and greater between-cluster separation during clustering.
(2) Interpretation of the clustering results
Since the word2vec-based short text representation method attains its minimum DB_index of 1.168 at a feature dimensionality of 200, the clustering results at that setting are interpreted, as shown in Table 2.
As Table 2 shows, after the universal two-child policy took effect, several livelihood issues demand immediate attention and solutions, including category 1 (education and medical care), category 4 (late marriage and childbearing), and category 11 (women's employment). These are the two-child issues the public fed back immediately after the policy opened, the negative effects the policy has caused; the relevant authorities should take note and introduce corresponding measures. Meanwhile, categories 2, 6, and 9 reflect the economic burden and life stress the policy brings: people need more opportunities to raise personal income and quality of life before they can consider a second child, which in turn presses the government to implement more welfare policies and provide fuller means of raising incomes, such as more comprehensive employment; otherwise, even with the policy open, willingness to bear children will remain low under the pressure of real life, and population aging will not be eased. Categories 3, 8, and 13 mainly involve the family-internal problems a second child may bring; although not directly related to matters such as policy implementation, these are the worries each citizen weighs when deciding whether to respond to the policy, and the public can be trusted to have its own appropriate ways and judgment. Categories 5, 10, and 12 mainly involve the public's views of and feelings about the universal two-child policy, and most of this content reflects support for and expectation of it; evidently the policy still meets public demand.
Table 2. Feature words of the different clusters and example texts within each class
The clusters formed from short text represented by the method of the invention show that each category in the clustering result is well interpretable, making it easier to extract the knowledge topics in each cluster.
(3) Text classification accuracy with the clustering result as training corpus
Test-set documents were manually classified according to the feature words and interpretation of each category, producing the test-set category labels. With the manually labeled documents as the test set and the category assignments produced by text clustering as the training set, the accuracy of the training corpus built automatically by clustering was verified. Text was represented with the traditional vector space model and with the word2vec-based short text representation method respectively, reusing the TF-IDF feature selection of the clustering stage; the classification results obtained with different classifiers at different feature dimensionalities are shown in Table 3.
Table 3. Accuracy of different classifiers versus feature dimensionality under the different document representation methods
To compare the two document representation methods' classification performance more intuitively, a bar chart of classification accuracy versus feature dimensionality can be drawn from Table 3, as shown in Fig. 5.
As Fig. 5 shows, under the word2vec-based short text representation method, the training corpus built automatically by clustering achieves classification accuracy above 80% at every feature dimensionality except 100 (where the features are probably too few to distinguish the categories). Moreover, for every feature dimensionality and every classifier, the accuracy of the word2vec-based method exceeds that of the traditional vector space model; the improvement is only 2.38% at a feature dimensionality of 500 with the SVM classifier, and between 3.16% and 6.87% in all other cases. This shows that a corpus built by clustering with such a document representation better distinguishes the knowledge topics in the corpus and achieves better results in application.
Obviously, the above embodiment is merely an example given for clarity of illustration and is not a limitation on the embodiments of the invention. Those of ordinary skill in the art may make other variations or changes on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.
Claims (4)
1. A short text representation method based on word2vec, characterized by comprising the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec method parameters, and train to obtain the word-vector set corresponding to the training text set;
S2: for each word in every document, compute, via the cosine distance between word vectors, the set of close words of that word across the entire training text set;
S3: compute the cosine distance between the close words and the document for every document;
S4: sort by cosine distance in descending order, and select the top n close words and their cosine distances to form the document's n close words and cosine measures;
S5: compute the weights, within the document, of the document's words and of the n selected close words, form the new text representation, and output each document's word2vec-improved vector-space representation.
2. The word2vec-based short text representation method according to claim 1, characterized in that the preprocessing of the training text set in step S1 comprises:
S1.1: build a user dictionary and perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words using an existing stop-word list, and remove pronouns, prepositions, and locative nouns according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF, or TF-IDF to reduce the feature dimensionality.
3. The word2vec-based short text representation method according to claim 1, characterized in that the specific calculation of step S3 is as follows:
if several words in a document share the same close word, the cosine distances of that shared close word are summed to form the cosine distance between the close word and the document; otherwise, the original close word and its cosine distance to the word in the document are retained:
s(t, d) = s(t, t1) + s(t, t2) + s(t, t3) + … + s(t, tn)   (1)
where t, t1, t2, t3, …, tn are words in document d, s(t, tn) denotes the cosine measure between word t and word tn in document d, and s(t, d) denotes the cosine measure between word t and document d.
4. The word2vec-based short text representation method according to claim 3, characterized in that the specific procedure by which step S5 computes the weights of the document's words and of the n selected close words within the document is as follows:
where W(t, nd) is the weight of word t in document nd after the n close words are added, computed with the feature-weighting method TF-IDF, and s(t, d) denotes the cosine measure between word t and document d.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810525103.8A CN108804595B (en) | 2018-05-28 | 2018-05-28 | Short text representation method based on word2vec |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804595A true CN108804595A (en) | 2018-11-13 |
CN108804595B CN108804595B (en) | 2021-07-27 |
Family
ID=64090655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810525103.8A Expired - Fee Related CN108804595B (en) | 2018-05-28 | 2018-05-28 | Short text representation method based on word2vec |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804595B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279288A (en) * | 2015-12-04 | 2016-01-27 | 深圳大学 | Online content recommending method based on deep neural network |
CN107102989A (en) * | 2017-05-24 | 2017-08-29 | 南京大学 | A kind of entity disambiguation method based on term vector, convolutional neural networks |
CN107590218A (en) * | 2017-09-01 | 2018-01-16 | 南京理工大学 | The efficient clustering method of multiple features combination Chinese text based on Spark |
Non-Patent Citations (1)
Title |
---|
Chen Jie et al., "Document classification method based on Word2vec", Computer Systems & Applications (《计算机系统应用》) * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162620A (en) * | 2019-01-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Black detection method, device, server and the storage medium for producing advertisement |
CN110162620B (en) * | 2019-01-10 | 2023-08-18 | 腾讯科技(深圳)有限公司 | Method and device for detecting black advertisements, server and storage medium |
CN110232128A (en) * | 2019-06-21 | 2019-09-13 | 华中师范大学 | Topic file classification method and device |
CN110442873A (en) * | 2019-08-07 | 2019-11-12 | 云南电网有限责任公司信息中心 | A kind of hot spot work order acquisition methods and device based on CBOW model |
CN110705304A (en) * | 2019-08-09 | 2020-01-17 | 华南师范大学 | Attribute word extraction method |
CN111177401A (en) * | 2019-12-12 | 2020-05-19 | 西安交通大学 | Power grid free text knowledge extraction method |
CN112257431A (en) * | 2020-10-30 | 2021-01-22 | 中电万维信息技术有限责任公司 | NLP-based short text data processing method |
Also Published As
Publication number | Publication date |
---|---|
CN108804595B (en) | 2021-07-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210727 |