CN108804595A - Short text representation method based on word2vec - Google Patents
Short text representation method based on word2vec
- Publication number
- CN108804595A CN108804595A CN201810525103.8A CN201810525103A CN108804595A CN 108804595 A CN108804595 A CN 108804595A CN 201810525103 A CN201810525103 A CN 201810525103A CN 108804595 A CN108804595 A CN 108804595A
- Authority
- CN
- China
- Prior art keywords
- document
- word
- close
- word2vec
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a short text representation method based on word2vec, comprising the following steps. S1: input a training text set that has undergone text preprocessing, set the word2vec method parameters, and train to obtain the word-vector set corresponding to the training text set. S2: for each word in every document, compute, via the cosine distance between word vectors, the set of close words of that word across the entire training text set. S3: compute the cosine distance between the close words and the document for every document. S4: sort by cosine distance in descending order, and select the top n close words and their cosine distances to form the document's n close words and cosine measures. S5: compute the weights, within the document, of the document's words and of the n selected close words, form the new text representation, and output each document's word2vec-improved vector-space representation.
Description
Technical field
The present invention relates to the field of computer science and technology, and more particularly to a short text representation method based on word2vec.
Background technology
In text mining, before a machine can understand sample information, the samples must first pass through a text representation step that converts them into numerical form. With the continuous broadening of natural language processing and the development of computer technology, how best to represent the semantic information of text numerically has always been a vital research question in text processing, because it directly affects mining performance. For short text mining in particular, effective text representation is the hard part of the research: short texts generated on social platforms not only suffer from sparse features, incomplete semantics, polysemy, and synonymy, but also exhibit casual wording, heavy use of neologisms, and sheer volume.
Common text representation models include the Boolean model, probabilistic models, and the vector space model; the most widely used is the vector space model (VSM), proposed in 1958 by Gerard Salton et al. Its basic idea is to represent text as vectors: a subset of feature words is selected from the training set, each feature word becomes one dimension of a vector-space coordinate system, and each text is thereby formalized as a vector in a multi-dimensional space — a point in n-dimensional space — so that the similarity between texts can be measured by the angle or distance between their vectors (Tai Deji, Wang Jun. An improved feature-weighting algorithm for text classification [J]. Computer Engineering, 2010, 36(9): 197-199.). However, the vector space model suffers from a sparse data space and ignores the semantic relations between words, which makes it comparatively weak at representing short text. Some scholars have tried to remedy these defects. Wang B K et al. proposed a strong feature thesaurus (SFT) based on latent Dirichlet allocation and information gain; SFT combines LDA and IG to adjust vocabulary weights and select feature words carrying stronger semantic information (Wang B K, Huang Y F, Yang W X, et al. Short text classification based on strong feature thesaurus [J]. Journal of Zhejiang University - Science C (Computers & Electronics), 2012, 13(9): 649-659.). Yang et al. proposed a semantic expansion method that combines the lexical and semantic features of short text, using Wikipedia as a background knowledge base to obtain the semantic features of vocabulary and recomputing term weights from the combination of the lexical and semantic features (Yang L, Li C, Ding Q, et al. Combining Lexical and Semantic Features for Short Text Classification [J]. Procedia Computer Science, 2013, 22: 78-86.).
In 2013, the Tomas Mikolov team at Google released word2vec, an open-source word-vector generation tool based on deep learning (Mikolov T, Le Q V, Sutskever I. Exploiting similarities among languages for machine translation [J]. arXiv preprint arXiv:1309.4168, 2013; Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space [J]. arXiv preprint arXiv:1301.3781, 2013.). The algorithm can learn high-quality word vectors from a large corpus of real documents in a relatively short time, and the resulting vectors make it easy to compute the semantic similarity between words. Word2vec not only discovers the semantic relations between words; it also offers a new way to address the sparsity of short-text representations in the vector space model.
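The cosine measure between word vectors on which the method relies can be sketched as follows (a minimal illustration with made-up 3-dimensional vectors; real word2vec vectors are, e.g., 200-dimensional):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "word vectors" (assumed values, for illustration only).
vec_policy = [0.8, 0.1, 0.3]
vec_child = [0.7, 0.2, 0.4]
sim = cosine_similarity(vec_policy, vec_child)
print(round(sim, 3))
```

Vectors pointing in similar directions score close to 1, orthogonal vectors score 0; this is the "cosine distance" the method sorts by.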
Summary of the invention
The present invention addresses the sparse data space and missing semantics of the vector space model (VSM) by proposing a short text representation method based on word2vec; clustering results of short text represented with this method allow knowledge topics to be extracted more effectively.
To achieve the above object of the invention, the technical solution adopted is as follows:
A short text representation method based on word2vec, comprising the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec method parameters, and train to obtain the word-vector set corresponding to the training text set;
S2: for each word in every document, compute, via the cosine distance between word vectors, the set of close words of that word across the entire training text set;
S3: compute the cosine distance between the close words and the document for every document;
S4: sort by cosine distance in descending order, and select the top n close words and their cosine distances to form the document's n close words and cosine measures;
S5: compute the weights, within the document, of the document's words and of the n selected close words, form the new text representation, and output each document's word2vec-improved vector-space representation.
Preferably, the preprocessing of the training text set in step S1 comprises:
S1.1: build a user dictionary and perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words using an existing stop-word list, and remove pronouns, prepositions, and locative nouns according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF, or TF-IDF to reduce the feature dimensionality.
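A minimal sketch of the TF-IDF feature scoring of step S1.3 (pure Python; the segmentation and filtering of S1.1-S1.2 are assumed to have already produced token lists, and the token values are hypothetical):

```python
import math
from collections import Counter

def tfidf_scores(docs):
    # docs: list of token lists (already segmented, stop words removed).
    # TF-IDF(t, d) = (count of t in d / len(d)) * log(N / df(t)).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (tf[t] / total) * math.log(n / df[t]) for t in tf})
    return scores

docs = [["policy", "child", "education"],
        ["policy", "income", "pressure"],
        ["education", "medical", "care"]]
scores = tfidf_scores(docs)
# "policy" appears in 2 of 3 docs, so its IDF (and score) is lower
# than that of "child", which appears in only 1 document.
```

Feature selection then keeps only the highest-scoring terms, reducing the dimensionality as the step describes.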
Preferably, the specific calculation of step S3 is as follows:
if several words in a document share the same close word, the cosine distances of that shared close word are summed to form the cosine distance between the close word and the document; otherwise, the original close word and its cosine distance to the word in the document are retained:
s(t, d) = s(t, t1) + s(t, t2) + s(t, t3) + … + s(t, tn)   (1)
where t, t1, t2, t3, …, tn are words in document d, s(t, tn) denotes the cosine measure between word t and word tn in document d, and s(t, d) denotes the cosine measure between word t and document d.
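The summation of formula (1) — accumulating the cosine measures whenever several document words share the same close word — can be sketched as follows (the close-word mapping and its cosine values are hypothetical):

```python
def close_word_document_scores(close_words):
    # close_words: {document word -> {close word: cosine measure}}.
    # When several document words share a close word t, formula (1)
    # sums their cosine measures: s(t, d) = s(t, t1) + ... + s(t, tn).
    s = {}
    for word, neighbours in close_words.items():
        for t, cos in neighbours.items():
            s[t] = s.get(t, 0.0) + cos
    return s

# Hypothetical close words for a document containing "二孩" and "政策".
close_words = {
    "二孩": {"生育": 0.71, "政策放开": 0.64},
    "政策": {"生育": 0.58, "放开": 0.55},
}
s = close_word_document_scores(close_words)
print(round(s["生育"], 2))  # shared close word: 0.71 + 0.58 -> 1.29
```

A close word shared by only one document word simply keeps its original cosine measure, matching the "otherwise" branch of the step.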
Preferably, the specific procedure by which step S5 computes the weights of the document's words and of the n selected close words within the document is as follows:
where W(t, nd) is the weight of word t in document nd after the n close words are added, computed with the feature-weighting method TF-IDF, and s(t, d) denotes the cosine measure between word t and document d.
Compared with the prior art, the beneficial effects of the invention are:
(1) The present invention proposes a short text representation method based on word2vec, in which word2vec is used to find the close words of each word in a text and the computed close words then serve as expansion features of the text in the vector space model. This feature expansion both takes the semantic relations between words into account and alleviates the feature sparsity of the vector space model.
(2) Experimental results show that, in both the text clustering and the text classification stages of the experiments, the word2vec-based short text representation method performs significantly better than the traditional vector space model: the clustering DB_index decreases by 0.704 on average, and classification accuracy increases by 4.614% on average. This shows that the method improves clustering quality at both the technical and application levels and can better extract the knowledge topics in a corpus.
Brief description of the drawings
Fig. 1 shows the process by which the word2vec-based improved vector space model method represents short text.
Fig. 2 is a line chart of DB_index versus feature dimensionality at different cluster counts for text represented by the traditional vector space model.
Fig. 3 is a line chart of DB_index versus feature dimensionality at different cluster counts for text represented by the method of the invention.
Fig. 4 is a bar chart of the clustering DB_index values of text represented by the traditional vector space model and by the method of the invention.
Fig. 5 is a bar chart of the classification accuracy, as a function of feature dimensionality, of text represented by the traditional vector space model and by the method of the invention.
Detailed description of the embodiments
The attached figures are for illustration only and shall not be construed as limiting the patent; the invention is further elaborated below with reference to the drawings and embodiments.
Embodiment 1
The above and other technical features and advantages of the invention are described in more detail below with reference to the drawings, taking a short-text corpus on the universal two-child policy as the example in this embodiment.
The training text set was acquired and preprocessed as follows:
The universal two-child-policy short-text corpus used in the experiments was obtained by crawling Sina Weibo; after the necessary cleaning and filtering of the crawled data, 102,300 usable records remained as the experimental corpus. After Chinese word segmentation and part-of-speech tagging with the Java edition of the NLPIR2016 segmentation system, the Harbin Institute of Technology stop-word list was imported to remove stop words, and words without practical meaning — pronouns, prepositions, locative nouns — were removed according to part of speech. Among the unsupervised feature selection methods currently in common use (TF, IDF, TF-IDF), this example uses TF-IDF to perform feature selection and reduce the feature dimensionality.
(1) Short text representation process
Fig. 1 shows the process of representing short text with the word2vec-based short text representation method; the specific steps are as follows:
S1: Before applying the method, word-vector files must first be generated from the input data under Linux with Google's word2vec open-source tool. The text data obtained after preprocessing the training set serves as the data set for word-vector generation, with the parameters -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1, i.e. the Skip-Gram model with a training window of 5, generating 200-dimensional word vectors.
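The flag string above can be restated, for reference, as keyword arguments of the gensim re-implementation of word2vec (an assumed equivalence for gensim >= 4.0 parameter names; the patent itself uses Google's original C tool, not gensim):

```python
# CLI flags of Google's word2vec tool used in step S1, mapped onto
# gensim Word2Vec keyword arguments (assumed equivalence, gensim >= 4.0).
gensim_kwargs = {
    "sg": 1,             # -cbow 0      -> Skip-Gram model
    "vector_size": 200,  # -size 200    -> 200-dimensional vectors
    "window": 5,         # -window 5    -> training window of 5
    "negative": 0,       # -negative 0  -> no negative sampling
    "hs": 1,             # -hs 1        -> hierarchical softmax
    "sample": 1e-3,      # -sample 1e-3 -> subsampling threshold
    "workers": 12,       # -threads 12
}
# Training would then be (not executed here, gensim assumed installed):
# from gensim.models import Word2Vec
# model = Word2Vec(corpus_of_token_lists, **gensim_kwargs)
print(gensim_kwargs["sg"], gensim_kwargs["vector_size"])
```

The -binary 1 flag only controls the output file format of the C tool and has no gensim counterpart in this mapping.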
S2: Using the word2vec method, the close words of each word in the text across the entire training text set are computed from the word vectors. Table 1 shows words from one text together with their corresponding close words and cosine distance values.
Table 1. Close words and cosine distances of selected words obtained by the word2vec method
S3: The cosine distance between the close words and the document is computed for every document according to formula (1).
S4: The number n of close words per document must be chosen with care: if n is too small, each document has too few close words to consider after feature selection; if n is too large, the computation and running time of the representation step increase sharply. The present invention sets n = 50, i.e. the top 50 close words of a document and their corresponding cosine distances serve as the document's expansion features.
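Selecting the top-n close words of step S4 is a plain sort-and-truncate over the scores from formula (1) (hypothetical scores; the patent uses n = 50, a smaller n is shown here for brevity):

```python
def top_n_close_words(scores, n):
    # scores: {close word: cosine distance to the document}, from formula (1).
    # Sort by cosine distance, descending, and keep the top n.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

scores = {"生育": 1.29, "放开": 0.55, "政策放开": 0.64, "压力": 0.40}
print(top_n_close_words(scores, 2))  # -> [('生育', 1.29), ('政策放开', 0.64)]
```

If a document yields fewer than n close words, the truncation simply returns them all.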
S5: The weights, within the document, of the document's words and of the n selected close words are computed according to formula (2), forming the new text representation; each document's word2vec-improved vector-space representation is output.
(2) Evaluation method
Using K-means clustering, documents represented by the traditional method and by the word2vec-based short text representation method are clustered separately at different feature dimensionalities, computing the DB_index for cluster counts from 5 to 15; the cluster count is determined by seeking the minimum DB_index.
Fig. 2 shows the line chart of DB_index versus feature dimensionality at different cluster counts for text represented by the traditional vector space model; Fig. 3 shows the corresponding line chart for the word2vec-based short text representation method.
As Figs. 2 and 3 show, for both the traditional vector space model and the word2vec-based method, when the cluster count is 13 the within-cluster variance and the between-cluster separation remain relatively stable regardless of feature dimensionality, the category partition is comparatively stable, and DB_index attains a minimum; the cluster count of 13 is therefore selected as optimal.
(1) DB_index
DB_index = (1/k) · Σ_{i=1..k} max_{j≠i} (S_i + S_j) / d_ij
where k is the number of clusters, d_ij is the distance between the centers of clusters i and j, and S_i is the average distance of the samples in cluster i to the cluster center.
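The DB_index used to choose the cluster count — in the standard Davies-Bouldin form consistent with the definitions above, DB = (1/k) · Σ_i max_{j≠i} (S_i + S_j) / d_ij — can be sketched for precomputed cluster statistics (a toy example with assumed values of S_i and center distances, not the experiment's data):

```python
def db_index(S, centers_dist):
    # S[i]: average distance of cluster i's samples to its center.
    # centers_dist[i][j]: distance between the centers of clusters i and j.
    # DB = (1/k) * sum_i max_{j != i} (S_i + S_j) / d_ij
    k = len(S)
    total = 0.0
    for i in range(k):
        total += max((S[i] + S[j]) / centers_dist[i][j]
                     for j in range(k) if j != i)
    return total / k

# Toy statistics for k = 3 clusters (assumed values).
S = [1.0, 2.0, 1.5]
d = [[0.0, 4.0, 5.0],
     [4.0, 0.0, 3.0],
     [5.0, 3.0, 0.0]]
print(round(db_index(S, d), 3))  # -> 1.028
```

Lower DB_index means tighter and better-separated clusters, which is why the minimum over cluster counts 5-15 determines the final cluster number.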
Since the optimal clustering is obtained with 13 clusters, the clustering quality of the two text representation methods is compared at that cluster count. Fig. 4 is a bar chart of the DB_index values of the two document representation methods at different feature dimensionalities with 13 clusters. As Fig. 4 shows, with 13 clusters and feature dimensionalities between 200 and 2000, the word2vec-based short text representation method obtains lower DB_index values than the traditional vector space model. This indicates that representing text with the word2vec-based method expresses the text better, yielding greater within-cluster cohesion and greater between-cluster separation during clustering.
(2) Interpretation of the clustering results
Since the word2vec-based short text representation method attains its minimum DB_index of 1.168 at a feature dimensionality of 200, the clustering results at that setting are interpreted, as shown in Table 2.
As Table 2 shows, after the universal two-child policy took effect, several livelihood issues demand immediate attention and solutions, including category 1 (education and medical care), category 4 (late marriage and childbearing), and category 11 (women's employment). These are the two-child issues the public fed back immediately after the policy opened, the negative effects the policy has caused; the relevant authorities should take note and introduce corresponding measures. Meanwhile, categories 2, 6, and 9 reflect the economic burden and life stress the policy brings: people need more opportunities to raise personal income and quality of life before they can consider a second child, which in turn presses the government to implement more welfare policies and provide fuller means of raising incomes, such as more comprehensive employment; otherwise, even with the policy open, willingness to bear children will remain low under the pressure of real life, and population aging will not be eased. Categories 3, 8, and 13 mainly involve the family-internal problems a second child may bring; although not directly related to matters such as policy implementation, these are the worries each citizen weighs when deciding whether to respond to the policy, and the public can be trusted to have its own appropriate ways and judgment. Categories 5, 10, and 12 mainly involve the public's views of and feelings about the universal two-child policy, and most of this content reflects support for and expectation of it; evidently the policy still meets public demand.
Table 2. Feature words of the different clusters and example texts within each class
The clusters formed from short text represented by the method of the invention show that each category in the clustering result is well interpretable, making it easier to extract the knowledge topics in each cluster.
(3) Text classification accuracy with the clustering result as training corpus
Test-set documents were manually classified according to the feature words and interpretation of each category, producing the test-set category labels. With the manually labeled documents as the test set and the category assignments produced by text clustering as the training set, the accuracy of the training corpus built automatically by clustering was verified. Text was represented with the traditional vector space model and with the word2vec-based short text representation method respectively, reusing the TF-IDF feature selection of the clustering stage; the classification results obtained with different classifiers at different feature dimensionalities are shown in Table 3.
Table 3. Accuracy of different classifiers versus feature dimensionality under the different document representation methods
To compare the two document representation methods' classification performance more intuitively, a bar chart of classification accuracy versus feature dimensionality can be drawn from Table 3, as shown in Fig. 5.
As Fig. 5 shows, under the word2vec-based short text representation method, the training corpus built automatically by clustering achieves classification accuracy above 80% at every feature dimensionality except 100 (where the features are probably too few to distinguish the categories). Moreover, for every feature dimensionality and every classifier, the accuracy of the word2vec-based method exceeds that of the traditional vector space model; the improvement is only 2.38% at a feature dimensionality of 500 with the SVM classifier, and between 3.16% and 6.87% in all other cases. This shows that a corpus built by clustering with such a document representation better distinguishes the knowledge topics in the corpus and achieves better results in application.
Obviously, the above embodiment is merely an example given for clarity of illustration and is not a limitation on the embodiments of the invention. Those of ordinary skill in the art may make other variations or changes on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.
Claims (4)
1. A short text representation method based on word2vec, characterized by comprising the following steps:
S1: input a training text set that has undergone text preprocessing, set the word2vec method parameters, and train to obtain the word-vector set corresponding to the training text set;
S2: for each word in every document, compute, via the cosine distance between word vectors, the set of close words of that word across the entire training text set;
S3: compute the cosine distance between the close words and the document for every document;
S4: sort by cosine distance in descending order, and select the top n close words and their cosine distances to form the document's n close words and cosine measures;
S5: compute the weights, within the document, of the document's words and of the n selected close words, form the new text representation, and output each document's word2vec-improved vector-space representation.
2. The word2vec-based short text representation method according to claim 1, characterized in that the preprocessing of the training text set in step S1 comprises:
S1.1: build a user dictionary and perform word segmentation and part-of-speech tagging on the training text;
S1.2: remove stop words using an existing stop-word list, and remove pronouns, prepositions, and locative nouns according to part of speech;
S1.3: perform feature selection with methods such as TF, IDF, or TF-IDF to reduce the feature dimensionality.
3. The word2vec-based short text representation method according to claim 1, characterized in that the specific calculation of step S3 is as follows:
if several words in a document share the same close word, the cosine distances of that shared close word are summed to form the cosine distance between the close word and the document; otherwise, the original close word and its cosine distance to the word in the document are retained:
s(t, d) = s(t, t1) + s(t, t2) + s(t, t3) + … + s(t, tn)   (1)
where t, t1, t2, t3, …, tn are words in document d, s(t, tn) denotes the cosine measure between word t and word tn in document d, and s(t, d) denotes the cosine measure between word t and document d.
4. The word2vec-based short text representation method according to claim 3, characterized in that the specific procedure by which step S5 computes the weights of the document's words and of the n selected close words within the document is as follows:
where W(t, nd) is the weight of word t in document nd after the n close words are added, computed with the feature-weighting method TF-IDF, and s(t, d) denotes the cosine measure between word t and document d.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810525103.8A CN108804595B (en) | 2018-05-28 | 2018-05-28 | Short text representation method based on word2vec |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804595A true CN108804595A (en) | 2018-11-13 |
CN108804595B CN108804595B (en) | 2021-07-27 |
Family
ID=64090655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810525103.8A Expired - Fee Related CN108804595B (en) | 2018-05-28 | 2018-05-28 | Short text representation method based on word2vec |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804595B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279288A (en) * | 2015-12-04 | 2016-01-27 | 深圳大学 | Online content recommending method based on deep neural network |
CN107102989A (en) * | 2017-05-24 | 2017-08-29 | 南京大学 | A kind of entity disambiguation method based on term vector, convolutional neural networks |
CN107590218A (en) * | 2017-09-01 | 2018-01-16 | 南京理工大学 | The efficient clustering method of multiple features combination Chinese text based on Spark |
Non-Patent Citations (1)
Title |
---|
Chen Jie et al., "Document classification method based on Word2vec", Computer Systems & Applications (《计算机系统应用》) * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162620A (en) * | 2019-01-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Black detection method, device, server and the storage medium for producing advertisement |
CN110162620B (en) * | 2019-01-10 | 2023-08-18 | 腾讯科技(深圳)有限公司 | Method and device for detecting black advertisements, server and storage medium |
CN110232128A (en) * | 2019-06-21 | 2019-09-13 | 华中师范大学 | Topic file classification method and device |
CN110442873A (en) * | 2019-08-07 | 2019-11-12 | 云南电网有限责任公司信息中心 | A kind of hot spot work order acquisition methods and device based on CBOW model |
CN110705304A (en) * | 2019-08-09 | 2020-01-17 | 华南师范大学 | Attribute word extraction method |
CN111177401A (en) * | 2019-12-12 | 2020-05-19 | 西安交通大学 | Power grid free text knowledge extraction method |
CN112257431A (en) * | 2020-10-30 | 2021-01-22 | 中电万维信息技术有限责任公司 | NLP-based short text data processing method |
Also Published As
Publication number | Publication date |
---|---|
CN108804595B (en) | 2021-07-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210727 |