CN112257431A - NLP-based short text data processing method - Google Patents
- Publication number
- CN112257431A CN112257431A CN202011184771.2A CN202011184771A CN112257431A CN 112257431 A CN112257431 A CN 112257431A CN 202011184771 A CN202011184771 A CN 202011184771A CN 112257431 A CN112257431 A CN 112257431A
- Authority
- CN
- China
- Prior art keywords
- text data
- short text
- cosine distance
- word
- cosine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of natural language processing, in particular to an NLP-based short text data processing method. The method comprises the following steps: acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardizing the short text data according to the calculated cosine distance. It addresses the problems of manual short text data processing — low efficiency, inaccuracy, difficulty handling large volumes of data, and the large amount of manpower and material resources consumed.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a short text data processing method based on NLP.
Background
With the rapid development of network information technology and the gradual shift from traditional paper records to digital information, more and more information, especially short text, accumulates in networks. Most short text data is collected by information systems and stored in relational databases. The same meaning is often expressed in many different forms. For example, suppose an employment questionnaire contains a free-text (non-multiple-choice) field for type of work, and this field is now to be analyzed statistically; the data collected for the field takes many forms that all express the same meaning, such as: restaurant waiter, restaurant, food waiter, noodle puller, cook, pastry chef, beef noodles, hotel waiter, stir-fry cook, hotel, dining hall, restaurant waiter, the food and beverage industry, hot pot restaurant, and so on, all of which denote the food and beverage industry. Before the field can be analyzed statistically, the short text data collected for it must be processed so that the statistics are accurate. The traditional approach is manual processing, which has several disadvantages: first, it consumes a large amount of manpower and material resources; second, the results often fail to meet requirements. As data volumes keep growing, this inefficient manual approach faces ever greater difficulties, and automated short text data processing is the natural direction of development.
With the development of NLP, natural language processing has become a highly adaptable intelligent technology for handling the varied ways humans express themselves, so it can be effectively applied to the field of short text data processing.
Disclosure of Invention
The embodiment of the invention aims to provide an NLP-based short text data processing method, to solve the problems of manual short text data processing: low efficiency, inaccuracy, difficulty handling large volumes of data, and the large amount of manpower and material resources consumed.
A method of NLP-based short text data processing, the method comprising the steps of:
acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardization of the short text data according to the calculated cosine distance;
in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop-word removal step, stop words are deleted from the segmented short text data through the NLTK tool;
in the bag-of-words step, every word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus preparation step, the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the prepared corpus is used as the input of a TfidfModel for model training;
in the cosine distance step, similarity is evaluated by computing the cosine of the angle between two vectors; cosine similarity places the vectors in a vector space according to their coordinates. In the most common two-dimensional case, if vectors a and b have coordinates (x1, y1) and (x2, y2), the corresponding cosine value is:
cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))
Generalizing to n dimensions, let A = (A1, A2, …, An) and B = (B1, B2, …, Bn); then:
cos θ = (A1·B1 + A2·B2 + … + An·Bn) / (√(A1² + A2² + … + An²) · √(B1² + B2² + … + Bn²))
The smaller the angle, the closer the cosine value is to 1 and the more nearly identical the directions, that is, the more similar the texts. For the non-negative vectors used here, the cosine lies between 0 and 1, and a value close to 1 indicates that the two texts are similar.
The short text data is then standardized according to the calculated cosine distance: when the cosine distance between a collected short text and a standard short text is greater than 0.8, the collected short text is replaced with the standard short text, and in this way short text data that is not standard is standardized.
The invention has the beneficial effects that:
the invention discloses a method for processing short text data based on NLP, which comprises the steps of dividing words by jiaba, removing stop words (stopwords), obtaining word bags, making corpora, carrying out TF-IDF processing, calculating cosine distance, standardizing the short text data according to the calculated cosine distance and the like, can standardize the short text data collected in a form, and is favorable for application and statistical analysis of the data. Compared with the prior art, the method has the following advantages due to the adoption of the short text data processing method: the problems that the manual processing efficiency of short text data is low, the short text data is not accurate and the large data is difficult to process are solved, so that a large amount of manpower and material resources are consumed.
Drawings
Fig. 1 is a flowchart of the NLP-based short text data processing method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples, and for convenience of description, only parts related to the examples of the present invention are shown. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A method of NLP-based short text data processing, the method comprising the steps of:
acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardization of the short text data according to the calculated cosine distance;
in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop-word removal step, stop words are deleted from the segmented short text data through the NLTK tool;
in the bag-of-words step, every word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus preparation step, the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the prepared corpus is used as the input of a TfidfModel for model training;
in the cosine distance step, similarity is evaluated by computing the cosine of the angle between two vectors; cosine similarity places the vectors in a vector space according to their coordinates. In the most common two-dimensional case, if vectors a and b have coordinates (x1, y1) and (x2, y2), the corresponding cosine value is:
cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))
Generalizing to n dimensions, let A = (A1, A2, …, An) and B = (B1, B2, …, Bn); then:
cos θ = (A1·B1 + A2·B2 + … + An·Bn) / (√(A1² + A2² + … + An²) · √(B1² + B2² + … + Bn²))
The smaller the angle, the closer the cosine value is to 1 and the more nearly identical the directions, that is, the more similar the texts. For the non-negative vectors used here, the cosine lies between 0 and 1, and a value close to 1 indicates that the two texts are similar.
The short text data is then standardized according to the calculated cosine distance: when the cosine distance between a collected short text and a standard short text is greater than 0.8, the collected short text is replaced with the standard short text, and in this way short text data that is not standard is standardized.
Examples
Fig. 1 is a flowchart of an NLP-based short text data processing method according to an embodiment of the present invention, the method comprising:
s101, short text data acquisition:
we synchronize short text data in the traffic database to the local TXT file through the DataX tool.
S102 jieba word segmentation:
and (4) cutting the sentences most accurately by using the short text data acquired in the step (S101) in a line unit and in an accurate mode.
S103 stop word (stopwords):
and deleting stop words contained in the jieba divided words in the step S102 by loading the Chinese stop words accumulated by us by using the NLTK.
S104, acquiring a bag of words:
by using the genim library, all words appearing in the corpus are assigned a unique integer id, such as: { 'restaurant': 0, 'fry': 1, 'pull': 2, 'open': 3, 'member': 4, 'restaurant': 5, 'noodle pulling': 6, 'hotel': 7, 'restaurant': 8, 'chafing dish': 9, 'dining': 10, 'service': 11 }.
S105, language material preparation:
by using word2vec in the genim library, count the number of occurrences of each different word, convert the word to an integer word id, and return the result as a sparse vector, such as: [ [ (0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1) ], [ (2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1) ] ], a first term of each tuple corresponds to the ID of a symbol in the dictionary, and a second term corresponds to the number of times the symbol occurs.
S106, TF-IDF processing is carried out:
and (5) taking the linguistic data manufactured in the step (S105) as the input of the Tfidf model, and performing model training.
S107 calculates the cosine distance:
and calculating the cosine distance by directly calling the class for calculating the cosine distance of the sparse matrix in the gensim library, wherein the data returned in the step S106 is used as the input of the step.
S108, normalizing the short text data according to the calculated cosine distance:
and judging the similarity of the short texts according to the cosine values calculated in the step S107, and when the cosine distance is greater than 0.8, replacing the short text data with standard short text data, wherein the short text data which is not standard can be standardized by the mode.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (2)
1. A method for NLP-based short text data processing, the method comprising the steps of:
acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardization of the short text data according to the calculated cosine distance;
wherein in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop-word removal step, stop words are deleted from the segmented short text data through the NLTK tool;
in the bag-of-words step, every word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus preparation step, the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the prepared corpus is used as the input of a TfidfModel for model training;
in the cosine distance step, similarity is evaluated by computing the cosine of the angle between two vectors, placing the vectors in a vector space according to their coordinates; if vectors a and b have coordinates (x1, y1) and (x2, y2) respectively, the corresponding cosine value is
cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²)),
and generalizing to n dimensions with A = (A1, A2, …, An) and B = (B1, B2, …, Bn),
cos θ = (A1·B1 + A2·B2 + … + An·Bn) / (√(A1² + A2² + … + An²) · √(B1² + B2² + … + Bn²));
the smaller the angle, the closer the cosine value is to 1, the more similar the two texts; the cosine distance lies between 0 and 1, and the closer it is to 1, the more similar the two texts are.
2. The method for NLP-based short text data processing as claimed in claim 1, wherein in the standardization of the short text data according to the calculated cosine distance, when the cosine distance is greater than 0.8, the non-standard short text data is replaced with standard short text data, whereby short text data that is not standard is standardized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011184771.2A CN112257431A (en) | 2020-10-30 | 2020-10-30 | NLP-based short text data processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011184771.2A CN112257431A (en) | 2020-10-30 | 2020-10-30 | NLP-based short text data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112257431A (en) | 2021-01-22 |
Family
ID=74268854
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011184771.2A Pending CN112257431A (en) | 2020-10-30 | 2020-10-30 | NLP-based short text data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112257431A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334495A (en) * | 2018-01-30 | 2018-07-27 | 国家计算机网络与信息安全管理中心 | Short text similarity calculating method and system |
CN108628825A (en) * | 2018-04-10 | 2018-10-09 | 平安科技(深圳)有限公司 | Text message Similarity Match Method, device, computer equipment and storage medium |
CN108804595A (en) * | 2018-05-28 | 2018-11-13 | 中山大学 | A kind of short text representation method based on word2vec |
CN110362819A (en) * | 2019-06-14 | 2019-10-22 | 中电万维信息技术有限责任公司 | Text emotion analysis method based on convolutional neural networks |
CN110956033A (en) * | 2019-12-04 | 2020-04-03 | 北京中电普华信息技术有限公司 | Text similarity calculation method and device |
CN111488429A (en) * | 2020-03-19 | 2020-08-04 | 杭州叙简科技股份有限公司 | Short text clustering system based on search engine and short text clustering method thereof |
CN111523328A (en) * | 2020-04-13 | 2020-08-11 | 中博信息技术研究院有限公司 | Intelligent customer service semantic processing method |
- 2020-10-30: CN application CN202011184771.2A filed; publication CN112257431A, status pending
Non-Patent Citations (2)
Title |
---|
MOLEARNER: "Text similarity analysis (based on jieba and gensim)", 《HTTPS://WWW.CNBLOGS.COM/WKSLEARNER/P/10505562.HTML》 * |
Pan Yongqing: "Similarity calculation methods (3): cosine similarity", 《HTTPS://BLOG.CSDN.NET/U014539465/ARTICLE/DETAILS/105353638》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Afzaal et al. | Tourism mobile app with aspect-based sentiment classification framework for tourist reviews | |
CN109165294B (en) | Short text classification method based on Bayesian classification | |
CN105843897B (en) | A kind of intelligent Answer System towards vertical field | |
CN111177591B (en) | Knowledge graph-based Web data optimization method for visual requirements | |
CN111143479A (en) | Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm | |
CN101079025B (en) | File correlation computing system and method | |
CN107203507B (en) | Feature vocabulary extracting method and device | |
CN102411621A (en) | Chinese query-oriented multi-document automatic abstracting method based on cloud model | |
CN103678287B (en) | A kind of method that keyword is unified | |
CN112364172A (en) | Method for constructing knowledge graph in government official document field | |
Ahlgren | Research on sentiment analysis: the first decade | |
CN115238071A (en) | Data standard generation method, storage medium and system based on similar clustering and data exploration | |
Bagalkotkar et al. | A novel technique for efficient text document summarization as a service | |
CN109614484A (en) | A kind of Text Clustering Method and its system based on classification effectiveness | |
CN109101488B (en) | Word semantic similarity calculation method based on known network | |
CN106021424B (en) | A kind of literature author's duplication of name detection method | |
CN113360647A (en) | 5G mobile service complaint source-tracing analysis method based on clustering | |
CN106503153B (en) | Computer text classification system | |
CN112052401A (en) | Recommendation method based on user comments | |
CN112257431A (en) | NLP-based short text data processing method | |
CN103729348B (en) | A kind of analysis method of sentence translation complexity | |
CN108628875B (en) | Text label extraction method and device and server | |
CN107357918B (en) | Text representation method based on graph | |
WO2021142968A1 (en) | Multilingual-oriented semantic similarity calculation method for general place names, and application thereof | |
CN102346777B (en) | A kind of method and apparatus that illustrative sentence retrieval result is ranked up |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | |
Application publication date: 20210122 |