CN112257431A - NLP-based short text data processing method - Google Patents
- Publication number
- CN112257431A CN112257431A CN202011184771.2A CN202011184771A CN112257431A CN 112257431 A CN112257431 A CN 112257431A CN 202011184771 A CN202011184771 A CN 202011184771A CN 112257431 A CN112257431 A CN 112257431A
- Authority
- CN
- China
- Prior art keywords
- text data
- short text
- cosine distance
- word
- cosine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of natural language processing, in particular to an NLP-based short text data processing method. The method comprises the following steps: acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardizing the short text data according to the calculated cosine distance. It addresses the problems of manual short text data processing — low efficiency, inaccuracy, difficulty handling large volumes of data, and the large amount of manpower and material resources consumed.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a short text data processing method based on NLP.
Background
With the rapid development of network information technology and the gradual shift from traditional paper records to digital information, more and more information, especially short text, accumulates in networks. Most short text data is collected by information systems and stored in relational databases. The same meaning is often expressed in many different forms. For example, suppose an employment questionnaire contains a free-text (non-multiple-choice) field for type of work, and this field is now to be analyzed statistically; the data collected for the field takes many forms that all express the same meaning, such as: restaurant waiter, restaurant, food waiter, noodle puller, cook, pastry chef, beef noodles, hotel waiter, stir-fry cook, hotel, dining hall, restaurant waiter, the food and beverage industry, hot pot restaurant, and so on, all of which denote the food and beverage industry. Before the field can be analyzed statistically, the short text data collected for it must be processed so that the statistics are accurate. The traditional approach is manual processing, which has several disadvantages: first, it consumes a large amount of manpower and material resources; second, the results often fail to meet requirements. As data volumes keep growing, this inefficient manual approach faces ever greater difficulties, and automated short text data processing is the natural direction of development.
With the development of NLP, natural language processing has become a highly adaptable intelligent technology for handling the varied ways humans express themselves, so it can be effectively applied to the field of short text data processing.
Disclosure of Invention
The embodiment of the invention aims to provide an NLP-based short text data processing method, to solve the problems of manual short text data processing: low efficiency, inaccuracy, difficulty handling large volumes of data, and the large amount of manpower and material resources consumed.
A method of NLP-based short text data processing, the method comprising the steps of:
acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardization of the short text data according to the calculated cosine distance;
in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop-word removal step, stop words are deleted from the segmented short text data through the NLTK tool;
in the bag-of-words step, every word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus preparation step, the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the prepared corpus is used as the input of a TfidfModel for model training;
in the cosine distance step, similarity is evaluated by computing the cosine of the angle between two vectors; cosine similarity places the vectors in a vector space according to their coordinates. In the most common two-dimensional case, if vectors a and b have coordinates (x1, y1) and (x2, y2), the corresponding cosine value is:
cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))
Generalizing to n dimensions, let A = (A1, A2, …, An) and B = (B1, B2, …, Bn); then:
cos θ = (A1·B1 + A2·B2 + … + An·Bn) / (√(A1² + A2² + … + An²) · √(B1² + B2² + … + Bn²))
The smaller the angle, the closer the cosine value is to 1 and the more nearly identical the directions, that is, the more similar the texts. For the non-negative vectors used here, the cosine lies between 0 and 1, and a value close to 1 indicates that the two texts are similar.
The short text data is then standardized according to the calculated cosine distance: when the cosine distance between a collected short text and a standard short text is greater than 0.8, the collected short text is replaced with the standard short text, and in this way short text data that is not standard is standardized.
The invention has the beneficial effects that:
the invention discloses a method for processing short text data based on NLP, which comprises the steps of dividing words by jiaba, removing stop words (stopwords), obtaining word bags, making corpora, carrying out TF-IDF processing, calculating cosine distance, standardizing the short text data according to the calculated cosine distance and the like, can standardize the short text data collected in a form, and is favorable for application and statistical analysis of the data. Compared with the prior art, the method has the following advantages due to the adoption of the short text data processing method: the problems that the manual processing efficiency of short text data is low, the short text data is not accurate and the large data is difficult to process are solved, so that a large amount of manpower and material resources are consumed.
Drawings
Fig. 1 is a flowchart of the NLP-based short text data processing method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples, and for convenience of description, only parts related to the examples of the present invention are shown. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A method of NLP-based short text data processing, the method comprising the steps of:
acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardization of the short text data according to the calculated cosine distance;
in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop-word removal step, stop words are deleted from the segmented short text data through the NLTK tool;
in the bag-of-words step, every word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus preparation step, the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the prepared corpus is used as the input of a TfidfModel for model training;
in the cosine distance step, similarity is evaluated by computing the cosine of the angle between two vectors; cosine similarity places the vectors in a vector space according to their coordinates. In the most common two-dimensional case, if vectors a and b have coordinates (x1, y1) and (x2, y2), the corresponding cosine value is:
cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))
Generalizing to n dimensions, let A = (A1, A2, …, An) and B = (B1, B2, …, Bn); then:
cos θ = (A1·B1 + A2·B2 + … + An·Bn) / (√(A1² + A2² + … + An²) · √(B1² + B2² + … + Bn²))
The smaller the angle, the closer the cosine value is to 1 and the more nearly identical the directions, that is, the more similar the texts. For the non-negative vectors used here, the cosine lies between 0 and 1, and a value close to 1 indicates that the two texts are similar.
The short text data is then standardized according to the calculated cosine distance: when the cosine distance between a collected short text and a standard short text is greater than 0.8, the collected short text is replaced with the standard short text, and in this way short text data that is not standard is standardized.
Examples
Fig. 1 is a flowchart of an NLP-based short text data processing method according to an embodiment of the present invention, the method comprising:
s101, short text data acquisition:
we synchronize short text data in the traffic database to the local TXT file through the DataX tool.
S102 jieba word segmentation:
and (4) cutting the sentences most accurately by using the short text data acquired in the step (S101) in a line unit and in an accurate mode.
S103 stop word (stopwords):
and deleting stop words contained in the jieba divided words in the step S102 by loading the Chinese stop words accumulated by us by using the NLTK.
S104, acquiring a bag of words:
by using the genim library, all words appearing in the corpus are assigned a unique integer id, such as: { 'restaurant': 0, 'fry': 1, 'pull': 2, 'open': 3, 'member': 4, 'restaurant': 5, 'noodle pulling': 6, 'hotel': 7, 'restaurant': 8, 'chafing dish': 9, 'dining': 10, 'service': 11 }.
S105, language material preparation:
by using word2vec in the genim library, count the number of occurrences of each different word, convert the word to an integer word id, and return the result as a sparse vector, such as: [ [ (0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1) ], [ (2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1) ] ], a first term of each tuple corresponds to the ID of a symbol in the dictionary, and a second term corresponds to the number of times the symbol occurs.
S106, TF-IDF processing is carried out:
and (5) taking the linguistic data manufactured in the step (S105) as the input of the Tfidf model, and performing model training.
S107 calculates the cosine distance:
and calculating the cosine distance by directly calling the class for calculating the cosine distance of the sparse matrix in the gensim library, wherein the data returned in the step S106 is used as the input of the step.
S108, normalizing the short text data according to the calculated cosine distance:
and judging the similarity of the short texts according to the cosine values calculated in the step S107, and when the cosine distance is greater than 0.8, replacing the short text data with standard short text data, wherein the short text data which is not standard can be standardized by the mode.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (2)
1. A method for NLP-based short text data processing, the method comprising the steps of:
acquiring short text data, jieba word segmentation, stop-word removal, bag-of-words acquisition, corpus preparation, TF-IDF processing, cosine distance calculation, and standardization of the short text data according to the calculated cosine distance;
wherein in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop-word removal step, stop words are deleted from the segmented short text data through the NLTK tool;
in the bag-of-words step, every word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus preparation step, the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the prepared corpus is used as the input of a TfidfModel for model training;
in the cosine distance step, similarity is evaluated by computing the cosine of the angle between two vectors, placing the vectors in a vector space according to their coordinates; if vectors a and b have coordinates (x1, y1) and (x2, y2) respectively, the corresponding cosine value is
cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²)),
and generalizing to n dimensions with A = (A1, A2, …, An) and B = (B1, B2, …, Bn),
cos θ = (A1·B1 + A2·B2 + … + An·Bn) / (√(A1² + A2² + … + An²) · √(B1² + B2² + … + Bn²));
the smaller the angle, the closer the cosine value is to 1, the more similar the two texts; the cosine distance lies between 0 and 1, and the closer it is to 1, the more similar the two texts are.
2. The method for NLP-based short text data processing as claimed in claim 1, wherein in the standardization of the short text data according to the calculated cosine distance, when the cosine distance is greater than 0.8, the non-standard short text data is replaced with standard short text data, whereby short text data that is not standard is standardized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011184771.2A CN112257431A (en) | 2020-10-30 | 2020-10-30 | NLP-based short text data processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011184771.2A CN112257431A (en) | 2020-10-30 | 2020-10-30 | NLP-based short text data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112257431A (en) | 2021-01-22 |
Family
ID=74268854
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011184771.2A Pending CN112257431A (en) | 2020-10-30 | 2020-10-30 | NLP-based short text data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112257431A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334495A (en) * | 2018-01-30 | 2018-07-27 | 国家计算机网络与信息安全管理中心 | Short text similarity calculating method and system |
CN108628825A (en) * | 2018-04-10 | 2018-10-09 | 平安科技(深圳)有限公司 | Text message Similarity Match Method, device, computer equipment and storage medium |
CN108804595A (en) * | 2018-05-28 | 2018-11-13 | 中山大学 | A kind of short text representation method based on word2vec |
CN110362819A (en) * | 2019-06-14 | 2019-10-22 | 中电万维信息技术有限责任公司 | Text emotion analysis method based on convolutional neural networks |
CN110956033A (en) * | 2019-12-04 | 2020-04-03 | 北京中电普华信息技术有限公司 | Text similarity calculation method and device |
CN111488429A (en) * | 2020-03-19 | 2020-08-04 | 杭州叙简科技股份有限公司 | Short text clustering system based on search engine and short text clustering method thereof |
CN111523328A (en) * | 2020-04-13 | 2020-08-11 | 中博信息技术研究院有限公司 | Intelligent customer service semantic processing method |
- 2020-10-30: CN application CN202011184771.2A filed; publication CN112257431A, status pending
Non-Patent Citations (2)
Title |
---|
MOLEARNER: "Text similarity analysis (based on jieba and gensim)", 《HTTPS://WWW.CNBLOGS.COM/WKSLEARNER/P/10505562.HTML》 * |
Pan Yongqing: "Similarity calculation methods (3): cosine similarity", 《HTTPS://BLOG.CSDN.NET/U014539465/ARTICLE/DETAILS/105353638》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Afzaal et al. | Tourism mobile app with aspect-based sentiment classification framework for tourist reviews | |
CN109165294B (en) | Short text classification method based on Bayesian classification | |
CN105843897B (en) | A kind of intelligent Answer System towards vertical field | |
CN111177591B (en) | Knowledge graph-based Web data optimization method for visual requirements | |
CN111143479A (en) | Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm | |
CN101079025B (en) | File correlation computing system and method | |
CN107203507B (en) | Feature vocabulary extracting method and device | |
CN102411621A (en) | Chinese query-oriented multi-document automatic abstracting method based on cloud model | |
CN103678287B (en) | A kind of method that keyword is unified | |
CN112364172A (en) | Method for constructing knowledge graph in government official document field | |
Ahlgren | Research on sentiment analysis: the first decade | |
CN115238071A (en) | Data standard generation method, storage medium and system based on similar clustering and data exploration | |
Bagalkotkar et al. | A novel technique for efficient text document summarization as a service | |
CN109614484A (en) | A kind of Text Clustering Method and its system based on classification effectiveness | |
CN109101488B (en) | Word semantic similarity calculation method based on known network | |
CN106021424B (en) | A kind of literature author's duplication of name detection method | |
CN113360647A (en) | 5G mobile service complaint source-tracing analysis method based on clustering | |
CN106503153B (en) | Computer text classification system | |
CN112052401A (en) | Recommendation method based on user comments | |
CN112257431A (en) | NLP-based short text data processing method | |
CN103729348B (en) | A kind of analysis method of sentence translation complexity | |
CN108628875B (en) | Text label extraction method and device and server | |
CN107357918B (en) | Text representation method based on graph | |
WO2021142968A1 (en) | Multilingual-oriented semantic similarity calculation method for general place names, and application thereof | |
CN102346777B (en) | A kind of method and apparatus that illustrative sentence retrieval result is ranked up |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | |
Application publication date: 20210122 |