
CN112257431A - NLP-based short text data processing method - Google Patents

Info

Publication number: CN112257431A
Authority: CN (China)
Prior art keywords: text data; short text; cosine distance; word; cosine
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202011184771.2A
Other languages: Chinese (zh)
Inventors: 魏建军, 刘磊, 郭真, 王富
Current Assignee: China Telecom Wanwei Information Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: China Telecom Wanwei Information Technology Co Ltd
Priority date: 2020-10-30 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-10-30
Publication date: 2021-01-22
Application filed by China Telecom Wanwei Information Technology Co Ltd
Priority to CN202011184771.2A
Publication of CN112257431A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to an NLP-based short text data processing method. The method comprises the following steps: acquiring short text data, performing jieba word segmentation, removing stop words (stopwords), obtaining a bag of words, building a corpus, performing TF-IDF processing, calculating the cosine distance, and standardizing the short text data according to the calculated cosine distance. The method addresses the problems that manual processing of short text data is inefficient, inaccurate, and hard to apply to large data volumes, and therefore consumes a large amount of manpower and material resources.

Description

NLP-based short text data processing method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a short text data processing method based on NLP.
Background
With the rapid development of network information technology and the gradual shift from traditional paper records to digital information, more and more information, especially short text information, accumulates on the network. Most short text data is collected by information systems and stored in relational databases. Short text data often appears in many different forms that carry the same meaning. For example, an employment questionnaire contains a free-text (non-multiple-choice) item asking for the type of work the respondent is engaged in. When this item is to be analyzed statistically, the collected data takes many forms with the same meaning, such as: restaurant waiters, restaurants, food waiters, noodle broilers, cooks, pastry chefs, beef noodles, hotel waiters, cookers, hotels, restaurants, restaurant waiters, the food and beverage industry, hot pot restaurants, and the like, all of which refer to the food and beverage industry. For the statistics to be accurate, the short text data collected for the item must be processed before it is analyzed. The traditional approach relies mainly on manual processing, which has several disadvantages: first, it consumes a large amount of manpower and material resources; second, the results obtained are often inconsistent with the requirements. Inefficient manual processing faces growing difficulties and cannot keep up with ever-growing big data, so automated short text data processing has become a natural direction of development.
NLP has developed into a highly adaptable artificial intelligence technology, and it can therefore be applied effectively to the field of short text data processing.
Disclosure of Invention
The embodiment of the invention aims to provide an NLP-based short text data processing method that solves the problems that manual processing of short text data is inefficient, inaccurate, and hard to apply to large data volumes, and therefore consumes a large amount of manpower and material resources.
A method for NLP-based short text data processing, the method comprising the following steps:
acquiring short text data, performing jieba word segmentation, removing stop words (stopwords), obtaining a bag of words, building a corpus, performing TF-IDF processing, calculating the cosine distance, and standardizing the short text data according to the calculated cosine distance;
in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop word removal step, stop words are deleted from the segmented short text data using the NLTK tool;
in the bag-of-words step, each word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus building step, the word2vec tool in the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the corpus so built is used as the input to a Tfidf model for model training;
in the cosine distance calculation step, similarity is evaluated by calculating the cosine of the angle between two vectors. Cosine similarity places the vectors in a vector space according to their coordinate values. For example, in the most common two-dimensional space, if vectors a and b have coordinates (x1, y1) and (x2, y2), the corresponding cosine distance is:
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$
Generalizing to n dimensions, let vectors A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn):
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
The smaller the angle, the closer the cosine is to 1, the more closely the directions agree, and the more similar the two vectors are. The cosine distance therefore lies between 0 and 1 (for the non-negative TF-IDF weights used here), and the closer it is to 1, the more similar the two texts are.
In the step of standardizing the short text data according to the calculated cosine distance, the cosine distance is calculated, and when it is greater than 0.8 the short text data is replaced with the standard short text data; non-standard short text data is standardized in this way.
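As a quick arithmetic check of the two-dimensional case, take the made-up vectors a = (3, 4) and b = (4, 3); these values are illustrative only and do not come from the patent. The cosine of the angle between them is the dot product divided by the product of the two lengths:

```latex
\cos\theta
  = \frac{3 \cdot 4 + 4 \cdot 3}{\sqrt{3^2 + 4^2}\,\sqrt{4^2 + 3^2}}
  = \frac{24}{5 \cdot 5}
  = 0.96 > 0.8
```

Because 0.96 exceeds the 0.8 threshold, a short text with vector a would be replaced by a standard text with vector b under the rule described above.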
The invention has the beneficial effects that:
The invention discloses an NLP-based short text data processing method which, through the steps of jieba word segmentation, stop word (stopwords) removal, bag-of-words acquisition, corpus building, TF-IDF processing, cosine distance calculation, and standardization of the short text data according to the calculated cosine distance, can standardize the short text data collected in forms and thereby facilitates the application and statistical analysis of the data. Compared with the prior art, the method has the following advantage: it solves the problems that manual processing of short text data is inefficient, inaccurate, and hard to apply to large data volumes, and therefore consumes a large amount of manpower and material resources.
Drawings
Fig. 1 is a flowchart of the NLP-based short text data processing method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples, and for convenience of description, only parts related to the examples of the present invention are shown. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A method for NLP-based short text data processing, the method comprising the following steps:
acquiring short text data, performing jieba word segmentation, removing stop words (stopwords), obtaining a bag of words, building a corpus, performing TF-IDF processing, calculating the cosine distance, and standardizing the short text data according to the calculated cosine distance;
in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop word removal step, stop words are deleted from the segmented short text data using the NLTK tool;
in the bag-of-words step, each word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus building step, the word2vec tool in the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the corpus so built is used as the input to a Tfidf model for model training;
in the cosine distance calculation step, similarity is evaluated by calculating the cosine of the angle between two vectors. Cosine similarity places the vectors in a vector space according to their coordinate values. For example, in the most common two-dimensional space, if vectors a and b have coordinates (x1, y1) and (x2, y2), the corresponding cosine distance is:
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$
Generalizing to n dimensions, let vectors A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn):
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
The smaller the angle, the closer the cosine is to 1, the more closely the directions agree, and the more similar the two vectors are. The cosine distance therefore lies between 0 and 1 (for the non-negative TF-IDF weights used here), and the closer it is to 1, the more similar the two texts are.
In the step of standardizing the short text data according to the calculated cosine distance, the cosine distance is calculated, and when it is greater than 0.8 the short text data is replaced with the standard short text data; non-standard short text data is standardized in this way.
Examples
Fig. 1 is a flowchart of a method for processing NLP-based short text data according to an embodiment of the present invention, where the method includes:
S101, short text data acquisition:
The short text data in the service database is synchronized to a local TXT file through the DataX tool.
S102, jieba word segmentation:
The short text data acquired in step S101 is segmented line by line using jieba's accurate mode, which cuts each sentence as precisely as possible.
S103, stop word (stopwords) removal:
Using NLTK, our accumulated list of Chinese stop words is loaded and the stop words contained in the words segmented by jieba in step S102 are deleted.
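The filtering in step S103 can be sketched in plain Python. This is a minimal stand-in rather than the NLTK-based implementation the patent describes: the stop word set, the sample tokens, and the function name `remove_stopwords` are all hypothetical, and the input is assumed to be already segmented.

```python
# Hypothetical stop word set; the patent loads its own accumulated Chinese list.
STOPWORDS = {"的", "了", "在", "是"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are neither blank nor stop words, keeping order."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Tokens as they might come out of the segmentation step (illustrative only).
segmented = ["我", "在", "餐厅", "是", "服务员"]
cleaned = remove_stopwords(segmented)
print(cleaned)
```

A set is used for the stop word list so that each membership test is a constant-time lookup, which matters when many lines are processed.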
S104, acquiring a bag of words:
Using the gensim library, every word appearing in the corpus is assigned a unique integer id, for example: { 'restaurant': 0, 'fry': 1, 'pull': 2, 'open': 3, 'member': 4, 'restaurant': 5, 'noodle pulling': 6, 'hotel': 7, 'restaurant': 8, 'chafing dish': 9, 'dining': 10, 'service': 11 }.
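The id assignment in step S104 can be illustrated without gensim. The sketch below is a plain-Python approximation of what gensim's `corpora.Dictionary` exposes as `token2id`; the function name `build_token2id` and the sample documents are hypothetical, and ids are handed out in order of first appearance, which may not match the order gensim itself uses.

```python
def build_token2id(documents):
    """Assign consecutive integer ids to words in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Illustrative, already-segmented documents (not taken from the patent).
docs = [["restaurant", "waiter"], ["hot pot", "restaurant"], ["noodle", "cook"]]
token2id = build_token2id(docs)
print(token2id)
```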
S105, building the corpus:
Using word2vec in the gensim library, the number of occurrences of each distinct word is counted, each word is converted to its integer word id, and the result is returned as a sparse vector, for example: [ [ (0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1) ], [ (2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1) ] ]. The first element of each tuple is the id of a word in the dictionary, and the second element is the number of times that word occurs.
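The counting described in step S105 matches the behavior of gensim's `Dictionary.doc2bow`, so the sketch below borrows that name for a plain-Python version. It is an illustration under that assumption, not the patent's code; the dictionary contents and the sample tokens are made up, and words missing from the dictionary are simply skipped.

```python
from collections import Counter


def doc2bow(tokens, token2id):
    """Convert a token list into a sorted list of (word id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical dictionary and document.
token2id = {"restaurant": 0, "waiter": 1, "hot pot": 2}
bow = doc2bow(["waiter", "restaurant", "waiter", "chef"], token2id)
print(bow)  # [(0, 1), (1, 2)]
```

Only words that actually occur are stored, which is what makes the vector sparse: "hot pot" (id 2) does not appear in the output at all.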
S106, TF-IDF processing:
The corpus built in step S105 is used as the input to the Tfidf model for model training.
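To make the TF-IDF step concrete, here is a small plain-Python sketch of one common weighting: the raw count multiplied by log2(N / df), followed by L2 normalization of each document. The patent only says that a Tfidf model is trained, so the exact formula is an assumption, the function name `tfidf_transform` and the sample corpus are hypothetical, and gensim's `models.TfidfModel` may differ in detail.

```python
import math


def tfidf_transform(corpus):
    """Weight (id, count) pairs by log2(N / df), then L2-normalize each document."""
    n_docs = len(corpus)
    df = {}  # document frequency of each word id
    for doc in corpus:
        for word_id, _ in doc:
            df[word_id] = df.get(word_id, 0) + 1
    result = []
    for doc in corpus:
        weighted = [(i, c * math.log2(n_docs / df[i])) for i, c in doc]
        # A word that occurs in every document gets weight 0 and is dropped.
        weighted = [(i, w) for i, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        result.append([(i, w / norm) for i, w in weighted] if norm else [])
    return result


# Hypothetical bag-of-words corpus: word 0 occurs in all three documents.
corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (1, 1), (3, 1)]]
tfidf = tfidf_transform(corpus)
for doc in tfidf:
    print(doc)
```

In this sample, word 0 carries no weight because it appears everywhere, while the rare word 3 dominates the third document.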
S107, calculating the cosine distance:
The cosine distance is calculated by directly calling the class in the gensim library that computes the cosine distance over a sparse matrix, with the data returned in step S106 as the input to this step.
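The class mentioned in step S107 may be gensim's `similarities.SparseMatrixSimilarity`, although the text does not name it. The plain-Python sketch below shows the underlying calculation on two sparse vectors given as (id, weight) lists; the function name `cosine_similarity` is hypothetical, and returning 0.0 for an empty vector is a convention chosen here, not something the patent specifies.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors of (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty vectors (an assumption)
    return dot / (norm_a * norm_b)


# Illustrative vectors: identical direction, then no shared words.
print(cosine_similarity([(0, 1.0), (1, 1.0)], [(0, 2.0), (1, 2.0)]))
print(cosine_similarity([(0, 1.0)], [(1, 1.0)]))
```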
S108, standardizing the short text data according to the calculated cosine distance:
The similarity of the short texts is judged from the cosine values calculated in step S107. When the cosine distance is greater than 0.8, the short text data is replaced with the standard short text data; non-standard short text data can be standardized in this way.
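The replacement rule in step S108 can be sketched as a threshold check. Only the 0.8 cutoff comes from the patent; the dense vectors, the labels in `STANDARDS`, the function names, and the choice to pick the best-scoring standard entry when several exceed the cutoff are all assumptions made for illustration.

```python
import math

THRESHOLD = 0.8  # cutoff stated in the patent


def cosine(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def standardize(text_vec, standards, threshold=THRESHOLD):
    """Return the best standard label scoring above the threshold, else None."""
    best_label, best_score = None, threshold
    for label, vec in standards.items():
        score = cosine(text_vec, vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical standard entries and their vectors.
STANDARDS = {
    "catering industry": [1.0, 1.0, 0.0],
    "construction industry": [0.0, 0.0, 1.0],
}
print(standardize([1.0, 0.8, 0.1], STANDARDS))
print(standardize([1.0, 0.0, 1.0], STANDARDS))
```

A text that matches no standard entry closely enough is left unchanged, signalled here by `None`, so that it can be reviewed rather than forced into the wrong category.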
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (2)

1. A method for NLP-based short text data processing, the method comprising the following steps:
acquiring short text data, performing jieba word segmentation, removing stop words (stopwords), obtaining a bag of words, building a corpus, performing TF-IDF processing, calculating the cosine distance, and standardizing the short text data according to the calculated cosine distance;
in the short text data acquisition step, the short text data in the service database is synchronized to a local TXT file through the DataX tool;
in the jieba word segmentation step, the short text data synchronized to the local TXT file is segmented line by line with the jieba word segmentation tool;
in the stop word removal step, stop words are deleted from the segmented short text data using the NLTK tool;
in the bag-of-words step, each word appearing in the corpus is assigned a unique integer id using the gensim library;
in the corpus building step, the word2vec tool in the gensim library is used to count the occurrences of each distinct word, convert each word into its integer word id, and return the result as a sparse vector;
in the TF-IDF step, the corpus so built is used as the input to a Tfidf model for model training;
in the cosine distance calculation step, similarity is evaluated by calculating the cosine of the angle between two vectors; cosine similarity places the vectors in a vector space according to their coordinate values, and if vectors a and b have coordinates (x1, y1) and (x2, y2), respectively, the corresponding cosine distance is:
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$
generalizing to n dimensions, let vectors A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn):
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
the smaller the angle, the closer the cosine is to 1 and the more similar the two vectors are; the cosine distance lies between 0 and 1, and the closer it is to 1, the more similar the two are.
2. The method for NLP-based short text data processing as claimed in claim 1, characterized in that, in the step of standardizing the short text data according to the calculated cosine distance, the cosine distance is calculated, and when the cosine distance is greater than 0.8, the non-standard short text data is replaced with the standard short text data, whereby non-standard short text data is standardized.
CN202011184771.2A 2020-10-30 2020-10-30 NLP-based short text data processing method Pending CN112257431A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011184771.2A CN112257431A (en) 2020-10-30 2020-10-30 NLP-based short text data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011184771.2A CN112257431A (en) 2020-10-30 2020-10-30 NLP-based short text data processing method

Publications (1)

Publication Number Publication Date
CN112257431A true CN112257431A (en) 2021-01-22

Family

ID=74268854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011184771.2A Pending CN112257431A (en) 2020-10-30 2020-10-30 NLP-based short text data processing method

Country Status (1)

Country Link
CN (1) CN112257431A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334495A (en) * 2018-01-30 2018-07-27 国家计算机网络与信息安全管理中心 Short text similarity calculating method and system
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN108804595A (en) * 2018-05-28 2018-11-13 中山大学 A kind of short text representation method based on word2vec
CN110362819A (en) * 2019-06-14 2019-10-22 中电万维信息技术有限责任公司 Text emotion analysis method based on convolutional neural networks
CN110956033A (en) * 2019-12-04 2020-04-03 北京中电普华信息技术有限公司 Text similarity calculation method and device
CN111488429A (en) * 2020-03-19 2020-08-04 杭州叙简科技股份有限公司 Short text clustering system based on search engine and short text clustering method thereof
CN111523328A (en) * 2020-04-13 2020-08-11 中博信息技术研究院有限公司 Intelligent customer service semantic processing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOLEARNER: "Text similarity analysis (based on jieba and gensim)", 《HTTPS://WWW.CNBLOGS.COM/WKSLEARNER/P/10505562.HTML》 *
潘永青: "Similarity calculation methods (3): cosine similarity", 《HTTPS://BLOG.CSDN.NET/U014539465/ARTICLE/DETAILS/105353638》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210122