
CN117951246A - New word discovery and application field prediction method and system for network technology - Google Patents

New word discovery and application field prediction method and system for network technology

Info

Publication number
CN117951246A
Authority
CN
China
Prior art keywords
network technology
new word
words
application field
technology new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410351116.3A
Other languages
Chinese (zh)
Other versions
CN117951246B (en
Inventor
丁建伟
李斌
李航
李欣泽
陈周国
王泽珺
王鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute
Priority to CN202410351116.3A
Publication of CN117951246A
Application granted
Publication of CN117951246B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a network technology new word discovery and application field prediction method and system. It relates to the field of natural language processing and is intended to improve the accuracy of network technology new word discovery and field prediction. The method comprises three parts. The first part preliminarily determines seed new words and their application fields by combining manual collection with similar words obtained from a GloVe word vector model. The second part collects and stores the most recently updated scientific text data from an external knowledge base. The third part combines multiple NLP models to determine network technology new words and predict the corresponding application fields. The invention mines the intrinsic features of network technology new words in depth and takes full account of the meaning they express within a sentence, improving the recall of new words while maintaining accuracy. It also uses the longest common substring to merge the application fields of new words, which further improves the accuracy of application field prediction.

Description

New word discovery and application field prediction method and system for network technology
Technical Field
The invention relates to the field of natural language processing, in particular to a network technology new word discovery and application field prediction system based on multiple models.
Background
With the rapid development of the internet, the network security situation is increasingly complex. Discovering new network technology terms (referred to as network technology new words) in time and predicting their applications makes it possible to give early warning of network attacks, illegal transactions and the like, and to maintain the security of the network environment.
Network technology new words appear frequently, particularly in the current era of big data and large models. Finding them manually is time-consuming and labor-intensive and has a high miss rate, so most people only learn of a word once it is already in wide use. Machine learning and natural language processing techniques are now widely applied to the discovery of network technology new words. One popular approach judges candidates by word frequency, for example the statistics- and similarity-based network new word discovery method and system disclosed in Chinese patent publication CN113033183A. However, a word cannot be detected when it has only just appeared and its frequency is still low, and the resulting delay hampers early warning of network attacks, illegal transactions and information hazards. Another approach clusters vocabulary by semantic similarity, for example the network new word discovery method based on sentence semantic similarity disclosed in Chinese patent publication CN117574886A. However, semantic similarity comparison depends heavily on the richness of the word segmentation rules, the context and the standard corpus. The ways in which new words can be found are limited, and the features of the network new words themselves are instead easily overlooked, so the network new words finally found tend to deviate from their actual application fields.
Disclosure of Invention
The aim of the invention is as follows: in view of all or part of the above problems, a network technology new word discovery and application field prediction system is provided that overcomes the limitations of the prior art in discovering network technology new words, mines the deep features of such words, and discovers them and predicts their application fields more accurately.
The technical scheme adopted by the invention is as follows:
A network technology new word discovery and application field prediction method comprises the following steps:
determining a first number of seed new words; expanding the number of seed new words to a second number from the collected corpus using a GloVe word vector model; labeling the application field of each seed new word and storing the seed new words;
collecting the latest updated scientific text data from an external knowledge base;
updating a first keyword weight dictionary of the KeyBERT model and a second keyword weight dictionary of the LAC model using each piece of scientific text data;
sequentially comparing the key-value pairs in the first keyword weight dictionary and the second keyword weight dictionary, and updating a third keyword weight dictionary with the corresponding key and value when the edit distance between the two compared keys is within a first threshold;
selecting, from the third keyword weight dictionary, the keys whose values satisfy a first condition as the final network technology new words;
performing semantic analysis on the scientific text data corresponding to each final network technology new word to predict its application field;
and associating the application field of the final network technology new word with the application field of the stored network technology new words.
Further, associating the application field of the final network technology new word with the application field of the stored network technology new words includes:
determining, according to the difference between the final network technology new word and a stored network technology new word, whether the two are to be associated with the same application field.
Further, the longest common substring length between the final network technology new word and the stored network technology new word is calculated using a longest common substring algorithm, and the two words are associated with the same application field when that length is greater than zero.
Further, the collecting the latest updated scientific text data from the external knowledge base includes:
collecting latest updated scientific text information and scientific image information from an external knowledge base;
And extracting text information in the scientific image information, and combining the text information with the scientific text information to obtain scientific text data.
Further, updating the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model using each piece of scientific text data includes:
initializing the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model;
for each piece of scientific text data, extracting unigram keywords and bigram keywords with their corresponding weights using the KeyBERT model, and updating the first keyword weight dictionary with the keywords as keys and the sum of the weights as values; and extracting unigram keywords and bigram keywords with their corresponding weights using the LAC model, and updating the second keyword weight dictionary with the keywords as keys and the sum of the weights as values.
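The accumulation step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the extractor output is hard-coded hypothetical data standing in for what KeyBERT or LAC might return, the helper name `update_weight_dict` is invented here, and the sketch assumes that "the sum of the weights" means adding every weight a keyword receives across both keyword types and all texts.

```python
from typing import Dict, Iterable, List, Tuple

Keyword = Tuple[str, float]  # (keyword, weight), as an extractor might return it


def update_weight_dict(weight_dict: Dict[str, float],
                       unigrams: Iterable[Keyword],
                       bigrams: Iterable[Keyword]) -> None:
    """Add the unigram and bigram weights of one text to a running dictionary."""
    for keyword, weight in list(unigrams) + list(bigrams):
        # Assumption: weights for a repeated keyword are summed.
        weight_dict[keyword] = weight_dict.get(keyword, 0.0) + weight


# Hypothetical extractor output for two pieces of text (not real model output).
texts_keywords: List[Tuple[List[Keyword], List[Keyword]]] = [
    ([("deepfake", 0.5)], [("zero trust", 0.25)]),
    ([("deepfake", 0.25)], [("prompt injection", 0.5)]),
]

first_dict: Dict[str, float] = {}
for unigrams, bigrams in texts_keywords:
    update_weight_dict(first_dict, unigrams, bigrams)

print(first_dict)
```

The same helper could be reused for the second dictionary by feeding it the LAC-side keywords instead.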
Further, sequentially comparing the key-value pairs in the first keyword weight dictionary and the second keyword weight dictionary, and updating the third keyword weight dictionary with the corresponding key and value when the edit distance between the two compared keys is within the first threshold, includes:
extracting in turn a first key-value pair {key1: value1} from the first keyword weight dictionary and a second key-value pair {key2: value2} from the second keyword weight dictionary, and, when key1 = key2 or the edit distance between key1 and key2 is not more than 1, updating the third keyword weight dictionary as follows:
taking key2 as the key and value1 + lg(value2) as the value, or taking key1 as the key and value2 + lg(value1) as the value.
Further, selecting, from the third keyword weight dictionary, the keys whose values satisfy the first condition as the final network technology new words includes:
sorting the key-value pairs in the third keyword weight dictionary by weight, from small to large;
selecting the first third number of key-value pairs;
normalizing the weight of each selected key-value pair using the Min-Max normalization algorithm;
and selecting the keys whose normalized weights reach a second threshold as the final network technology new words.
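A small sketch of this selection step is given below, under stated assumptions. The function name and sample weights are hypothetical. The defaults of 10 candidates and a 0.5 threshold are the values given in the embodiment later in the description. The text states the sort direction as "from small to large" while the embodiment calls it "reverse order", so the direction is exposed as a parameter; the sketch defaults to descending on the assumption that the highest-weight keys are the ones meant to be kept.

```python
from typing import Dict, List


def select_new_words(weights: Dict[str, float], top_n: int = 10,
                     threshold: float = 0.5, descending: bool = True) -> List[str]:
    """Keep the leading top_n keys, Min-Max normalize, and filter by threshold."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=descending)[:top_n]
    if not ranked:
        return []
    values = [w for _, w in ranked]
    low, high = min(values), max(values)
    span = high - low
    selected = []
    for key, weight in ranked:
        # Min-Max normalization to [0, 1]; if all weights are equal, treat them as 1.0.
        normalized = (weight - low) / span if span > 0 else 1.0
        if normalized >= threshold:
            selected.append(key)
    return selected


# Hypothetical merged weights (not real model output).
third_dict = {"deepfake": 4.0, "zero trust": 3.0, "botnet": 2.0, "firewall": 0.0}
new_words = select_new_words(third_dict, top_n=3)
print(new_words)
```

Note that normalization is applied only to the retained candidates, so the threshold is relative to the spread within that group rather than to the whole dictionary.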
Further, determining a first number of seed new words and expanding the number of seed new words to a second number from the collected corpus using a GloVe word vector model includes:
setting a first number of seed new words;
repeating the following steps until the number of seed new words reaches the second number:
collecting corpus texts, each of which contains at least one seed new word;
training a GloVe word vector model on the collected corpus;
obtaining similar words of the seed new words using the trained GloVe word vector model;
and calculating the similarity between the similar words and the seed new words, and screening the similar words against a set third threshold, the retained similar words being used to expand the number of seed new words.
The invention also provides a network technology new word discovery and application field prediction system, which comprises a processor, wherein the processor is configured to execute the network technology new word discovery and application field prediction method.
The invention also provides another network technology new word discovery and application field prediction system, which comprises a computer readable storage medium, wherein the computer readable storage medium is stored with a computer program, and the computer program is operated to execute the network technology new word discovery and application field prediction method.
In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:
The invention provides a method for automatically discovering network technology new words by combining a GloVe word vector model, the KeyBERT model, the LAC model, edit distance and other techniques. It mines the intrinsic features of network technology new words in depth, takes full account of the meaning they express within a sentence, splits vocabulary along multiple dimensions, and can improve the recall of new words while maintaining accuracy. In addition, the invention uses the longest common substring to merge the application fields of new words, which avoids redundant application field categories and improves semantic understanding. By predicting the application fields of the new technologies that correspond to network technology new words, it can provide strong support for network security early warning and for responding to network attacks, illegal transactions and the improper spread of technology.
Drawings
The invention will now be described by way of example and with reference to the accompanying drawings in which:
Fig. 1 is one embodiment of seed new word discovery and expansion.
Fig. 2 is one embodiment of scientific text data collection.
Fig. 3 is one embodiment of updating the first, second and third keyword weight dictionaries.
Fig. 4 is one embodiment of final network technology new word discovery and application field determination.
Fig. 5 is an architecture diagram of the initial network technology new word iterative expansion module.
Fig. 6 is an architecture diagram of the science and technology website data acquisition module.
Fig. 7 is an architecture diagram of the network technology new word discovery and application field prediction module.
Detailed Description
All of the features disclosed in this specification, or all of the steps in a method or process disclosed, may be combined in any combination, except for mutually exclusive features and/or steps.
Any feature disclosed in this specification (including any accompanying claims, abstract) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. That is, each feature is one example only of a generic series of equivalent or similar features, unless expressly stated otherwise.
Example 1
A network technology new word discovery and application field prediction method uses multiple models to mine keyword features, so that network technology new words can be discovered more accurately. The method comprises three parts. The first part preliminarily determines seed new words and their application fields by combining manual collection with similar words obtained from a GloVe word vector model. The second part collects the latest updated scientific text data from external knowledge bases such as science and technology news sites, paper repositories, WeChat science and technology official accounts, Weibo, and the Zhihu science and technology section. The third part fuses multiple NLP (Natural Language Processing) models to determine network technology new words and predict the corresponding application fields.
1. First part
In this part, a first number of seed new words is determined; the number of seed new words is expanded to a second number from the collected corpus using a GloVe word vector model; and the application field of each seed new word is labeled.
This part is done mainly by hand. New technologies that have been popular on the major network platforms over the last month (or another time period), together with their application fields, are collected and summarized, and a representative first number (5 in this example) of network technology new words are selected as seed new words. The GloVe word vector model is then used repeatedly to find similar words of the seed new words in the collected corpus, and the most similar network technology new words (for example 15) are selected to expand the number of seed new words to the set second number. The application field of each seed new word is set by manual labeling. Note that the second number may comprise the 5 initially determined new words plus the 15 expanded similar words, or only the 15 expanded network technology new words may be retained as seed new words. Finally, the seed new words and their corresponding application fields are stored in a MySQL database table.
As shown in fig. 1, the first portion, in some embodiments, comprises the steps of:
Step 1: determining the initial seed new words: manually setting the 5 currently most popular network technology new words as the initial seed new words;
Step 2: manually collecting text corpus for the seed new words obtained so far, each corpus text containing at least one seed new word;
Step 3: training a GloVe word vector model on the corpus obtained in step 2;
Step 4: obtaining similar words of the seed new words using the GloVe word vector model trained in step 3;
Step 5: calculating the similarity between the similar words obtained in step 4 and the seed new words, screening against a preset third threshold, selecting the similar words whose similarity reaches the third threshold to expand the number of seed new words, and confirming the result manually;
Step 6: if the number after expansion is not 20, returning to step 2; otherwise, manually determining the application fields of the seed new words;
Step 7: merging the seed new words with their corresponding application fields and storing them in a MySQL database table.
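The screening in step 5 can be illustrated with a short sketch. It is a simplified stand-in rather than the patented procedure: the two-dimensional vectors are hypothetical toy values in place of trained GloVe embeddings, cosine similarity is an assumed choice of similarity measure, and the function names, threshold and target count are invented for illustration. It performs a single pass, whereas the steps above loop back to corpus collection and retraining until the target count is reached.

```python
import math
from typing import Dict, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seeds: Set[str], vectors: Dict[str, Sequence[float]],
                 threshold: float, target: int) -> Set[str]:
    """Add the most similar non-seed words that reach the threshold, up to target."""
    expanded = set(seeds)
    scored = []
    for word, vec in vectors.items():
        if word in expanded:
            continue
        # Score a candidate by its best similarity to any seed word.
        best = max((cosine(vec, vectors[s]) for s in seeds if s in vectors),
                   default=0.0)
        if best >= threshold:
            scored.append((best, word))
    for _, word in sorted(scored, reverse=True):
        if len(expanded) >= target:
            break
        expanded.add(word)
    return expanded


# Hypothetical toy embeddings standing in for trained GloVe vectors.
toy_vectors = {
    "deepfake": [1.0, 0.0],
    "faceswap": [0.9, 0.1],
    "voiceclone": [0.8, 0.6],
    "firewall": [0.0, 1.0],
}
result = expand_seeds({"deepfake"}, toy_vectors, threshold=0.9, target=3)
print(sorted(result))
```

In the method as described, the candidates that pass the threshold are still confirmed manually before they become seed new words.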
Packaging the first part into the prediction system yields an initial network technology new word iterative expansion module. That is, a network technology new word discovery and application field prediction system can be designed that includes this module, the module being configured to execute the steps of the first part. The module architecture is shown in fig. 5.
2. Second part
This part collects the latest updated scientific text data from the external knowledge base.
This part uses general data acquisition techniques to collect structured data from external knowledge bases such as science and technology news sites, paper repositories, WeChat science and technology official accounts, Weibo, and the Zhihu science and technology section, mainly the titles and contents of the corresponding articles. In addition, for content that includes pictures or videos, this part downloads the pictures and videos and stores them in MinIO, extracts the text they contain using image and video text extraction algorithms, merges that text with the article title and content, and stores the result in the Hive database.
As shown in fig. 2, in some embodiments, the portion includes the steps of:
Step 1: text data acquisition: collecting recently updated scientific text information (for example, updated within the last month) from an external knowledge base (for example, a science and technology website);
Step 2: for websites that carry scientific image information such as pictures and videos, also collecting the corresponding pictures and videos;
Step 3: extracting the text information in the pictures or videos using image processing techniques, and storing the pictures and videos in MinIO;
Step 4: merging the text information from step 1 and step 3 and storing it in Hive.
The second part is packaged into a prediction system, so that a scientific and technical website data acquisition module can be obtained, wherein the module is configured to execute the steps of the second part, and the module architecture is shown in fig. 6.
3. Third part
This part updates a first keyword weight dictionary of the KeyBERT model and a second keyword weight dictionary of the LAC model using each piece of scientific text data; sequentially compares the key-value pairs in the first and second keyword weight dictionaries, and updates the third keyword weight dictionary with the corresponding key and value when the edit distance between the two compared keys is within a first threshold; selects, from the third keyword weight dictionary, the keys whose values satisfy a first condition as the final network technology new words; and performs semantic analysis on the scientific text data corresponding to each final network technology new word to predict its application field. In addition, the application field of the final network technology new word is associated with the application field of the stored network technology new words; for example, whether the two are associated with the same application field can be determined according to the difference between the final network technology new word and a stored network technology new word. The stored network technology new words include the seed new words from the first part and, more specifically, the network technology new words previously stored in the database table together with those added to it over time. For example, if the corpus currently used for discovery covers the last month, the network technology new words found in earlier corpora, and their application fields, were already stored in the database table more than a month ago.
For the data stored in Hive, this part extracts network technology new words and application fields from each record newly collected in the past day (or another time period). Specifically, it extracts candidate network technology new words with the KeyBERT model and the LAC model, each with a weight, and combines the weights of identical words, or of words whose edit distance is not more than 1, to determine the new words common to both models and their weights. It then sorts the words by weight in reverse order and selects the first third number (10 in this example), calculates the normalized weight of each selected word using the Min-Max normalization algorithm, and finally keeps the words whose normalized weight is at least 0.5 as the final network technology new words. For application field prediction, the system gathers the scientific text data containing the determined network technology new words and obtains the related application fields by semantic analysis. In addition, the third part may use a longest common substring algorithm to obtain the longest common substring length between a final network technology new word and a stored network technology new word. If that length is greater than 0, the final network technology new word is considered similar to the stored one, and the application fields of the two words may be associated with a unified application field, for example by merging them (taking the union of the two application fields) as the application field common to both. Finally, the final network technology new words and their application fields are stored in a MySQL database table.
As shown in fig. 3, in some embodiments, the third portion includes the steps of:
Step 1: reading one day (or another time period) of scientific text data from Hive and storing it, in record order, in a list L of strings to be extracted;
Step 2: initializing the KeyBERT model and the LAC model, and the keyword weight dictionaries k_subject (the first keyword weight dictionary), l_subject (the second keyword weight dictionary) and c_subject (the combined, third keyword weight dictionary);
Step 3: reading in turn each piece of scientific text data in the list L from step 1, and for each piece of scientific text data:
1) extracting unigram keywords and bigram keywords with their corresponding weights using the KeyBERT model, and updating the keyword weight dictionary k_subject with key-value pairs whose key is the keyword and whose value is the sum of the weights of the two types of keywords;
2) similarly, extracting unigram keywords and bigram keywords with their corresponding weights using the LAC model, and updating the keyword weight dictionary l_subject with key-value pairs whose key is the keyword and whose value is the sum of the weights of the two types of keywords.
Step 4: extracting in turn a key-value pair {key1: value1} from k_subject and a key-value pair {key2: value2} from l_subject, where key1 and key2 are keys and value1 and value2 are values; comparing the keys of the two extracted pairs and judging whether key1 equals key2 or whether the Levenshtein edit distance between them is not more than 1;
Step 5: if the condition in step 4 is satisfied, updating the keyword weight dictionary c_subject with a key-value pair that takes key2 as the key and value1 + lg(value2) as the value, or takes key1 as the key and value2 + lg(value1) as the value;
Step 6: returning to step 4 until all keys of the keyword weight dictionaries k_subject and l_subject have been compared, at which point the update of the keyword weight dictionary c_subject is complete.
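Steps 4 to 6 can be sketched as below. The dictionary names follow the text, but the function names and sample weights are hypothetical, and the sketch makes three assumptions that the text leaves open: it uses the first of the two stated variants (key2 as key, value1 + lg(value2) as value), it skips non-positive weights because lg is undefined for them, and a later match for the same key simply overwrites an earlier one.

```python
import math
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def merge_dicts(k_subject: Dict[str, float], l_subject: Dict[str, float],
                max_distance: int = 1) -> Dict[str, float]:
    """Build c_subject from keys that are equal or within max_distance edits."""
    c_subject: Dict[str, float] = {}
    for key1, value1 in k_subject.items():
        for key2, value2 in l_subject.items():
            if value2 <= 0:
                continue  # assumption: lg is undefined here, so skip the pair
            if key1 == key2 or levenshtein(key1, key2) <= max_distance:
                # Variant used here: key2 as key, value1 + lg(value2) as value.
                c_subject[key2] = value1 + math.log10(value2)
    return c_subject


# Hypothetical accumulated weights (not real model output).
k_subject = {"deepfake": 2.0, "botnet": 1.0}
l_subject = {"deepfakes": 10.0, "ransomware": 100.0}
c_subject = merge_dicts(k_subject, l_subject)
print(c_subject)
```

The pairwise comparison is quadratic in the number of keys, which is likely acceptable for one day of text but may need indexing for larger dictionaries.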
As shown in fig. 4, further includes:
step 7: sorting each key value pair in the updated keyword weight dictionary c_subject in the step 6 according to the weight reverse order (namely from small to large), selecting a first third number of key value pairs, and carrying out standardization of the [0,1] intervals on the weight values by using a Max-Min standardization algorithm;
step 8: and (3) selecting the key with the standardized weight value more than or equal to 0.5 (namely the second threshold value) in the step (7) (namely the key forming a key value pair with the standardized weight value) as a new word of the final network technology.
Step 9: Using each final network technology new word generated in step 8 (denoted kd1) as a reference, find its corresponding application field in the list L of character strings to be extracted from step 1 by means of a semantic analysis technique;
Step 10: Read the stored network technology new words (denoted kd2) and their application fields from the MySQL database table;
Step 11: Compare the network technology new words kd1 and kd2 from step 9 and step 10 pairwise, and obtain the maximum common substring length len of each pair using the maximum substring algorithm find_lcs_substr;
Step 12: If the maximum common substring length len calculated in step 11 is greater than 0, update the application fields of both kd1 and kd2 to the union of their application fields, and write the updated data to the MySQL database table;
Step 13: If the maximum common substring length len calculated in step 11 equals 0, add the final network technology new word and its application field directly to the MySQL database table;
Step 14: Return to step 11 until all final network technology new words and their application fields have been added to the MySQL database table.
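The association logic of steps 11 to 13 can be sketched as follows. This is a hedged illustration under stated assumptions: the name find_lcs_substr comes from the text, but its body here is an ordinary dynamic-programming longest-common-substring routine rather than the inventors' code; `merge_new_word` is a hypothetical helper; and an in-memory dict of sets stands in for the MySQL database table.

```python
def find_lcs_substr(a, b):
    """Length of the longest common contiguous substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def merge_new_word(table, kd1, fields1):
    """Apply steps 11 to 13 for one final new word kd1.

    table maps each stored new word kd2 to its set of application fields
    (a stand-in for the MySQL table, which is an assumption of this sketch).
    """
    kd1_fields = set(fields1)
    for kd2, fields2 in list(table.items()):
        if find_lcs_substr(kd1, kd2) > 0:
            # Step 12: both words take the union of the two field sets.
            union = set(fields1) | fields2
            table[kd2] = union
            kd1_fields |= union
    # Step 13 (no match) and the kd1 half of step 12 both end with kd1 stored.
    table[kd1] = kd1_fields
    return table


# Hypothetical stored data for illustration only.
table = {"sdn": {"networking"}, "mimic": {"defense"}}
merge_new_word(table, "sdwan", {"enterprise wan"})
print(table)
```

Note that a threshold of len > 0 means a single shared character is enough to associate two words, so in this sketch "sdwan" merges with "sdn" but not with "mimic", which shares no character with it.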
The third part is packaged as a module, yielding the network technology new word discovery and application field prediction module, which is configured to execute the steps of the third part; the module architecture is shown in fig. 7.
Finally, the three modules (namely the initial network technology new word iteration expansion module, the scientific and technical website data acquisition module, and the network technology new word discovery and application field prediction module) are integrated to obtain the network technology new word discovery and application field prediction system.
Example 2
This embodiment describes the network technology new word discovery and application field prediction system from the standpoint of its design principle. The system includes a processor configured to perform the method of embodiment 1. Alternatively, the system includes a computer-readable storage medium storing a computer program that, when run, performs the method of embodiment 1. In the latter case, a processor must also be provided to read the computer program from the computer-readable storage medium and execute it.
The invention is not limited to the specific embodiments described above. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification, as well as to any novel one, or any novel combination, of the steps of the method or process disclosed.

Claims (10)

1. A network technology new word discovery and application field prediction method, characterized by comprising the following steps:
determining a first number of seed new words; expanding the number of seed new words to a second number from the collected corpus using a GloVe word vector model; marking the application field of each new word and storing the new words;
collecting the latest updated scientific text data from an external knowledge base;
updating a first keyword weight dictionary of the KeyBERT model and a second keyword weight dictionary of the LAC model by using each piece of scientific text data;
sequentially comparing each key-value pair in the first keyword weight dictionary and the second keyword weight dictionary, and updating a third keyword weight dictionary with the corresponding key and its corresponding value when the edit distance between the two compared keys is within a first threshold;
selecting a key whose value satisfies a first condition from the third keyword weight dictionary as a final network technology new word;
performing semantic analysis on the scientific text data corresponding to the final network technology new word so as to predict the application field of the final network technology new word;
and associating the application field of the final network technology new word with the application field of the stored network technology new word.
2. The network technology new word discovery and application field prediction method according to claim 1, wherein the associating the application field of the final network technology new word with the application field of the stored network technology new word includes:
determining whether the final network technology new word and the stored network technology new word are associated with the same application field according to the difference between the final network technology new word and the stored network technology new word.
3. The network technology new word discovery and application field prediction method according to claim 2, wherein a maximum common substring length between the final network technology new word and the stored network technology new word is calculated by using a maximum substring algorithm, and when the maximum common substring length is greater than zero, the final network technology new word and the stored network technology new word are associated to the same application field.
4. The network technology new word discovery and application field prediction method according to claim 1, wherein the collecting the latest updated scientific text data from the external knowledge base comprises:
collecting latest updated scientific text information and scientific image information from an external knowledge base;
and extracting the text information contained in the scientific image information, and combining it with the scientific text information to obtain the scientific text data.
5. The network technology new word discovery and application field prediction method according to claim 1, wherein updating the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model with each piece of scientific text data comprises:
initializing the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model;
for each piece of scientific text data, extracting unigram keywords, bigram keywords and their corresponding weights using the KeyBERT model, and updating the first keyword weight dictionary with the keywords as keys and the sums of the weights as values; and extracting unigram keywords, bigram keywords and their corresponding weights using the LAC model, and updating the second keyword weight dictionary with the keywords as keys and the sums of the weights as values.
6. The network technology new word discovery and application field prediction method according to claim 5, wherein the sequentially comparing each key value pair in the first keyword weight dictionary and the second keyword weight dictionary, and updating the third keyword weight dictionary with the corresponding key and its corresponding value when the edit distance of the two keys compared is within a first threshold value, comprises:
sequentially extracting a first key-value pair {key1: value1} from the first keyword weight dictionary and a second key-value pair {key2: value2} from the second keyword weight dictionary, and, when key1 = key2 or the edit distance between key1 and key2 is not more than 1, updating the third keyword weight dictionary as follows:
updating the third keyword weight dictionary with key2 as the key and value1 + lg(value2) as the value, or with key1 as the key and value2 + lg(value1) as the value.
7. The network technology new word discovery and application field prediction method according to claim 1, wherein selecting a key having a value reaching a first condition from the third keyword weight dictionary as a final network technology new word includes:
Sorting each key value pair in the third keyword weight dictionary according to the order of the weights from small to large;
screening out the first number of key value pairs;
standardizing the weight of each key-value pair using a Max-Min standardization algorithm;
and screening out keys with standardized weights reaching a second threshold value as final network technology new words.
8. The network technology new word discovery and application field prediction method of claim 1, wherein the determining a first number of seed new words and expanding the number of seed new words to a second number from the collected corpus using a GloVe word vector model comprises:
setting a first number of seed new words;
repeating the following steps until the number of seed new words reaches the second number:
collecting corpora, wherein each corpus contains at least one seed new word;
training the GloVe word vector model using the collected corpora;
obtaining similar words of the seed new words using the trained GloVe word vector model;
and calculating the similarity between the similar words and the seed new words, and screening the obtained similar words according to a set third threshold, so that the screened similar words expand the number of seed new words.
9. A network technology new word discovery and application field prediction system, comprising a processor configured to perform the network technology new word discovery and application field prediction method according to any one of claims 1 to 8.
10. A network technology new word discovery and application field prediction system comprising a computer readable storage medium having a computer program stored therein, characterized in that the computer program is run to perform the network technology new word discovery and application field prediction method according to any one of claims 1 to 8.
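The dictionary merge recited in claim 6 can be sketched as follows. This is an illustrative reading, not the claimed implementation. The dictionary names k_subject, l_subject and c_subject follow the description; `edit_distance` and `merge_dictionaries` are hypothetical helpers; lg is read as the base-10 logarithm; the sketch uses only the first of the two update forms in the claim (key2 with value1 + lg(value2)); and a later match overwrites an earlier entry under the same key. All of these are assumptions.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def merge_dictionaries(k_subject, l_subject, max_distance=1):
    """Build c_subject from keys that are equal or within max_distance edits."""
    c_subject = {}
    for key1, value1 in k_subject.items():
        for key2, value2 in l_subject.items():
            if value2 <= 0:
                continue  # lg is undefined here; skipping is an assumption
            if edit_distance(key1, key2) <= max_distance:
                # First update form of the claim: key2 -> value1 + lg(value2).
                c_subject[key2] = value1 + math.log10(value2)
    return c_subject


# Hypothetical weights for illustration only.
k_subject = {"zero trust": 2.0, "edge ai": 1.5}
l_subject = {"zero trust": 100.0, "edge al": 10.0, "blockchain": 5.0}
c_subject = merge_dictionaries(k_subject, l_subject)
print(c_subject)
```

Here "zero trust" matches exactly, "edge ai" and "edge al" differ by one substitution and so also match, and "blockchain" has no counterpart and is left out of c_subject.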
CN202410351116.3A 2024-03-26 2024-03-26 New word discovery and application field prediction method and system for network technology Active CN117951246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410351116.3A CN117951246B (en) 2024-03-26 2024-03-26 New word discovery and application field prediction method and system for network technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410351116.3A CN117951246B (en) 2024-03-26 2024-03-26 New word discovery and application field prediction method and system for network technology

Publications (2)

Publication Number Publication Date
CN117951246A (en) 2024-04-30
CN117951246B (en) 2024-05-28

Family

ID=90793052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410351116.3A Active CN117951246B (en) 2024-03-26 2024-03-26 New word discovery and application field prediction method and system for network technology

Country Status (1)

Country Link
CN (1) CN117951246B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004070636A (en) * 2002-08-06 2004-03-04 Mitsubishi Electric Corp Concept searching device
CN107305539A (en) * 2016-04-18 2017-10-31 南京理工大学 A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN110413997A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 New word discovery method, system and readable storage medium for power industry
CN111160017A (en) * 2019-12-12 2020-05-15 北京文思海辉金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method
CN111538893A (en) * 2020-04-29 2020-08-14 四川大学 Method for extracting network security new words from unstructured data
CN112883721A (en) * 2021-01-14 2021-06-01 科技日报社 Method and device for recognizing new words based on BERT pre-training model
CN116595970A (en) * 2023-03-15 2023-08-15 网易(杭州)网络有限公司 Sentence synonymous rewriting method and device and electronic equipment
CN117151089A (en) * 2022-05-19 2023-12-01 腾讯科技(深圳)有限公司 New word discovery method, device, equipment and medium
KR20240017706A (en) * 2022-08-01 2024-02-08 삼성전자주식회사 Electronic apparatus for identifying a newly coined word and control method thereof

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
BAPTISTE BLOUIN 等: "Unlocking Transitional Chinese: Word Segmentation in Modern Historical Texts", 《ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》, 31 December 2023 (2023-12-31), pages 92 *
DANG VAN THIN 等: "Two New Large Corpora for Vietnamese Aspect-based Sentiment Analysis at Sentence Level", 《ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING》, 26 May 2021 (2021-05-26), pages 1 - 22 *
VIJENDER SINGH 等: "Hybrid Approach To Unsupervised Keyphrase Extraction", 《PROCEDIA COMPUTER SCIENCE》, 31 December 2023 (2023-12-31), pages 1 - 14 *
刘凡平 等: "基于BERT的开放领域中文新词发现研究", 《计算机应用与软件》, 12 June 2023 (2023-06-12), pages 173 - 180 *
刘建舟 等: "基于语料库和网络的新词自动识别", 《计算机应用》, 28 July 2004 (2004-07-28), pages 132 - 134 *
姚奕 等: "联合知识图谱和预训练模型的中文关键词抽取方法", 《计算机科学》, 10 June 2022 (2022-06-10), pages 243 - 251 *
廖涛 等: "丰富语义信息的BERT-CRNN突发事件要素识别", 《阜阳师范大学学报(自然科学版)》, 15 March 2023 (2023-03-15), pages 42 - 48 *
柳文婷: "基于情感新词识别的微博文本情感倾向分析研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 February 2022 (2022-02-15), pages 138 - 1455 *
莫超: "基于情感分析的古籍短视频推广研究——以抖音为例", 《中国优秀硕士学位论文全文数据库信息科技辑》, 15 October 2022 (2022-10-15), pages 141 - 118 *

Also Published As

Publication number Publication date
CN117951246B (en) 2024-05-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant