CN111488450A - Method and device for generating keyword library and electronic equipment - Google Patents
- Publication number
- CN111488450A (CN202010272926.1A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- determining
- corpus
- matching template
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present disclosure disclose a method and an apparatus for generating a keyword library, and an electronic device. The method includes: determining a keyword matching template based on a preset first keyword; applying the keyword matching template to the obtained corpus to determine a second keyword; and generating a keyword library based on the first keyword and the second keyword. With a small number of preset first keywords, second keywords can be determined in a massive corpus through the corresponding keyword matching templates, so a large number of keyword matching templates need not be written manually, which improves the efficiency of constructing the keyword library.
Description
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method and an apparatus for generating a keyword library, and an electronic device.
Background
Data mining can reveal implicit, previously unknown and potentially valuable information from a large amount of data. When data mining is performed, a required keyword may be prepared first. That is, the required keywords need to be selected from the relevant data sources and integrated into a keyword library for mining new keywords.
Disclosure of Invention
This disclosure is provided to introduce concepts in a simplified form that are further described below in the detailed description. This disclosure is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The embodiments of the present disclosure provide a method and an apparatus for generating a keyword library, and an electronic device. With a small number of preset first keywords, second keywords can be determined in a massive corpus through the corresponding keyword matching templates, so a large number of keyword matching templates need not be written manually, which improves the efficiency of constructing the keyword library.
In a first aspect, an embodiment of the present disclosure provides a method for generating a keyword library, where the method includes: determining a keyword matching template based on a preset first keyword; applying the keyword matching template to the obtained corpus to determine a second keyword; and generating a keyword library based on the first keyword and the second keyword.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating a keyword library, where the apparatus includes: the first determining module is used for determining a keyword matching template based on a preset first keyword; the second determining module is used for applying the keyword matching template to the obtained corpus and determining a second keyword; and the generating module is used for generating a keyword library based on the first keyword and the second keyword.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method for generating a keyword library of the first aspect.
In a fourth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for generating a keyword library as described in the first aspect above.
According to the method, the apparatus, and the electronic device for generating a keyword library, a keyword matching template is first determined based on a preset first keyword, the keyword matching template is then applied to the obtained corpus to determine a second keyword, and finally a keyword library is generated based on the first keyword and the second keyword. With a small number of preset first keywords, second keywords can be determined in a massive corpus through the corresponding keyword matching templates, so a large number of keyword matching templates need not be written manually, which improves the efficiency of constructing the keyword library.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a flowchart of one embodiment of a method for generating a keyword library according to the present disclosure;
FIG. 2 is a schematic flowchart of another embodiment of a method for generating a keyword library according to the present disclosure;
FIG. 3 is a schematic structural diagram of one embodiment of an apparatus for generating a keyword library according to the present disclosure;
FIG. 4 is an exemplary system architecture to which the method for generating a keyword library of one embodiment of the present disclosure may be applied;
FIG. 5 is a schematic diagram of a basic structure of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict.
Referring to FIG. 1, a flowchart of one embodiment of a method for generating a keyword library according to the present disclosure is shown. As shown in FIG. 1, the method for generating a keyword library includes the following steps 101 to 103.
Step 101: determine a keyword matching template based on a preset first keyword.
The preset first keywords may be preset words that share a common attribute. For example, in the office software domain they may be skill words that characterize skills, such as making presentations or processing images. In some application scenarios, the object that a skill operates on may be used as the skill word. For example, for the skill of making presentations, "presentation" may be used as the skill word, and for the skill of processing images, "image" may be used as the skill word.
The preset first keywords may be obtained from a server by the device that performs the method for generating a keyword library. After the device obtains a certain number of preset first keywords from the server, they can be stored together in an initial lexicon. The number of preset first keywords may be, for example, 10 or 20; the specific number may be determined according to the actual situation and is not limited herein. The initial lexicon stores all preset first keywords available at the current time. When a preset first keyword is needed, it can be read directly from the initial lexicon.
The preset first keyword can be processed according to a preset rule to obtain the keyword matching template. The preset rule may be, for example, to search a plurality of corpora set in advance for corpora that contain any preset first keyword, and to determine a corresponding keyword matching template from the corpora found. For example, the preset corpora may be "skilled in using various office software", "mastered image processing technique", "having data development experience", and the like. If the corpus "mastered image processing technique", which contains the preset first keyword "image processing technique", is found among these corpora, the corresponding keyword matching template "mastered ***" can be generated. It should be noted that "***" in this disclosure is a placeholder for the characters of a keyword: a keyword matching template is obtained by combining the placeholder with specific literal characters. In addition, "***" does not stand for exactly 3 characters; it should be understood as standing for all of the key characters in that position, so 4, 5, or any other number of consecutive key characters may constitute the first keyword.
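The template-generation rule described above can be illustrated with a minimal Python sketch. It assumes that a template is simply the sentence with the keyword replaced by a placeholder; the names `PLACEHOLDER`, `make_template`, and `templates_from_seeds` are hypothetical and do not come from the disclosure.

```python
from typing import List, Optional

PLACEHOLDER = "***"  # assumed notation for the keyword slot in a template


def make_template(sentence: str, keyword: str) -> Optional[str]:
    """Replace the first occurrence of `keyword` with the placeholder.

    Returns None when the sentence does not contain the keyword.
    """
    if keyword not in sentence:
        return None
    return sentence.replace(keyword, PLACEHOLDER, 1)


def templates_from_seeds(seeds: List[str], sentences: List[str]) -> List[str]:
    """Collect one template per (seed keyword, containing sentence) pair."""
    templates: List[str] = []
    for sentence in sentences:
        for seed in seeds:
            template = make_template(sentence, seed)
            if template is not None and template not in templates:
                templates.append(template)
    return templates


preset_corpus = [
    "skilled in using various office software",
    "mastered image processing technique",
    "having data development experience",
]
print(templates_from_seeds(["image processing technique"], preset_corpus))
```

Only the second sentence contains the seed keyword, so a single template is produced from it. Whether the whole remaining sentence or only part of it should form the template is a design choice the sketch does not settle.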
After the corresponding keyword matching template is generated according to the preset first keyword, the generated keyword matching template may be applied.
Step 102: apply the keyword matching template to the obtained corpus to determine a second keyword.
The obtained corpus may consist of entries stored in a corpus built in advance. The entries may be sentences containing the above keywords. Documents such as job description documents and resume documents among job-hunting documents may be segmented into sentences at separators such as semicolons and periods, and the sentences may be saved to form the corpus. It should be noted that the sentences in the corpus may or may not contain keywords; the corpus consists of whatever sentences were obtained in advance. The obtained corpus may also be corpus obtained at the current time. That is, the keyword matching template may be applied to the corpus built in advance to determine a second keyword in it, or it may be applied to corpus obtained at the current time to determine a second keyword contained in that corpus.
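The sentence segmentation described above might be sketched as follows. This is only an illustration under stated assumptions: the separator set (ASCII and full-width semicolons and periods) and the function names `split_into_sentences` and `build_corpus` are hypothetical choices, not details given in the disclosure.

```python
import re
from typing import List

# Assumed separator set: ASCII and full-width semicolons and periods.
SEPARATORS = re.compile(r"[;.；。]")


def split_into_sentences(document: str) -> List[str]:
    """Split a document at the separators and drop empty fragments."""
    return [part.strip() for part in SEPARATORS.split(document) if part.strip()]


def build_corpus(documents: List[str]) -> List[str]:
    """Concatenate the sentences of all documents into one corpus list."""
    corpus: List[str] = []
    for document in documents:
        corpus.extend(split_into_sentences(document))
    return corpus


documents = [
    "having software development experience; mastered image processing technique.",
    "familiar with databases",
]
print(build_corpus(documents))
```

Splitting on every period is deliberately naive; text with decimal numbers or abbreviations would need a more careful rule.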
The second keyword may be a word having the same attribute as the preset first keyword. For example, the shared attribute may be that the word is a skill word as described above, or that the word names a fruit.
When the keyword matching template is applied to the obtained corpus, corpora that successfully match the keyword matching template can be searched for in the obtained corpus, and words in those corpora that have the same attribute as the preset first keyword are determined as second keywords. For example, suppose the keyword matching template determined from one of the above skill words is "having *** development experience". When a corpus is "having software development experience", the corpus may be considered to match the keyword matching template successfully, and "software" in "having software development experience" may then be determined as a second keyword.
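One possible way to apply a template to a sentence is to turn the template into a regular expression with a capture group in place of the placeholder. The sketch below assumes the "***" placeholder notation and whole-sentence matching; `template_to_regex` and `extract_slot` are hypothetical names.

```python
import re
from typing import Optional, Pattern

PLACEHOLDER = "***"  # assumed notation for the keyword slot in a template


def template_to_regex(template: str) -> Pattern[str]:
    """Turn a template with one placeholder into a regex with one capture group."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no placeholder")
    before, _, after = template.partition(PLACEHOLDER)
    # Literal text is escaped; the slot captures one or more characters.
    return re.compile(re.escape(before) + r"(.+?)" + re.escape(after))


def extract_slot(template: str, sentence: str) -> Optional[str]:
    """Return the text filling the slot if the whole sentence matches, else None."""
    match = template_to_regex(template).fullmatch(sentence)
    return match.group(1) if match else None


template = "having *** development experience"
print(extract_slot(template, "having software development experience"))
print(extract_slot(template, "mastered image processing technique"))
```

The sketch returns whatever fills the slot. Checking that the captured word actually has the same attribute as the first keyword is a separate step that is not shown here.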
Step 103: generate a keyword library based on the first keyword and the second keyword.
After the second keyword is determined, the second keyword and the first keyword may be stored to generate the keyword library. Because the first keywords are already stored in the initial lexicon described above, the determined second keyword may simply be added to that lexicon, which can then be regarded as the generated keyword library. For example, suppose the current first keywords are "train", "car", and "bus". If the second keyword currently determined is "coach", the first keywords "train", "car", and "bus" and the second keyword "coach" may be stored, generating a keyword library that contains "train", "car", "bus", and "coach".
After the keyword library is generated based on the first keyword and the second keyword, the second keywords in the current keyword library may be treated as preset first keywords for the next round of second-keyword determination. Corresponding keyword matching templates can then be determined from all keywords in the current keyword library, and further second keywords can be determined. For example, the keywords "train", "car", "bus", and "coach" in the keyword library may all serve as first keywords at the current time. Keyword matching templates may be generated from these first keywords, and a second keyword with the same attribute, such as "high-speed rail", may be determined.
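The iterative expansion described above, in which newly found keywords act as first keywords in the next round, can be sketched as a small loop. This is an assumption-laden illustration: templates are whole sentences with the keyword replaced by "***", the example sentences are invented, and `expand_library` is a hypothetical name.

```python
import re
from typing import List

PLACEHOLDER = "***"  # assumed notation for the keyword slot in a template


def expand_library(seeds: List[str], corpus: List[str], rounds: int = 2) -> List[str]:
    """Grow a keyword library by alternating template generation and matching."""
    library = list(seeds)
    for _ in range(rounds):
        # Build templates from every sentence that contains a known keyword.
        templates = set()
        for sentence in corpus:
            for keyword in library:
                if keyword in sentence:
                    templates.add(sentence.replace(keyword, PLACEHOLDER, 1))
        # Apply the templates and collect slot fillers that are not yet known.
        found: List[str] = []
        for template in templates:
            before, _, after = template.partition(PLACEHOLDER)
            pattern = re.compile(re.escape(before) + r"(.+?)" + re.escape(after))
            for sentence in corpus:
                match = pattern.fullmatch(sentence)
                if match and match.group(1) not in library and match.group(1) not in found:
                    found.append(match.group(1))
        if not found:
            break  # nothing new was discovered, so further rounds cannot help
        library.extend(found)
    return library


corpus = [
    "travel by train",
    "travel by coach",
    "take the coach to work",
    "take the high-speed rail to work",
]
print(expand_library(["train"], corpus))
```

In the first round only "coach" is reachable from the seed; in the second round "coach" yields a new template that reaches "high-speed rail". The sketch accepts every slot filler, whereas the disclosure requires second keywords to share the attribute of the first keywords, so a real system would add such a check.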
Repeating this process enriches the keyword library, so that more key information can be matched when the keyword library is applied.
To judge whether a document contains required key information, a keyword library can be constructed in advance. The keywords in the keyword library can then be matched against the document; if a keyword from the library is found in the document, the document can be determined to contain the key information. For example, in the field of intelligent recruitment, the job-hunting documents of job seekers can be matched against a large number of the skill words described above. That is, how many skill words a job-hunting document contains can be used to judge whether the job seeker may be hired. Illustratively, the more skill words that can be matched in the job-hunting document, the more skills the job seeker has mastered, and the greater the probability of being hired.
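Matching a document against a keyword library, as described above, can be sketched with a plain substring test. The case-insensitive comparison and the name `matched_keywords` are assumptions made for this illustration.

```python
from typing import List


def matched_keywords(document: str, library: List[str]) -> List[str]:
    """Return the library keywords that occur in the document (case-insensitive)."""
    text = document.lower()
    return [keyword for keyword in library if keyword.lower() in text]


library = ["presentation", "image", "database"]
resume = "I can make a presentation and process an image."
hits = matched_keywords(resume, library)
print(hits, len(hits))
```

The number of hits could then feed whatever scoring rule the application uses; the disclosure does not specify one.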
In the prior art, keywords are generally extracted from structured or semi-structured data to form a keyword library. Natural-language corpora are therefore rarely used, and because structured and semi-structured data are scarce, they cannot cover the massive number of keywords contained in natural-language corpora. This wastes resources and makes constructing the keyword library time-consuming and laborious.
In this embodiment, a keyword matching template is first determined based on a preset first keyword, the keyword matching template is then applied to the obtained corpus to determine a second keyword, and finally a keyword library is generated based on the first keyword and the second keyword. With a small number of preset first keywords, second keywords can be determined in a massive corpus through the corresponding keyword matching templates, so a large number of keyword matching templates need not be written manually, which improves the efficiency of constructing the keyword library.
Referring to FIG. 2, a flowchart of another embodiment of a method for generating a keyword library according to the present disclosure is shown. As shown in FIG. 2, the method for generating a keyword library may include the following steps 201 to 206.
The target corpus may be a sentence that contains any one of the first keywords. Examples are the sentence "I can make a presentation", which contains the skill word "presentation", and the sentence "I can process an image", which contains the skill word "image".
For any first keyword, a sentence containing the first keyword can be searched in the obtained corpus, and the searched sentence is determined as a target corpus corresponding to the first keyword.
In some application scenarios, every preset first keyword is required to have a corresponding target corpus in the obtained corpus; in that case, the preset first keywords may be extracted from the obtained corpus itself. The number to extract may be determined from the number of obtained corpora: the corpora are counted first, and then a preset share of preset first keywords is extracted. The preset share may be, for example but not limited to, 5% or 7%. For example, if 1,000 corpora are counted and the preset share is 5%, 50 words in the corpora that satisfy the first-keyword attribute may be extracted as preset first keywords to form the initial lexicon.
When target corpora are searched for in the obtained corpus, a given first keyword may yield no target corpus, exactly one target corpus, or several target corpora. After the target corpora are found, a first preset number of characters adjacent to the corresponding first keyword may be extracted from each target corpus.
The first preset number may be, for example but not limited to, 2 or 3. That is, the 2 characters before and the 2 characters after the first keyword may be extracted from each target corpus to generate a corresponding keyword matching template. In some application scenarios, the number of characters extracted before the first keyword may differ from the number extracted after it. For example, if the preset first keywords include "database", all corpora containing "database" may be searched for in the obtained corpus and determined as the target corpora corresponding to "database". If a target corpus found is "having database development experience", the 2 characters before "database" ("having") and the 2 characters after it ("development") may be extracted to obtain the keyword matching template "having *** development"; alternatively, the 2 characters before ("having") and the 4 characters after ("development experience") may be extracted to obtain the keyword matching template "having *** development experience". (The character counts refer to the original Chinese text.) One target corpus may generate one corresponding keyword matching template, so multiple target corpora may generate multiple keyword matching templates.
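The context-window extraction described above can be sketched as follows. Note one deliberate deviation: the disclosure counts adjacent characters, which suits Chinese text, whereas this sketch counts whitespace-separated words so that the English examples work. The placeholder notation and the name `context_template` are likewise assumptions.

```python
from typing import Optional

PLACEHOLDER = "***"  # assumed notation for the keyword slot in a template


def context_template(sentence: str, keyword: str,
                     n_before: int = 1, n_after: int = 1) -> Optional[str]:
    """Build a template from the words adjacent to the keyword.

    Keeps `n_before` words before and `n_after` words after the first
    occurrence of the keyword; returns None if the keyword is absent.
    """
    words = sentence.split()
    keyword_words = keyword.split()
    k = len(keyword_words)
    for i in range(len(words) - k + 1):
        if words[i:i + k] == keyword_words:
            before = words[max(0, i - n_before):i]
            after = words[i + k:i + k + n_after]
            return " ".join(before + [PLACEHOLDER] + after)
    return None


sentence = "having database development experience"
print(context_template(sentence, "database", 1, 1))
print(context_template(sentence, "database", 1, 2))
```

Using different window sizes before and after the keyword mirrors the case in which the two preset numbers differ.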
After the target corpora are found, the target corpora and the remaining corpora (i.e., the non-target corpora) may be stored separately. Then, when a keyword matching template is matched against the obtained corpus, only the non-target corpora need to be matched rather than all corpora, which reduces the amount of computation to some extent. For example, suppose the keyword matching template "having *** development experience" is derived from the target corpus "having database development experience". After the target and non-target corpora are stored separately, the target corpus "having database development experience" can be excluded, and only the non-target corpora, such as "having front-end development experience", "having software development experience", and "having three years of development experience", are matched.
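Separating target corpora from non-target corpora, as described above, amounts to a simple partition. The sketch below assumes that containment is a plain substring test; `partition_corpus` is a hypothetical name.

```python
from typing import List, Tuple


def partition_corpus(corpus: List[str],
                     first_keywords: List[str]) -> Tuple[List[str], List[str]]:
    """Split the corpus into sentences that contain a first keyword and the rest."""
    target: List[str] = []
    non_target: List[str] = []
    for sentence in corpus:
        if any(keyword in sentence for keyword in first_keywords):
            target.append(sentence)
        else:
            non_target.append(sentence)
    return target, non_target


corpus = [
    "having database development experience",
    "having front-end development experience",
    "having software development experience",
    "having three years of development experience",
]
target, non_target = partition_corpus(corpus, ["database"])
print(target)
print(non_target)
```

Templates derived from the target list then only need to be matched against the non-target list.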
Step 204: match the keyword matching template against the non-target corpus to obtain at least one matching result.
The matching result may be the content of each successfully matched non-target corpus. For example, for the keyword matching template "having *** development experience", if the non-target corpus contains "having rich development experience", that sentence matches the template successfully even though "rich" is not a skill word, so "having rich development experience" may be determined as a matching result obtained from the template.
Step 205: determine the second keyword according to the at least one matching result.
For each keyword matching template, matching the template against the non-target corpus may yield multiple matching results. From any one matching result, the second keyword corresponding to that result can be determined. For example, if the corpus "having embedded development experience" in the non-target corpus matches the template "having *** development experience", the word "embedded" in that corpus may be determined as a second keyword.
In some application scenarios, the second keyword may be determined according to whether key characters exist in the non-target corpus. Key characters are essentially the characters that can constitute a second keyword. That is, whether the characters represented by "***" in the keyword matching template are key characters may be judged against the preset attribute that qualifies a word as a first keyword. For example, it may be judged whether "rich" is a skill word; only if it were would its characters be treated as key characters. Here "rich" is clearly not a skill word, so although "having rich development experience" is a matching result, "rich" cannot be determined as a second keyword.
In some optional implementations, the step 205 may specifically include the following steps:
Step 2051: determine at least one candidate second keyword from the at least one matching result.
That is, multiple candidate second keywords may be determined from the multiple matching results; the candidates may include words identical to first keywords as well as words that differ from them.
Step 2052: deduplicate the at least one candidate second keyword to obtain the second keyword.
That is, it may be detected whether any candidate second keyword is identical to a first keyword. The second keywords are obtained by removing the candidates that are identical to first keywords. For example, suppose the candidate second keywords determined from the matching results are "apple", "banana", and "strawberry", and the first keywords at the current time are "pineapple" and "banana". The candidate "banana" is removed as a duplicate, leaving "apple" and "strawberry" as the second keywords at the current time.
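The deduplication described above can be sketched in a few lines. The disclosure only mentions removing candidates that equal a first keyword; also dropping repeats among the candidates themselves is an added assumption of this sketch, and `deduplicate` is a hypothetical name.

```python
from typing import List


def deduplicate(candidates: List[str], first_keywords: List[str]) -> List[str]:
    """Drop candidates that equal a first keyword or repeat an earlier candidate."""
    seen = set(first_keywords)
    second_keywords: List[str] = []
    for word in candidates:
        if word not in seen:
            seen.add(word)
            second_keywords.append(word)
    return second_keywords


first_keywords = ["pineapple", "banana"]
candidates = ["apple", "banana", "strawberry"]
second_keywords = deduplicate(candidates, first_keywords)
print(second_keywords)
print(first_keywords + second_keywords)  # contents of the resulting keyword library
```

Keeping the original order of the candidates is a convenience for reproducible output, not a requirement stated in the text.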
Step 206: generate a keyword library according to the first keyword and the second keyword.
After the second keyword is obtained, the first keyword and the obtained second keyword may be stored to generate the keyword library at the current time.
In this embodiment, the keyword matching template is determined by extracting the first preset number of characters before and after the first keyword in the target corpus, the second keyword is determined through the deduplication operation, and finally the second keyword and the first keyword are stored to generate the keyword library. The operation is simple, convenient, and fast.
In some other embodiments, the step 101 may include the following steps:
Search the obtained corpus for a target corpus containing a first keyword; then perform semantic analysis on the target corpus, and determine a keyword matching template based on the semantic analysis result.
Semantic analysis can be performed on each target corpus to obtain the semantic analysis result corresponding to that target corpus. During semantic analysis, the characters in the target corpus are first segmented into words; each segmented word is then judged as to whether it can form a collocation relationship with the first keyword, and the corresponding keyword matching template is determined according to the collocation relationship. Word segmentation here means converting a sentence into a sequence of words. The collocation relationship may be, for example, whether the segmented word and the first keyword can together form a natural, fluent expression. For example, suppose the target corpus found for the first keyword "database" is "having rich database development experience". Semantic analysis may be performed on this target corpus: "having rich database development experience" is segmented into the words "having", "rich", "database", "development", and "experience"; the first keyword "database" is then identified among them, and "having", "development", and "experience" are determined to be words that can form a collocation relationship with "database". The corresponding keyword matching template is then determined to be "having *** development experience".
In some other embodiments, the step 101 may include the following steps:
Step one: determine at least one candidate keyword matching template according to the first keyword.
Among the keyword matching templates generated from the preset first keywords, some may match only the target corpus they were generated from, and some may match only a very small number of corpora (e.g., 3 corpora). For example, suppose the target corpus found for the first keyword "database" is "having database and embedded development experience"; the keyword matching template generated by extracting the 2 adjacent characters of the first keyword in the target corpus may be "having *** and embedded". If this keyword matching template is used to determine second keywords, it may match nothing but the target corpus itself. Therefore, keyword matching templates that match only a small number of corpora can be filtered out, and the remaining templates are determined as the candidate keyword matching templates.
And step two, determining the credibility of each candidate keyword matching template according to a detection matching result obtained by applying at least one candidate keyword matching template to the detection corpus.
The detection corpus may be a part of the acquired corpus.
After the candidate keyword matching templates are determined, the credibility of each candidate keyword matching template can be calculated. In some optional implementations, the credibility of a candidate keyword matching template may be calculated through the following steps 1 to 4.
Step 1, extracting any candidate keyword matching template from at least one candidate keyword matching template, and determining the extracted candidate keyword matching template as a target candidate keyword matching template with the credibility to be calculated at the current moment.
That is, any one of the candidate keyword matching templates may be determined as the target candidate keyword matching template, and then the credibility of the determined target candidate keyword matching template may be calculated.
Step 2: counting the total number of detection corpora successfully matched by the target candidate keyword matching template, recorded as the first number. For example, when the target candidate keyword matching template is "with * development experience" as described above, the number of obtained corpora that can be successfully matched by "with * development experience" is determined.
Step 3: counting the total number of first keywords contained in the detection corpora successfully matched by the target candidate keyword matching template, recorded as the second number. For example, when the target candidate keyword matching template is "with * development experience", the second number is the number of successfully matched corpora in which the content represented by "*" is the first keyword, that is, the number of target corpora among the successfully matched corpora.
Step 4: determining the credibility of the target candidate keyword matching template according to the first number and the second number.
After the first number and the second number are obtained, the ratio of the second number to the first number can be calculated and taken as the credibility of the target candidate keyword matching template, i.e., credibility = second number / first number. For example, when the target candidate keyword matching template is "with * development experience", the number of obtained corpora that can be successfully matched by it is 100, and the number of first keywords contained in those 100 corpora is 70, the credibility of the candidate keyword matching template is 70/100 = 70%.
Step three: determining each candidate keyword matching template whose credibility is greater than a preset credibility threshold as a keyword matching template.
After the credibility of each candidate keyword matching template is obtained, the candidate keyword matching templates with lower credibility can be filtered out according to the preset credibility threshold. The preset credibility threshold here may be, for example, but is not limited to, 80%, 95%, etc.
After these two rounds of filtering, only keyword matching templates with higher credibility remain. By calculating the credibility of the keyword matching templates, candidate keyword matching templates that cannot really be used, such as the above-mentioned "have * and embed", can be eliminated in advance. The accuracy of the determined second keyword can thereby be improved.
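The credibility calculation and threshold filtering described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the disclosed implementation: templates are assumed to contain exactly one "*" placeholder, matching is assumed to be done with a regular expression, and the names `template_to_regex`, `credibility`, and `filter_templates` are hypothetical. The second number is interpreted here as the count of matched corpora whose placeholder content is a first keyword.

```python
import re
from typing import Iterable, List, Set


def template_to_regex(template: str):
    """Compile a template with exactly one '*' into a regex that captures the slot."""
    if template.count("*") != 1:
        raise ValueError("template must contain exactly one '*' placeholder")
    left, right = template.split("*")
    return re.compile(re.escape(left) + r"(.+?)" + re.escape(right))


def credibility(template: str, detection_corpus: Iterable[str],
                first_keywords: Set[str]) -> float:
    """Return second number / first number for one candidate template."""
    pattern = template_to_regex(template)
    first_number = 0   # detection corpora matched by the template
    second_number = 0  # matched corpora whose slot holds a first keyword
    for text in detection_corpus:
        match = pattern.search(text)
        if match is None:
            continue
        first_number += 1
        if match.group(1).strip() in first_keywords:
            second_number += 1
    return second_number / first_number if first_number else 0.0


def filter_templates(candidates: List[str], detection_corpus: List[str],
                     first_keywords: Set[str], threshold: float) -> List[str]:
    """Keep candidates whose credibility is greater than the threshold."""
    return [t for t in candidates
            if credibility(t, detection_corpus, first_keywords) > threshold]


detection = [
    "with database development experience",
    "with database development experience in finance",
    "with Java development experience",
    "with database development experience preferred",
    "familiar with Linux",
]
score = credibility("with * development experience", detection, {"database"})
kept = filter_templates(
    ["with * development experience", "have * and embed"],
    detection, {"database"}, threshold=0.7)
print(score)  # 0.75
print(kept)
```

In this toy detection corpus the first template matches four corpora, three of which hold "database" in the slot, so it passes a 0.7 threshold; the second template matches nothing and is discarded.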
In some optional implementations, the method for generating a keyword library described above may further include stopping execution of the method in response to determining that at least one of the following events occurs:
Event 1: a second preset number of second keywords consecutively determined by the keyword matching template coincide with keywords already in the keyword library.
The method for generating a keyword library described above may be stopped when event 1 is satisfied. That is, if, among the corpora successfully matched by the keyword matching template, a number of consecutively determined second keywords are all already present in the keyword library, the currently obtained corpus can be considered to have no further value for mining second keywords, and the method may be stopped. The second preset number here may be, for example, but is not limited to, 30, 40, etc.
Event 2: the number of times the keyword matching template has been used reaches a preset count threshold.
The method for generating a keyword library described above may also be stopped when event 2 is satisfied. That is, if it is determined that the number of uses of the keyword matching template has reached the preset count threshold, the method may be stopped. The preset count threshold may be, for example, but is not limited to, 10 times, 20 times, and the like. Here, the usage count is incremented by 1 each time the keyword matching template is used. When the incremented count reaches the preset count threshold, the obtained corpus can be considered to have no further value for mining second keywords, and the method may be stopped.
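The two stop events can be tracked with a small amount of state, as in the following sketch. The class `StopMonitor` and its field names are hypothetical, and resetting the consecutive counter whenever a new keyword is found is one interpretation of "consecutively determined"; the text does not spell this out.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class StopMonitor:
    max_consecutive_known: int  # the "second preset number", e.g. 30
    max_template_uses: int      # the preset count threshold, e.g. 10
    consecutive_known: int = 0
    template_uses: int = 0

    def record_template_use(self) -> None:
        """Increment the usage count by 1 each time the template is used."""
        self.template_uses += 1

    def record_keyword(self, keyword: str, library: Set[str]) -> None:
        """Track how many second keywords in a row were already in the library."""
        if keyword in library:
            self.consecutive_known += 1
        else:
            self.consecutive_known = 0  # assumed: a new keyword breaks the run
            library.add(keyword)

    def should_stop(self) -> bool:
        """True when event 1 or event 2 has occurred."""
        return (self.consecutive_known >= self.max_consecutive_known
                or self.template_uses >= self.max_template_uses)


library = {"database"}
monitor = StopMonitor(max_consecutive_known=2, max_template_uses=10)
monitor.record_template_use()
for keyword in ["Java", "database", "Java"]:
    monitor.record_keyword(keyword, library)
print(monitor.should_stop())  # True: two known keywords in a row
```

Small thresholds are used here only to keep the example short; the text suggests values such as 30 or 40 for the second preset number.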
Referring to fig. 3, which is a schematic structural diagram of an embodiment of an apparatus for generating a keyword library according to the present disclosure. As shown in fig. 3, the apparatus for generating a keyword library includes a first determining module 301, a second determining module 302, and a generating module 303. The first determining module 301 is configured to determine a keyword matching template based on a preset first keyword; the second determining module 302 is configured to apply the keyword matching template to the obtained corpus and determine a second keyword; and the generating module 303 is configured to generate a keyword library based on the first keyword and the second keyword.
It should be noted that specific processing of the first determining module 301, the second determining module 302, and the generating module 303 of the apparatus for generating a keyword library and technical effects brought by the processing can refer to the related descriptions of step 101 to step 103 in the corresponding embodiment of fig. 1, which are not described herein again.
In some optional implementations of this embodiment, the first determining module 301 is further configured to: search the obtained corpus for a target corpus containing the first keyword; and extract a preset number of adjacent characters of the first keyword from the target corpus, and generate a corresponding keyword matching template based on the adjacent characters.
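A character-based version of this template generation might look like the sketch below. It assumes that the preset number of adjacent characters is taken on both sides of the first occurrence of the keyword and that "*" marks the keyword slot; the function name `templates_from_adjacent_chars` is hypothetical.

```python
from typing import List


def templates_from_adjacent_chars(corpus: List[str], first_keyword: str,
                                  n_adjacent: int,
                                  placeholder: str = "*") -> List[str]:
    """For each target corpus (one containing the keyword), keep n_adjacent
    characters on each side of the first occurrence and replace the keyword
    with a placeholder."""
    templates = []
    for text in corpus:
        start = text.find(first_keyword)
        if start == -1:
            continue  # not a target corpus
        end = start + len(first_keyword)
        left = text[max(0, start - n_adjacent):start]
        right = text[end:end + n_adjacent]
        templates.append(f"{left}{placeholder}{right}")
    return templates


corpus = ["with database development experience", "familiar with Linux"]
templates = templates_from_adjacent_chars(corpus, "database", n_adjacent=5)
print(templates)  # ['with * deve']
```

Because the window is counted in characters, the right-hand context can end in the middle of a word, which is one reason such templates may later be filtered out for matching too few corpora.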
In some optional implementations of this embodiment, the apparatus for generating a keyword library further includes: a classification module configured to store the target corpus and the other, non-target corpora separately; and the second determining module 302 is further configured to: match the keyword matching template against the non-target corpora to obtain at least one matching result; and determine a second keyword from the at least one matching result.
In some optional implementations of the present embodiment, the second determining module 302 is further configured to: determine at least one candidate second keyword from the at least one matching result; and deduplicate the at least one candidate second keyword to obtain the second keyword.
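The classification, matching, and deduplication steps can be illustrated with the following sketch. It assumes regular-expression matching and a template containing exactly one "*" placeholder; the function names `split_corpus` and `second_keywords` are hypothetical.

```python
import re
from typing import List, Tuple


def split_corpus(corpus: List[str], first_keyword: str) -> Tuple[List[str], List[str]]:
    """Separate target corpora (containing the keyword) from non-target corpora."""
    target = [t for t in corpus if first_keyword in t]
    non_target = [t for t in corpus if first_keyword not in t]
    return target, non_target


def second_keywords(template: str, non_target: List[str]) -> List[str]:
    """Match the template against non-target corpora and deduplicate the
    captured candidates, keeping first-seen order."""
    left, right = template.split("*")  # assumes exactly one '*' in the template
    pattern = re.compile(re.escape(left) + r"(.+?)" + re.escape(right))
    candidates = [m.group(1).strip()
                  for text in non_target
                  for m in pattern.finditer(text)]
    return list(dict.fromkeys(candidates))


corpus = [
    "with database development experience",
    "with Java development experience",
    "with Python development experience",
    "with Java development experience preferred",
]
target, non_target = split_corpus(corpus, "database")
found = second_keywords("with * development experience", non_target)
print(found)  # ['Java', 'Python']
```

Matching only the non-target corpora avoids rediscovering the first keyword, and `dict.fromkeys` is used simply because it removes duplicates while preserving order.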
In some optional implementations of this embodiment, the first determining module 301 is further configured to: searching a target corpus containing a first keyword in the obtained corpus; and performing semantic analysis on the target corpus, and determining a keyword matching template based on a semantic analysis result.
In some optional implementations of this embodiment, the first determining module 301 is further configured to: determining at least one candidate keyword matching template according to the first keyword; determining the credibility of each candidate keyword matching template according to a detection matching result obtained by applying at least one candidate keyword matching template to the detection corpus; and determining the candidate keyword matching template with the corresponding credibility value larger than a preset credibility threshold value as the keyword matching template.
In some optional implementations of this embodiment, the apparatus for generating a keyword library further includes: a termination module configured to stop execution of the method for generating a keyword library in response to determining that at least one of the following events occurs: a second preset number of second keywords consecutively determined by the keyword matching template coincide with keywords in the keyword library; the number of times the keyword matching template has been used reaches a preset count threshold.
Referring to fig. 4, an exemplary system architecture in which the method for generating a keyword library of one embodiment of the present disclosure may be applied is illustrated.
As shown in FIG. 4, the system architecture may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 is the medium that provides communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others.
The terminal devices 401, 402, 403 may interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have various client applications installed thereon, such as a video distribution application, a search-type application, and a news-information-type application.
When the terminal devices 401, 402, 403 are hardware, they may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 405 may be a server that can provide various services, for example, receive the corpus acquisition request sent by the terminal devices 401, 402, 403, analyze the corpus acquisition request, and send the analysis result (e.g., corpus corresponding to the acquisition request) to the terminal devices 401, 402, 403.
It should be noted that the method for generating the keyword library provided by the embodiment of the present disclosure may be executed by a terminal device, and accordingly, the apparatus for generating the keyword library may be disposed in the terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, a block diagram of an electronic device (e.g., the server of FIG. 4) suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and communication devices 509. The communication devices 509 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determining a keyword matching template based on a preset first keyword; applying the keyword matching template to the obtained corpus to determine a second keyword; a keyword library is generated based on the first keyword and the second keyword.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not, in some cases, constitute a limitation on the module itself; for example, the generating module may also be described as a "module that generates a keyword library based on a first keyword and a second keyword".
For example, without limitation, exemplary types of hardware logic that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and so forth.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (16)
1. A method for generating a keyword library, comprising:
determining a keyword matching template based on a preset first keyword;
applying the keyword matching template to the obtained corpus to determine a second keyword;
and generating a keyword library based on the first keyword and the second keyword.
2. The method according to claim 1, wherein determining the keyword matching template based on the preset first keyword comprises:
searching a target corpus containing the first keyword in the acquired corpus; and
extracting a first preset number of adjacent characters of the first keyword from the target corpus, and generating a corresponding keyword matching template based on the adjacent characters.
3. The method of claim 2, further comprising:
storing the target corpus and the other, non-target corpora separately; and
wherein applying the keyword matching template to the obtained corpus to determine a second keyword comprises:
matching the keyword matching template with the non-target corpus to obtain at least one matching result;
determining the second keyword from the at least one matching result.
4. The method of claim 3, wherein determining the second keyword from the at least one matching result comprises:
determining at least one candidate second keyword from the at least one matching result;
and deduplicating the at least one candidate second keyword to obtain the second keyword.
5. The method according to claim 1, wherein determining the keyword matching template based on the preset first keyword comprises:
searching a target corpus containing the first keyword in the acquired corpus; and
performing semantic analysis on the target corpus, and determining the keyword matching template based on a semantic analysis result.
6. The method according to claim 1, wherein determining the keyword matching template based on the preset first keyword comprises:
determining at least one candidate keyword matching template according to the first keyword;
determining the credibility of each candidate keyword matching template according to a detection matching result obtained by applying the at least one candidate keyword matching template to the detection corpus;
and determining the candidate keyword matching template with the corresponding credibility value larger than a preset credibility threshold value as the keyword matching template.
7. The method of claim 1, further comprising:
stopping execution of the method for generating a keyword library in response to determining that at least one of the following events occurs:
a second preset number of second keywords consecutively determined by the keyword matching template coincide with keywords in the keyword library;
the number of times the keyword matching template has been used reaches a preset count threshold.
8. An apparatus for generating a keyword library, comprising:
a first determining module configured to determine a keyword matching template based on a preset first keyword;
a second determining module configured to apply the keyword matching template to the obtained corpus and determine a second keyword; and
a generating module configured to generate a keyword library based on the first keyword and the second keyword.
9. The apparatus of claim 8, wherein the first determining module is further configured to:
searching a target corpus containing the first keyword in the acquired corpus; and
extracting a first preset number of adjacent characters of the first keyword from the target corpus, and generating a corresponding keyword matching template based on the adjacent characters.
10. The apparatus of claim 9, further comprising:
a classification module configured to store the target corpus and the other, non-target corpora separately; and
wherein the second determining module is further configured to:
matching the keyword matching template with the non-target corpus to obtain at least one matching result;
determining the second keyword from the at least one matching result.
11. The apparatus of claim 10, wherein the second determining module is further configured to:
determining at least one candidate second keyword from the at least one matching result;
and deduplicating the at least one candidate second keyword to obtain the second keyword.
12. The apparatus of claim 8, wherein the first determining module is further configured to:
searching a target corpus containing the first keyword in the acquired corpus; and
performing semantic analysis on the target corpus, and determining the keyword matching template based on a semantic analysis result.
13. The apparatus of claim 8, wherein the first determining module is further configured to:
determining at least one candidate keyword matching template according to the first keyword;
determining the credibility of each candidate keyword matching template according to a detection matching result obtained by applying the at least one candidate keyword matching template to the detection corpus;
and determining the candidate keyword matching template with the corresponding credibility value larger than a preset credibility threshold value as the keyword matching template.
14. The apparatus of claim 8, further comprising:
a termination module configured to stop execution of the method for generating a keyword library in response to determining that at least one of the following events occurs:
a second preset number of second keywords consecutively determined by the keyword matching template coincide with keywords in the keyword library;
the number of times the keyword matching template has been used reaches a preset count threshold.
15. An electronic device, comprising:
one or more processors;
storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
16. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010272926.1A CN111488450A (en) | 2020-04-08 | 2020-04-08 | Method and device for generating keyword library and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111488450A (en) | 2020-08-04 |
Family
ID=71791865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010272926.1A Pending CN111488450A (en) | 2020-04-08 | 2020-04-08 | Method and device for generating keyword library and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488450A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113360779A (en) * | 2021-08-09 | 2021-09-07 | 智者四海(北京)技术有限公司 | Content recommendation method and device, computer equipment and readable medium |
CN114372446A (en) * | 2021-12-13 | 2022-04-19 | 北京五八信息技术有限公司 | Vehicle attribute labeling method, device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060206306A1 (en) * | 2005-02-09 | 2006-09-14 | Microsoft Corporation | Text mining apparatus and associated methods |
CN101369265A (en) * | 2008-01-14 | 2009-02-18 | 北京百问百答网络技术有限公司 | Method and system for automatically generating semantic template of problem |
CN108959256A (en) * | 2018-06-29 | 2018-12-07 | 北京百度网讯科技有限公司 | Generation method, device, storage medium and the terminal device of short text |
CN110059163A (en) * | 2019-04-29 | 2019-07-26 | 百度在线网络技术(北京)有限公司 | Generate method and apparatus, the electronic equipment, computer-readable medium of template |
CN110287466A (en) * | 2019-06-24 | 2019-09-27 | 腾讯科技(深圳)有限公司 | A kind of physical template generation method and device |
CN110427492A (en) * | 2019-07-10 | 2019-11-08 | 阿里巴巴集团控股有限公司 | Generate the method, apparatus and electronic equipment of keywords database |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113360779A (en) * | 2021-08-09 | 2021-09-07 | 智者四海(北京)技术有限公司 | Content recommendation method and device, computer equipment and readable medium |
CN114372446A (en) * | 2021-12-13 | 2022-04-19 | 北京五八信息技术有限公司 | Vehicle attribute labeling method, device and storage medium |
CN114372446B (en) * | 2021-12-13 | 2023-02-17 | 北京爱上车科技有限公司 | Vehicle attribute labeling method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200804 |