CN110489757A - A keyword extraction method and device - Google Patents
- Publication number
- CN110489757A (application number CN201910789844.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- candidate word
- node
- weight
- processed
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention provide a keyword extraction method and device. A target text library corresponding to the text type of a text to be processed is obtained; a first weight is calculated for each candidate word of the text to be processed based on the target text library; a second weight is calculated for each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in a candidate word graph; and the keywords of the text to be processed are determined from the candidate words based on the first weight and the second weight. Because the texts in the target text library have the same text type as the text to be processed, the first weight determined from the target text library effectively reflects whether each candidate word expresses the theme of the text to be processed. The second weight, determined from the co-occurrence counts of the candidate words, reflects the degree of association between candidate words. Combining the first weight and the second weight therefore improves the accuracy of the keywords determined for the text to be processed.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a keyword extraction method and device.
Background art
With the development of computer technology, the Internet provides a massive amount of online text. Because keywords briefly summarize the theme a text expresses, a user who needs to select a target text from this mass of online text can search by keyword and thereby quickly locate the text needed.
In the prior art, to determine the keywords of a text (which may be called the text to be processed), word segmentation is performed on the text to be processed to obtain multiple candidate words. For each candidate word, the word frequency of the candidate word in the text to be processed is calculated, and the number of texts in a preset text library that contain the candidate word (which may be called the first number) is counted. From the first number and the number of all texts in the preset text library (which may be called the second number), the inverse document frequency of the candidate word is obtained as the logarithm of the ratio of the second number to the first number. The product of the word frequency of the candidate word in the text to be processed and its inverse document frequency is then taken as the weight of the candidate word, and a preset number of candidate words with the highest weights are taken as the keywords of the text to be processed.
However, because the texts in the preset text library have low relevance to the text to be processed, the candidate-word weights determined from the preset text library are of limited validity, which in turn lowers the accuracy of the determined keywords.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a keyword extraction method and device that can improve the accuracy of the keywords determined for a text to be processed.
In a first aspect, to achieve the above purpose, an embodiment of the present invention provides a keyword extraction method, the method comprising:
Obtaining a target text library corresponding to the text type of a text to be processed, wherein the texts in the target text library have the same text type as the text to be processed;
Calculating, based on the target text library, a first weight for each candidate word of the text to be processed, wherein the first weight is determined from the word frequency of the candidate word in the text to be processed and its inverse document frequency in the target text library;
Calculating a second weight for each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in a candidate word graph, wherein the nodes in the candidate word graph correspond one-to-one to the candidate words;
Determining the keywords of the text to be processed from the candidate words based on the first weight and the second weight.
Optionally, before obtaining the target text library corresponding to the text type of the text to be processed, the method further comprises:
Obtaining the word vector corresponding to each candidate word as a candidate word vector;
Determining the text type of the text to be processed according to the candidate word vectors and a pre-trained type prediction network model;
The type prediction network model is trained on a preset training set comprising multiple training samples. Each training sample comprises the word vectors corresponding to the candidate words of one sample text and the type distribution vector corresponding to that sample text, and the type distribution vector represents the probability that the text type of the sample text is each of the preset text types.
Optionally, calculating the second weight of each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph comprises:
Calculating the score of each node in the candidate word graph according to an iterative formula whose symbols are defined as follows:
v_i denotes the i-th node in the candidate word graph; S(v_i) denotes the score of node v_i; d denotes the damping coefficient; In(v_i) denotes the set of nodes in the candidate word graph that point to node v_i; Out(v_i) denotes the set of nodes in the candidate word graph that node v_i points to; v_j denotes the j-th node in the candidate word graph; W_ij denotes the co-occurrence count of the candidate word corresponding to node v_i and the candidate word corresponding to node v_j; S(v_j) denotes the score of node v_j; v_k denotes the k-th node in Out(v_i); and W_jk denotes the co-occurrence count of the candidate word corresponding to node v_j and the candidate word corresponding to node v_k;
When the preset convergence condition is met, taking the score of each node as the second weight of the candidate word corresponding to that node.
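The symbol definitions above match the weighted TextRank update. As an assumption, not a quotation of the patent formula, that update can be written as follows, taking the inner sum over the nodes that v_j points to (as in standard TextRank, whereas the text above states Out(v_i)) and relying on the symmetry of co-occurrence counts, W_{ij} = W_{ji}:

```latex
S(v_i) = (1 - d) + d \sum_{v_j \in In(v_i)} \frac{W_{ij}}{\sum_{v_k \in Out(v_j)} W_{jk}} \, S(v_j)
```

If the co-occurrence edges are treated as undirected, In(v) and Out(v) both reduce to the set of neighbours of v.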
Optionally, before taking the score of each node as the second weight of the candidate word corresponding to that node, the method further comprises:
For each node, calculating the absolute value of the difference between the score obtained for the node in the current iteration and the score obtained in the previous iteration, as the score difference of the node;
If every calculated score difference is less than a preset value, determining that the preset convergence condition is met.
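The iteration and the convergence test described above can be sketched in Python as follows. This is a minimal sketch that assumes the weighted TextRank form of the update, undirected co-occurrence edges, and illustrative values for the damping coefficient (0.85), the preset value (1e-6) and an iteration cap. None of these values come from the text, and the function name `textrank_scores` is hypothetical.

```python
from collections import defaultdict


def textrank_scores(cooccurrence, d=0.85, tol=1e-6, max_iter=200):
    """Score nodes of an undirected co-occurrence graph.

    cooccurrence maps a pair of candidate words (tuple or frozenset of two
    words) to their co-occurrence count. d, tol and max_iter are assumed
    values, not values taken from the patent text.
    """
    # Build a symmetric weight table: weights[a][b] == weights[b][a].
    weights = defaultdict(dict)
    for pair, count in cooccurrence.items():
        a, b = tuple(pair)
        if count > 0 and a != b:
            weights[a][b] = count
            weights[b][a] = count

    # Sum of edge weights leaving each node (the inner sum of the update).
    out_sum = {v: sum(nbrs.values()) for v, nbrs in weights.items()}
    scores = {v: 1.0 for v in weights}

    for _ in range(max_iter):
        new_scores = {}
        for vi, nbrs in weights.items():
            rank = sum(w / out_sum[vj] * scores[vj] for vj, w in nbrs.items())
            new_scores[vi] = (1 - d) + d * rank
        # Score difference of each node between this and the last iteration.
        diffs = [abs(new_scores[v] - scores[v]) for v in scores]
        scores = new_scores
        # Convergence: every score difference is below the preset value.
        if all(diff < tol for diff in diffs):
            break
    return scores


if __name__ == "__main__":
    print(textrank_scores({("a", "b"): 1, ("b", "c"): 1}))
```

In the small usage example, word "b" co-occurs with both other words and therefore ends with the highest score.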
Optionally, determining the keywords of the text to be processed from the candidate words based on the first weight and the second weight comprises:
For each candidate word, calculating the target weight of the candidate word according to its first weight, its second weight and a first preset formula, the first preset formula being:
W = α × P + β × S
where W denotes the target weight of the candidate word, P denotes the first weight of the candidate word, α denotes a first coefficient, S denotes the second weight of the candidate word, and β denotes a second coefficient;
Selecting, according to the magnitude of the calculated target weights, a preset number of candidate words from the candidate words as the keywords of the text to be processed.
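The combination step can be sketched in Python as below. The coefficient values, the preset number and the choice to select the candidate words with the largest target weights are assumptions made for illustration, and `select_keywords` is a hypothetical name.

```python
def select_keywords(first_weights, second_weights, alpha=0.5, beta=0.5, top_n=3):
    """Combine the two weights as W = alpha * P + beta * S and pick top_n words.

    alpha, beta and top_n are assumed values, not values from the text.
    """
    target = {
        word: alpha * first_weights[word] + beta * second_weights[word]
        for word in first_weights
    }
    # Sort by descending target weight; ties are broken alphabetically.
    ranked = sorted(target, key=lambda word: (-target[word], word))
    return ranked[:top_n], target


if __name__ == "__main__":
    P = {"a": 0.2, "b": 0.1, "c": 0.4}   # first weights (made-up values)
    S = {"a": 1.0, "b": 1.5, "c": 0.5}   # second weights (made-up values)
    keywords, weights = select_keywords(P, S, top_n=2)
    print(keywords, weights)
```

Because the first weight and the second weight may lie on different scales, the coefficients α and β also control how much each weight contributes in practice.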
In a second aspect, to achieve the above purpose, an embodiment of the present invention provides a keyword extraction device, the device comprising:
An obtaining module, configured to obtain a target text library corresponding to the text type of a text to be processed, wherein the texts in the target text library have the same text type as the text to be processed;
A first processing module, configured to calculate, based on the target text library, a first weight for each candidate word of the text to be processed, wherein the first weight is determined from the word frequency of the candidate word in the text to be processed and its inverse document frequency in the target text library;
A second processing module, configured to calculate a second weight for each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in a candidate word graph, wherein the nodes in the candidate word graph correspond one-to-one to the candidate words;
A determining module, configured to determine the keywords of the text to be processed from the candidate words based on the first weight and the second weight.
Optionally, the device further comprises:
A third processing module, configured to obtain the word vector corresponding to each candidate word as a candidate word vector;
and to determine the text type of the text to be processed according to the candidate word vectors and a pre-trained type prediction network model;
The type prediction network model is trained on a preset training set comprising multiple training samples. Each training sample comprises the word vectors corresponding to the candidate words of one sample text and the type distribution vector corresponding to that sample text, and the type distribution vector represents the probability that the text type of the sample text is each of the preset text types.
Optionally, the second processing module is specifically configured to calculate the score of each node in the candidate word graph according to an iterative formula whose symbols are defined as follows:
v_i denotes the i-th node in the candidate word graph; S(v_i) denotes the score of node v_i; d denotes the damping coefficient; In(v_i) denotes the set of nodes in the candidate word graph that point to node v_i; Out(v_i) denotes the set of nodes in the candidate word graph that node v_i points to; v_j denotes the j-th node in the candidate word graph; W_ij denotes the co-occurrence count of the candidate word corresponding to node v_i and the candidate word corresponding to node v_j; S(v_j) denotes the score of node v_j; v_k denotes the k-th node in Out(v_i); and W_jk denotes the co-occurrence count of the candidate word corresponding to node v_j and the candidate word corresponding to node v_k;
and, when the preset convergence condition is met, to take the score of each node as the second weight of the candidate word corresponding to that node.
Optionally, the second processing module is further configured to calculate, for each node, the absolute value of the difference between the score obtained for the node in the current iteration and the score obtained in the previous iteration, as the score difference of the node;
and, if every calculated score difference is less than a preset value, to determine that the preset convergence condition is met.
Optionally, the determining module is specifically configured to calculate, for each candidate word, the target weight of the candidate word according to its first weight, its second weight and a first preset formula, the first preset formula being:
W = α × P + β × S
where W denotes the target weight of the candidate word, P denotes the first weight of the candidate word, α denotes a first coefficient, S denotes the second weight of the candidate word, and β denotes a second coefficient;
and to select, according to the magnitude of the calculated target weights, a preset number of candidate words from the candidate words as the keywords of the text to be processed.
In a third aspect, to achieve the above purpose, an embodiment of the present invention discloses an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
The memory is configured to store a computer program;
The processor is configured to implement the steps of any of the keyword extraction methods described above when executing the program stored in the memory.
In another aspect, an embodiment of the present invention further provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the steps of any of the keyword extraction methods described above.
In yet another aspect, an embodiment of the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the steps of any of the keyword extraction methods described above.
With the keyword extraction method provided by the embodiments of the present invention, a target text library corresponding to the text type of the text to be processed can be obtained; a first weight is calculated for each candidate word of the text to be processed based on the target text library; a second weight is calculated for each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph; and the keywords of the text to be processed are determined from the candidate words based on the first weight and the second weight.
Based on the above processing, because the texts in the target text library have the same text type as the text to be processed, the first weight determined from the target text library effectively reflects whether each candidate word expresses the theme of the text to be processed. In addition, the second weight determined from the co-occurrence counts of the candidate words reflects the degree of association between candidate words. Combining the first weight and the second weight therefore yields more accurate keywords for the text to be processed.
Of course, a product or method implementing the present invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a keyword extraction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of an example of a keyword extraction method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a type prediction network model provided by an embodiment of the present invention;
Fig. 4 is a precision-recall curve comparison chart for a keyword extraction method provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of a keyword extraction device provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The keyword extraction method provided by the embodiments of the present invention is described in detail below through specific embodiments.
Referring to Fig. 1, Fig. 1 is a flowchart of a keyword extraction method provided by an embodiment of the present invention. The method can be applied to an electronic device, which may be a server or a terminal.
The method may include the following steps:
S101: Obtain the target text library corresponding to the text type of the text to be processed.
The texts in the target text library have the same text type as the text to be processed.
The text type of a text indicates the theme expressed by the content of the text. For example, the text type of a text may be social news, entertainment news or sports news, but is not limited to these.
The electronic device may preset multiple text libraries of different text types locally, where all texts in each text library have the same text type. The electronic device can then obtain, from the multiple text libraries, the text library whose text type is the same as that of the text to be processed, as the target text library.
Accordingly, before obtaining the target text library, the electronic device may also obtain the text content of the text to be processed and analyze it to determine the text type of the text to be processed.
In addition, to further improve the accuracy of the determined text type, the electronic device may determine the text type of the text to be processed according to a type prediction network model.
Optionally, before S101, the method may further include the following steps:
Step 1: Obtain the word vector corresponding to each candidate word as a candidate word vector.
In one implementation, the electronic device may obtain the text content of the text to be processed and perform word segmentation on it to obtain the candidate words of the text to be processed.
Specifically, the electronic device may split the text content of the text to be processed into multiple sentences and then perform word segmentation on each sentence, obtaining the words contained in the text to be processed (which may be called alternative words). The electronic device may then delete the preset stop words from the alternative words to obtain the candidate words of the text to be processed. Preset stop words are function words without substantive meaning, for example "then" and "secondly". It can be understood that the candidate words remaining after the preset stop words are deleted are usually nouns, verbs and words of similar parts of speech.
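The preprocessing steps above (sentence splitting, word segmentation, stop-word removal) can be sketched in Python as follows. This is a simplified illustration: it uses a regular-expression tokenizer suited to space-delimited text as a stand-in for a real word segmenter, and the stop-word list and the function name `extract_candidate_words` are hypothetical.

```python
import re

# Hypothetical preset stop-word list; a real list would be much longer.
STOP_WORDS = {"then", "secondly", "the", "a", "of"}


def extract_candidate_words(text, stop_words=STOP_WORDS):
    """Split text into sentences, tokenize each one, and drop stop words."""
    # Sentence splitting on common Western and Chinese sentence terminators.
    sentences = [s for s in re.split(r"[.!?;。！？；]+", text) if s.strip()]
    candidates = []
    for sentence in sentences:
        # Stand-in for word segmentation: runs of word characters, lowercased.
        alternative_words = re.findall(r"\w+", sentence.lower())
        candidates.extend(w for w in alternative_words if w not in stop_words)
    return candidates


if __name__ == "__main__":
    print(extract_candidate_words("Then the team won. Secondly, the fans cheered!"))
```

For Chinese text, the tokenizer line would be replaced by a call to a dedicated word segmentation tool; the rest of the flow stays the same.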
The electronic device may then map each candidate word according to a Word2Vec (word to vector) model to obtain the word vector corresponding to each candidate word (i.e. the candidate word vector).
Step 2: Determine the text type of the text to be processed according to the candidate word vectors and the pre-trained type prediction network model.
The type prediction network model is trained on a preset training set comprising multiple training samples. Each training sample comprises the word vectors corresponding to the candidate words of one sample text and the type distribution vector corresponding to that sample text, and the type distribution vector represents the probability that the text type of the sample text is each of the preset text types.
The type prediction network model may be an LSTM (Long Short-Term Memory) network model or another network model used for classification.
It can be understood that a sample text is a text whose text type has already been determined.
Before determining the text type of the text to be processed according to the pre-trained type prediction network model, the electronic device may process sample texts to generate the preset training set, and then train the type prediction network model on the preset training set.
In one implementation, the electronic device may perform word segmentation on the text content of a sample text to obtain the candidate words of the sample text, map these candidate words according to the Word2Vec model to obtain the corresponding word vectors, and determine the type distribution vector corresponding to the sample text according to the text type of the sample text and the preset text types.
For example, the preset text types may include text type A, text type B and text type C. If the text type of a sample text is text type A, the type distribution vector corresponding to the sample text is [1, 0, 0]; if the text type of the sample text is text type B, the type distribution vector corresponding to the sample text is [0, 1, 0].
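The type distribution vector of a labeled sample text is a one-hot vector over the preset text types. A minimal Python sketch, in which the function name `type_distribution_vector` and the type labels are illustrative:

```python
def type_distribution_vector(text_type, preset_types):
    """Return a one-hot vector marking text_type among preset_types."""
    if text_type not in preset_types:
        raise ValueError(f"unknown text type: {text_type!r}")
    return [1 if t == text_type else 0 for t in preset_types]


if __name__ == "__main__":
    preset = ["A", "B", "C"]
    print(type_distribution_vector("A", preset))
    print(type_distribution_vector("B", preset))
```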
The electronic device may then use the word vectors corresponding to the candidate words of the sample text as the input of the type prediction network model and the type distribution vector corresponding to the sample text as the corresponding output, and train the type prediction network model until it reaches the convergence condition, obtaining the trained type prediction network model.
The electronic device can then input the candidate word vectors into the trained type prediction network model.
The fully connected layer of the type prediction network model may concatenate the candidate word vectors using the np.concatenate (array concatenation) function to obtain multiple vectors corresponding to the candidate word vectors (which may be called first vectors), and pass the first vectors to the hidden layer of the type prediction network model.
The hidden layer of the type prediction network model may process the first vectors according to the order of the sentences in the text to be processed to obtain multiple vectors that represent the text to be processed (which may be called second vectors), and pass the second vectors to the mean pooling layer of the type prediction network model.
The mean pooling layer of the type prediction network model may apply mean pooling to the second vectors representing the text to be processed to obtain a text vector representing the text to be processed, and pass the text vector to the output layer of the type prediction network model.
The output layer of the type prediction network model may apply regression to the text vector representing the text to be processed according to the softmax (normalization) function, obtaining the type distribution vector corresponding to the text to be processed.
The electronic device can then take, according to the type distribution vector corresponding to the text to be processed, the preset text type with the highest probability as the text type of the text to be processed.
For example, the preset text types may be text type A, text type B and text type C. If the type distribution vector that the electronic device determines for the text to be processed from its candidate word vectors and the type prediction network model is [0.5, 0.7, 0.3], the electronic device can take text type B as the text type of the text to be processed.
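The last three stages (mean pooling, softmax, and choosing the most probable type) can be sketched in plain Python as follows. The helper names are hypothetical, and the hidden layer and the linear mapping that would precede the softmax in a trained model are omitted, so this only illustrates the arithmetic of each stage.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_type(distribution, preset_types):
    """Return the preset text type with the largest entry."""
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return preset_types[best]


if __name__ == "__main__":
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
    print(softmax([1.0, 2.0, 3.0]))
    print(predict_type([0.5, 0.7, 0.3], ["A", "B", "C"]))
```

Applied to the vector from the example, `predict_type` returns text type B. The selection rule depends only on which entry is largest.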
S102: Calculate, based on the target text library, a first weight for each candidate word of the text to be processed.
The first weight is determined from the word frequency of each candidate word in the text to be processed and its inverse document frequency in the target text library.
After obtaining the target text library corresponding to the text type of the text to be processed, the electronic device may, for each candidate word of the text to be processed, calculate the word frequency of the candidate word in the text to be processed and its inverse document frequency in the target text library, and from these calculate the first weight of the candidate word.
In one implementation, for each candidate word, the electronic device may calculate the word frequency of the candidate word in the text to be processed according to the word frequency formula:
$$tf = \frac{n}{\sum_{k=1}^{m} n_k} \quad (1)$$
where tf denotes the word frequency of the candidate word, n denotes the number of times the candidate word occurs in the text to be processed, m denotes the total number of candidate words of the text to be processed, n_k denotes the number of times the k-th candidate word occurs in the text to be processed, and $\sum_{k=1}^{m} n_k$ denotes the sum of the numbers of times all candidate words occur in the text to be processed.
The electronic device may then calculate the inverse document frequency of the candidate word in the target text library according to the inverse document frequency formula:
$$idf_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|} \quad (2)$$
where t_i denotes the candidate word, d_j denotes the j-th text in the target text library, idf_i denotes the inverse document frequency of the candidate word, |D| denotes the number of all texts in the target text library, and |{j : t_i ∈ d_j}| denotes the number of texts in the target text library that contain the candidate word. If no text in the target text library contains the candidate word, |{j : t_i ∈ d_j}| is zero; to avoid a division error, the divisor in formula (2) is therefore set to 1 + |{j : t_i ∈ d_j}|.
The electronic device may then calculate the first weight of the candidate word according to the first weight formula:
P = tf × idf_i (3)
where P denotes the first weight of the candidate word, tf denotes the word frequency of the candidate word, and idf_i denotes the inverse document frequency of the candidate word.
For example, suppose the candidate words of the text to be processed are candidate word A, candidate word B, candidate word C and candidate word D, which occur in the text to be processed 3, 7, 4 and 1 times respectively. From the number of times candidate word A occurs in the text to be processed (3), the sum of the numbers of times all candidate words occur in the text to be processed (15) and formula (1), the electronic device obtains the word frequency of candidate word A in the text to be processed: tf = 3/15 = 0.2.
If the target text library contains 100 texts, 9 of which contain candidate word A, then from the number of all texts in the target text library (100), the number of texts containing candidate word A (9) and formula (2), the electronic device obtains the inverse document frequency of candidate word A in the target text library: idf_i = 1.
From the word frequency of candidate word A in the text to be processed (0.2), its inverse document frequency in the target text library (1) and formula (3), the electronic device then obtains the first weight of candidate word A: P = 0.2.
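The first-weight calculation can be sketched in Python as follows. The function names are hypothetical, and the base-10 logarithm is an inference from the worked example (100 texts, 9 containing the word, inverse document frequency of 1), not something the text states outright.

```python
import math


def term_frequency(word, counts):
    """Occurrences of word divided by total occurrences of all candidates."""
    return counts[word] / sum(counts.values())


def inverse_document_frequency(word, library):
    """log10(|D| / (1 + number of library texts containing word)).

    library is a list of texts, each represented as a set of words.
    The logarithm base is an assumption inferred from the worked example.
    """
    containing = sum(1 for text in library if word in text)
    return math.log10(len(library) / (1 + containing))


def first_weight(word, counts, library):
    """First weight P = tf * idf."""
    return term_frequency(word, counts) * inverse_document_frequency(word, library)


if __name__ == "__main__":
    counts = {"A": 3, "B": 7, "C": 4, "D": 1}
    # 100 texts, 9 of which contain candidate word A (contents are made up).
    library = [{"A", "x"}] * 9 + [{"y"}] * 91
    print(term_frequency("A", counts))
    print(inverse_document_frequency("A", library))
    print(first_weight("A", counts, library))
```

With the counts from the example (3 occurrences out of 15, and 9 of 100 texts), the sketch gives a word frequency of 0.2, an inverse document frequency of 1 and a first weight of 0.2.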
In addition, to improve the accuracy of the calculated first weights, the electronic device may set an update cycle and update the texts in the target text library whenever the update cycle elapses. The update cycle can be set by a technician based on experience; for example, it may be 1 day or 2 days, but is not limited to these.
S103: Calculate a second weight for each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph.
The nodes in the candidate word graph correspond one-to-one to the candidate words.
In the embodiment of the present invention, the electronic device may determine, according to the length of a preset co-occurrence window, the co-occurrence count in the text to be processed of every two candidate words.
The length of the preset co-occurrence window can be set by a technician based on experience; for example, it may be 8 or 10, but is not limited to these.
Illustratively, the content of the text to be processed can be: "Life needs to be created by oneself, and one needs to plan life oneself and keep pace with the times; as writers of life, we must both listen to the voice of the times and hold sincerity in our hearts, without overstepping the bounds".
The electronic device can perform word segmentation on the text to be processed, and the candidate words obtained for the text to be processed include: "life/need/oneself/go/create/need/oneself/go/plan/life/keep pace with the times/we/as/life/writer/both/must/listen/times/voice/also/must/heart/have/sincerity/not/overstep the bounds".
If the length of the preset co-occurrence window is 10, sliding the co-occurrence window backward yields multiple co-occurrence windows:
[life, need, oneself, go, create, need, oneself, go, plan, life],
[need, oneself, go, create, need, oneself, go, plan, life, keep pace with the times],
……
[life, writer, both, must, listen, times, voice, also, must, heart],
……
[listen, times, voice, also, must, heart, have, sincerity, not, overstep the bounds].
When calculating the co-occurrence count, to avoid double counting, the first candidate word in a co-occurrence window can be taken as the reference, and the co-occurrence count between that candidate word and each other candidate word in the window is calculated. For example, the co-occurrence count of candidate word "life" and candidate word "need" in the text to be processed is 1, and the co-occurrence count of candidate word "life" and candidate word "create" in the text to be processed is 1.
If the co-occurrence count of two candidate words is not 0, the electronic device can determine that a co-occurrence edge exists between these two candidate words. The electronic device can then obtain the candidate word graph of the candidate words from their co-occurrence edges.
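The window-based counting described above can be sketched as follows. This is only one possible reading of the text: it counts ordered pairs (reference word, other word), counts each distinct partner once per window, and uses only full-length windows as in the listing. Under these assumptions it reproduces the count of 1 for both ("life", "need") and ("life", "create"). The token list is an English gloss of the example, and the function name is hypothetical.

```python
from collections import Counter

def cooccurrence_counts(tokens, window_len=10):
    """Count, per window, each distinct word that co-occurs with the window's first word."""
    counts = Counter()
    # Slide the window one token at a time. Only full-length windows are used,
    # except that a text shorter than the window yields one short window.
    for start in range(max(1, len(tokens) - window_len + 1)):
        window = tokens[start:start + window_len]
        if not window:
            break
        head = window[0]  # the first candidate word is the reference
        for other in set(window[1:]):  # set(): count a partner once per window
            if other != head:
                counts[(head, other)] += 1
    return counts

tokens = ["life", "need", "oneself", "go", "create", "need", "oneself", "go",
          "plan", "life", "keep pace with the times", "we", "as", "life",
          "writer", "both", "must", "listen", "times", "voice", "also", "must",
          "heart", "have", "sincerity", "not", "overstep the bounds"]
counts = cooccurrence_counts(tokens, window_len=10)
print(counts[("life", "need")], counts[("life", "create")])
```

Any pair with a non-zero count would then become a co-occurrence edge of the candidate word graph.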
Optionally, the electronic device can calculate the score of each node in the candidate word graph according to the co-occurrence count of every two candidate words in the text to be processed.
In one implementation, the electronic device can calculate the score of each node in the candidate word graph according to an iterative formula, whose symbols are defined as follows:
v_i denotes the i-th node in the candidate word graph, S(v_i) denotes the score of node v_i, d denotes the damping coefficient, In(v_i) denotes the set of nodes pointing to node v_i in the candidate word graph, Out(v_i) denotes the set of nodes that node v_i points to in the candidate word graph, v_j denotes the j-th node in the candidate word graph, W_ij denotes the co-occurrence count of the candidate word corresponding to node v_i and the candidate word corresponding to node v_j, S(v_j) denotes the score of node v_j, v_k denotes the k-th node in Out(v_i), and W_jk denotes the co-occurrence count of the candidate word corresponding to node v_j and the candidate word corresponding to node v_k. Because the candidate word graph constructed in the embodiment of the present invention is an undirected graph, In(v_i) and Out(v_i) denote the same node set. The damping coefficient d can take the value 0.85 or 0.7, but it is not limited to these values.
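The formula itself does not appear in this text. Judging from the symbol definitions, it is presumably the weighted TextRank update. A reconstruction under that assumption, with the inner sum taken over Out(v_j) as in standard TextRank, is:

```latex
S(v_i) = (1 - d) + d \times \sum_{v_j \in In(v_i)} \frac{W_{ij}}{\sum_{v_k \in Out(v_j)} W_{jk}} \, S(v_j)
```

Because the graph is undirected, W_ij = W_ji, so the direction of the index pair does not matter here.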
In addition, before calculating the score of each node according to the iterative formula, the electronic device can also set an initial score for each node in the candidate word graph. The initial score can be set by a technician based on experience; for example, it can be 1 or 2, but it is not limited to these values.
Optionally, after each iteration, the electronic device can calculate, for each node, the absolute value of the difference between the score obtained in the current iteration and the score obtained in the previous iteration, as the score difference of that node. The electronic device can then judge whether the score difference of each node is less than a preset value; if every calculated score difference is less than the preset value, the preset convergence condition is determined to be met.
The preset value can be set by a technician based on experience; for example, it can be 0.0001 or 0.00001, but it is not limited to these values.
When the preset convergence condition is met, the score of each node is taken as the second weight of the candidate word corresponding to that node.
After an iteration, if the electronic device determines that the score difference of every node is less than the preset value, the electronic device can take the score of each node after the current iteration as the second weight of the candidate word corresponding to that node.
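The iteration and convergence test can be sketched as below. The update rule is an assumption: it is the standard weighted TextRank form, which matches the symbol definitions but is not spelled out in this text. The graph, the function name, and the `max_iter` safeguard are made up for illustration. The initial score of 1, damping coefficient of 0.85, and preset value of 0.0001 are among the example values given above.

```python
def textrank_scores(weights, d=0.85, initial=1.0, tol=1e-4, max_iter=1000):
    """weights: dict mapping node -> {neighbour: co-occurrence count}, symmetric."""
    scores = {node: initial for node in weights}
    # Total co-occurrence weight on the edges of each node (the inner sum).
    out_sum = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    for _ in range(max_iter):
        new_scores = {}
        for vi, nbrs in weights.items():
            rank = sum(w / out_sum[vj] * scores[vj] for vj, w in nbrs.items())
            new_scores[vi] = (1 - d) + d * rank
        # Score difference of a node: |current score - previous score|.
        converged = all(abs(new_scores[n] - scores[n]) < tol for n in scores)
        scores = new_scores
        if converged:
            break
    return scores

# Hypothetical undirected candidate word graph with co-occurrence counts.
graph = {
    "A": {"B": 2, "C": 1},
    "B": {"A": 2, "C": 1},
    "C": {"A": 1, "B": 1},
}
second_weights = textrank_scores(graph)
print(second_weights)
```

The scores returned once every score difference falls below `tol` play the role of the second weights.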
S104: Based on the first weight and the second weight, determine the keywords of the text to be processed from the candidate words.
The electronic device can determine the keywords of the text to be processed from the candidate words, based on the first weight and the second weight, in a variety of ways. Optionally, S104 may include the following steps:
Step 1: For each candidate word, the electronic device can calculate the target weight of the candidate word according to the first weight of the candidate word, the second weight of the candidate word, and a first preset formula. The first preset formula is:
W = α × P + β × S    (5)
W denotes the target weight of the candidate word, P denotes the first weight of the candidate word, α denotes the first coefficient, S denotes the second weight of the candidate word, and β denotes the second coefficient. The sum of the first coefficient and the second coefficient is 1.
The first coefficient and the second coefficient can be set by a technician based on experience; for example, the first coefficient can be 0.4 and the second coefficient 0.6, or the first coefficient can be 0.2 and the second coefficient 0.8, but they are not limited to these values.
Illustratively, if the first coefficient is 0.4, the second coefficient is 0.6, the first weight of candidate word A is 3, and the second weight of candidate word A is 1, the electronic device can calculate the target weight of candidate word A according to formula (5) as: W_A = 0.4 × 3 + 0.6 × 1 = 1.8.
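Formula (5) is a plain weighted sum, as the following minimal sketch shows. The function name and the default coefficient values are chosen for illustration and simply reuse the numbers of the example.

```python
def target_weight(first_weight, second_weight, alpha=0.4, beta=0.6):
    # Formula (5): W = alpha * P + beta * S, where alpha + beta = 1.
    return alpha * first_weight + beta * second_weight

# Candidate word A from the example: P = 3, S = 1.
w_a = target_weight(first_weight=3, second_weight=1)
print(round(w_a, 6))
```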
Step 2: According to the size of each calculated target weight, select a preset number of candidate words from the candidate words as the keywords of the text to be processed.
The preset number can be set by a technician based on experience; for example, it can be 5 or 8, but it is not limited to these values.
In one implementation, the electronic device can sort the candidate words in descending order of target weight to obtain a candidate word sequence, and then take the first preset number of candidate words in the sequence as the keywords of the text to be processed.
Illustratively, the preset number can be 2. Suppose the candidate words of the text to be processed are candidate word A, candidate word B, candidate word C, and candidate word D, with target weights 1.3, 0.9, 2, and 1.7 respectively. The electronic device sorts the candidate words in descending order of target weight and obtains the candidate word sequence: candidate word C, candidate word D, candidate word A, candidate word B. The electronic device can then take candidate word C and candidate word D as the keywords of the text to be processed.
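The selection in Step 2 amounts to a descending sort followed by a cut-off, as in this small sketch (the function name is illustrative):

```python
def select_keywords(target_weights, preset_number):
    # Sort candidate words by target weight, largest first, and keep the top ones.
    ranked = sorted(target_weights, key=target_weights.get, reverse=True)
    return ranked[:preset_number]

# Target weights from the example above.
weights = {"A": 1.3, "B": 0.9, "C": 2, "D": 1.7}
keywords = select_keywords(weights, preset_number=2)
print(keywords)
```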
Based on the above processing, because the texts contained in the target text library have the same text type as the text to be processed, the first weight determined from the target text library can effectively reflect whether each candidate word expresses the theme of the text to be processed. In addition, the second weight determined from the co-occurrence counts of the candidate words can reflect the degree of association between candidate words. Combining the first weight and the second weight therefore yields keywords for the text to be processed with relatively high accuracy.
Referring to Fig. 2, Fig. 2 is an example flowchart of a keyword extraction method provided by an embodiment of the present invention. The method may include the following steps:
S201: Perform word segmentation on the text content of the text to be processed to obtain the candidate words of the text to be processed.
S202: Map each candidate word according to a word-to-vector model to obtain the word vector of each candidate word, as a candidate word vector.
S203: Determine the text type of the text to be processed according to the candidate word vectors and a pre-trained type prediction network model.
The type prediction network model can be an LSTM network model.
S204: Obtain the target text library corresponding to the text type of the text to be processed.
The texts contained in the target text library have the same text type as the text to be processed.
S205: Based on the target text library, calculate the first weight of each candidate word of the text to be processed.
The first weight is determined according to the word frequency of each candidate word in the text to be processed and its inverse document frequency in the target text library.
S206: Calculate the score of each node according to the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph and an iterative formula.
The nodes in the candidate word graph correspond one-to-one with the candidate words, and the symbols of the iterative formula are defined as follows:
v_i denotes the i-th node in the candidate word graph, S(v_i) denotes the score of node v_i, d denotes the damping coefficient, In(v_i) denotes the set of nodes pointing to node v_i in the candidate word graph, Out(v_i) denotes the set of nodes that node v_i points to in the candidate word graph, v_j denotes the j-th node in the candidate word graph, W_ij denotes the co-occurrence count of the candidate word corresponding to node v_i and the candidate word corresponding to node v_j, S(v_j) denotes the score of node v_j, v_k denotes the k-th node in Out(v_i), and W_jk denotes the co-occurrence count of the candidate word corresponding to node v_j and the candidate word corresponding to node v_k.
S207: For each node, calculate the absolute value of the difference between the score obtained in the current iteration and the score obtained in the previous iteration, as the score difference of that node; if the score difference of every node is less than the preset value, determine that the convergence condition is met.
S208: When the convergence condition is met, take the score of each node as the second weight of the candidate word corresponding to that node.
S209: For each candidate word, calculate the target weight of the candidate word according to the first weight of the candidate word, the second weight of the candidate word, and the first preset formula.
The first preset formula is:
W = α × P + β × S
W denotes the target weight of the candidate word, P denotes the first weight of the candidate word, α denotes the first coefficient, S denotes the second weight of the candidate word, and β denotes the second coefficient.
S2010: According to the size of each calculated target weight, select a preset number of candidate words from the candidate words as the keywords of the text to be processed.
Referring to Fig. 3, Fig. 3 is a structural diagram of a type prediction network model provided by an embodiment of the present invention. The type prediction network model includes: an input layer, a fully connected layer, a hidden layer, a mean pooling layer, and an output layer.
The electronic device can input the candidate word vectors into the type prediction network model through the input layer.
The fully connected layer can splice the candidate word vectors using the np.concatenate function to obtain multiple vectors corresponding to the candidate word vectors (i.e. the first vectors).
The hidden layer can process the first vectors according to the order of the sentences in the text to be processed, to obtain multiple vectors that can represent the text to be processed (i.e. the second vectors).
The mean pooling layer can apply mean pooling to the multiple second vectors representing the text to be processed, to obtain a text vector representing the text to be processed.
The output layer can apply regression to the text vector representing the text to be processed according to the softmax function, to obtain the type distribution vector corresponding to the text to be processed.
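The last two layers can be illustrated with a minimal pure-Python sketch. It covers only mean pooling and the softmax output and takes the hidden-layer outputs as given, so the LSTM itself is not modelled. The linear projection before the softmax, the weight values, and all names are assumptions made for illustration and do not come from the patent.

```python
import math

def mean_pool(vectors):
    # Average the second vectors element-wise into a single text vector.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def type_distribution(hidden_states, out_weights, out_bias):
    text_vector = mean_pool(hidden_states)
    # Assumed linear projection from the text vector to one logit per text type.
    logits = [sum(w * x for w, x in zip(row, text_vector)) + b
              for row, b in zip(out_weights, out_bias)]
    return softmax(logits)

hidden_states = [[0.2, 0.4], [0.6, 0.0], [0.1, 0.5]]  # hypothetical second vectors
out_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # hypothetical, 3 text types
out_bias = [0.0, 0.0, 0.0]
probs = type_distribution(hidden_states, out_weights, out_bias)
print(probs)
```

Each entry of `probs` is the probability that the text belongs to one of the preset text types, which is the role of the type distribution vector.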
In addition, to distinguish it from the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm and the TextRank (text ranking) algorithm in the prior art, the keyword extraction algorithm provided by the embodiment of the present invention can be called the TF-TR algorithm. An experiment was carried out using the keywords extracted from 100 texts together with the keywords that 5 people annotated for the same 100 texts. Comparing the experimental results of the three algorithms yields Fig. 4 and Table 1.
Referring to Fig. 4, Fig. 4 is a comparison diagram of the precision-recall curves of a keyword extraction method provided by an embodiment of the present invention. The solid line with five-pointed stars shows the relationship between the precision-recall of the TF-IDF algorithm and the number of extracted keywords, the solid line with crosses shows the relationship between the precision-recall of the TextRank algorithm and the number of extracted keywords, and the solid line with line segments shows the relationship between the precision-recall of the TF-TR algorithm and the number of extracted keywords.
The precision is calculated by a formula in which:
Precision denotes the precision, N denotes the number of people, TP denotes the number of keywords determined by the algorithm that the h-th person also judged to be keywords, and FP denotes the number of keywords determined by the algorithm that the h-th person judged to be non-keywords.
The recall is calculated by a formula in which:
Recall denotes the recall, N denotes the number of people, TP denotes the number of keywords determined by the algorithm that the h-th person also judged to be keywords, and FN denotes the number of words determined by the algorithm to be non-keywords that the h-th person judged to be keywords.
Referring to Table 1, Table 1 is an F1 value comparison table of a keyword extraction method provided by an embodiment of the present invention. The F1 value is a parameter that expresses the balance between precision and recall, and it is calculated by a formula in which:
Precision denotes the precision, and Recall denotes the recall.
Table 1
Algorithm | TF-IDF | TextRank | TF-TR |
F1 | 0.831456 | 0.823456 | 0.851383 |
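The three metrics can be sketched as follows. The formulas themselves do not appear in this text, so the sketch assumes that precision and recall are averaged over the N people, as the definitions of N and of the h-th person suggest, and that F1 is the usual harmonic mean of precision and recall. The counts are made up for illustration and are unrelated to Table 1.

```python
def averaged_precision_recall(per_person):
    """per_person: list of (TP, FP, FN) tuples, one tuple per annotating person."""
    n = len(per_person)
    # Assumed form: per-person precision and recall, averaged over the N people.
    precision = sum(tp / (tp + fp) for tp, fp, fn in per_person) / n
    recall = sum(tp / (tp + fn) for tp, fp, fn in per_person) / n
    return precision, recall

def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

counts = [(4, 1, 1), (3, 2, 1)]  # made-up (TP, FP, FN) for two people
p, r = averaged_precision_recall(counts)
f1 = f1_score(p, r)
print(p, r, f1)
```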
Referring to Fig. 5, an embodiment of the present invention provides a keyword extraction device, and the device includes:
an obtaining module 501, configured to obtain the target text library corresponding to the text type of the text to be processed, wherein the texts contained in the target text library have the same text type as the text to be processed;
a first processing module 502, configured to calculate, based on the target text library, the first weight of each candidate word of the text to be processed, wherein the first weight is determined according to the word frequency of each candidate word in the text to be processed and its inverse document frequency in the target text library;
a second processing module 503, configured to calculate the second weight of each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph, wherein the nodes in the candidate word graph correspond one-to-one with the candidate words;
a determining module 504, configured to determine the keywords of the text to be processed from the candidate words based on the first weight and the second weight.
Optionally, the device further includes:
a third processing module, configured to obtain the word vector corresponding to each candidate word, as a candidate word vector;
and to determine the text type of the text to be processed according to the candidate word vectors and a pre-trained type prediction network model;
wherein the type prediction network model is obtained by training on a preset training set, the preset training set includes multiple training samples, and one training sample includes the word vectors corresponding to the candidate words of one sample text and the type distribution vector corresponding to that sample text, the type distribution vector being used to indicate the probability that the text type of the sample text is each of the preset text types.
Optionally, the second processing module 503 is specifically configured to calculate the score of each node in the candidate word graph according to an iterative formula, whose symbols are defined as follows:
v_i denotes the i-th node in the candidate word graph, S(v_i) denotes the score of the node v_i, d denotes the damping coefficient, In(v_i) denotes the set of nodes pointing to the node v_i in the candidate word graph, Out(v_i) denotes the set of nodes that the node v_i points to in the candidate word graph, v_j denotes the j-th node in the candidate word graph, W_ij denotes the co-occurrence count of the candidate word corresponding to the node v_i and the candidate word corresponding to the node v_j, S(v_j) denotes the score of the node v_j, v_k denotes the k-th node in Out(v_i), and W_jk denotes the co-occurrence count of the candidate word corresponding to the node v_j and the candidate word corresponding to the node v_k;
and, when the preset convergence condition is met, to take the score of each node as the second weight of the candidate word corresponding to that node.
Optionally, the second processing module is further configured to calculate, for each node, the absolute value of the difference between the score obtained in the current iteration and the score obtained in the previous iteration, as the score difference of that node;
and, if every calculated score difference is less than the preset value, to determine that the preset convergence condition is met.
Optionally, the determining module 504 is specifically configured to calculate, for each candidate word, the target weight of the candidate word according to the first weight of the candidate word, the second weight of the candidate word, and a first preset formula, the first preset formula being:
W = α × P + β × S
W denotes the target weight of the candidate word, P denotes the first weight of the candidate word, α denotes the first coefficient, S denotes the second weight of the candidate word, and β denotes the second coefficient;
and, according to the size of each calculated target weight, to select a preset number of candidate words from the candidate words as the keywords of the text to be processed.
An embodiment of the present invention also provides an electronic device which, as shown in Fig. 6, includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, wherein the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604;
the memory 603 is configured to store a computer program;
the processor 601 is configured to implement the following steps when executing the program stored in the memory 603:
Obtain the target text library corresponding to the text type of the text to be processed, wherein the texts contained in the target text library have the same text type as the text to be processed;
Based on the target text library, calculate the first weight of each candidate word of the text to be processed, wherein the first weight is determined according to the word frequency of each candidate word in the text to be processed and its inverse document frequency in the target text library;
Based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph, calculate the second weight of each candidate word, wherein the nodes in the candidate word graph correspond one-to-one with the candidate words;
Based on the first weight and the second weight, determine the keywords of the text to be processed from the candidate words.
It should be noted that the other implementations of the above keyword extraction method are the same as in the foregoing method embodiments and are not repeated here.
The communication bus mentioned for the above electronic device can be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) and may also include non-volatile memory (Non-Volatile Memory, NVM), for example at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor can be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it can also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Based on the above processing, because the texts contained in the target text library have the same text type as the text to be processed, the first weight determined from the target text library can effectively reflect whether each candidate word expresses the theme of the text to be processed. In addition, the second weight determined from the co-occurrence counts of the candidate words can reflect the degree of association between candidate words. Combining the first weight and the second weight therefore yields keywords for the text to be processed with relatively high accuracy.
In another embodiment provided by the present invention, a computer-readable storage medium is also provided. A computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of any of the above keyword extraction methods are implemented.
Specifically, the above method includes:
Obtain the target text library corresponding to the text type of the text to be processed, wherein the texts contained in the target text library have the same text type as the text to be processed;
Based on the target text library, calculate the first weight of each candidate word of the text to be processed, wherein the first weight is determined according to the word frequency of each candidate word in the text to be processed and its inverse document frequency in the target text library;
Based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph, calculate the second weight of each candidate word, wherein the nodes in the candidate word graph correspond one-to-one with the candidate words;
Based on the first weight and the second weight, determine the keywords of the text to be processed from the candidate words.
It should be noted that the other implementations of the above keyword extraction method are the same as in the foregoing method embodiments and are not repeated here.
Based on the above processing, because the texts contained in the target text library have the same text type as the text to be processed, the first weight determined from the target text library can effectively reflect whether each candidate word expresses the theme of the text to be processed. In addition, the second weight determined from the co-occurrence counts of the candidate words can reflect the degree of association between candidate words. Combining the first weight and the second weight therefore yields keywords for the text to be processed with relatively high accuracy.
In another embodiment provided by the present invention, a computer program product containing instructions is also provided. When it runs on a computer, it causes the computer to execute the steps of any keyword extraction method in the above embodiments.
Specifically, the above method includes:
Obtain the target text library corresponding to the text type of the text to be processed, wherein the texts contained in the target text library have the same text type as the text to be processed;
Based on the target text library, calculate the first weight of each candidate word of the text to be processed, wherein the first weight is determined according to the word frequency of each candidate word in the text to be processed and its inverse document frequency in the target text library;
Based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph, calculate the second weight of each candidate word, wherein the nodes in the candidate word graph correspond one-to-one with the candidate words;
Based on the first weight and the second weight, determine the keywords of the text to be processed from the candidate words.
It should be noted that the other implementations of the above keyword extraction method are the same as in the foregoing method embodiments and are not repeated here.
Based on the above processing, because the texts contained in the target text library have the same text type as the text to be processed, the first weight determined from the target text library can effectively reflect whether each candidate word expresses the theme of the text to be processed. In addition, the second weight determined from the co-occurrence counts of the candidate words can reflect the degree of association between candidate words. Combining the first weight and the second weight therefore yields keywords for the text to be processed with relatively high accuracy.
The above embodiments can be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced wholly or partly. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) means. The computer-readable storage medium can be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium can be a magnetic medium (e.g. floppy disk, hard disk, magnetic tape), an optical medium (e.g. DVD), or a semiconductor medium (e.g. a solid state disk (Solid State Disk, SSD)).
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device, electronic device, computer-readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments.
The above are only preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A keyword extraction method, characterized in that the method comprises:
obtaining the target text library corresponding to the text type of the text to be processed, wherein the texts contained in the target text library have the same text type as the text to be processed;
based on the target text library, calculating the first weight of each candidate word of the text to be processed, wherein the first weight is determined according to the word frequency of each candidate word in the text to be processed and its inverse document frequency in the target text library;
based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph, calculating the second weight of each candidate word, wherein the nodes in the candidate word graph correspond one-to-one with the candidate words;
based on the first weight and the second weight, determining the keywords of the text to be processed from the candidate words.
2. The method according to claim 1, characterized in that, before obtaining the target text library corresponding to the text type of the text to be processed, the method further comprises:
obtaining the word vector corresponding to each candidate word as a candidate word vector;
determining the text type of the text to be processed according to the candidate word vectors and a pre-trained type prediction network model;
wherein the type prediction network model is trained on a preset training set, the preset training set includes a plurality of training samples, each training sample includes the word vectors corresponding to the candidate words of one sample text and the type distribution vector corresponding to that sample text, and the type distribution vector indicates the probability that the text type of the sample text is each of the preset text types.
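As a rough illustration of how a type distribution vector yields a text type, the sketch below averages the candidate word vectors, applies one linear layer followed by softmax, and picks the most probable type. The claim does not specify the network architecture, so the single-layer model, the averaging step, and the names `predict_text_type`, `weight_rows`, `biases` and `type_names` are all assumptions.

```python
import math


def predict_text_type(candidate_vectors, weight_rows, biases, type_names):
    """Pick the most probable text type from candidate word vectors.

    Assumption: the trained model is stood in for by a single linear layer
    followed by softmax, applied to the mean of the candidate word vectors.
    """
    dim = len(candidate_vectors[0])
    mean = [sum(vec[i] for vec in candidate_vectors) / len(candidate_vectors)
            for i in range(dim)]
    logits = [sum(w * x for w, x in zip(row, mean)) + b
              for row, b in zip(weight_rows, biases)]
    # Numerically stable softmax produces the type distribution vector.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    distribution = [e / total for e in exps]
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return type_names[best], distribution
```

The returned label would then select the target text library of the same text type.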
3. The method according to claim 1, characterized in that calculating the second weight of each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in the candidate word graph comprises:
calculating the score of each node in the candidate word graph according to an iterative formula, wherein the iterative formula is:
(formula not reproduced in this text)
where v_i denotes the i-th node in the candidate word graph, S(v_i) denotes the score of node v_i, d denotes the damping coefficient, In(v_i) denotes the set of nodes in the candidate word graph that point to node v_i, Out(v_i) denotes the set of nodes in the candidate word graph that node v_j points to, v_j denotes the j-th node in the candidate word graph, W_ij denotes the co-occurrence count of the candidate word corresponding to node v_i and the candidate word corresponding to node v_j, S(v_j) denotes the score of node v_j, v_k denotes the k-th node in Out(v_i), and W_jk denotes the co-occurrence count of the candidate word corresponding to node v_j and the candidate word corresponding to node v_k;
when a preset convergence condition is met, taking the score of each node as the second weight of the candidate word corresponding to that node.
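The iterative formula of claim 3 does not appear in this text. Its symbol definitions (damping coefficient d, In and Out sets, co-occurrence weights W) match the weighted TextRank update, so one plausible reading, offered as an assumption rather than as the patent's verbatim formula, is:

```latex
S(v_i) = (1 - d) + d \sum_{v_j \in In(v_i)} \frac{W_{ji}}{\sum_{v_k \in Out(v_j)} W_{jk}} \, S(v_j)
```

Because co-occurrence counts are symmetric, W_ji equals W_ij here. Note that the standard form normalizes over Out(v_j), which agrees with the weight W_jk, whereas the definition list in the claim writes Out(v_i).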
4. The method according to claim 3, characterized in that, before taking the score of each node as the second weight of the candidate word corresponding to that node, the method further comprises:
for each node, calculating the absolute value of the difference between the score obtained for the node in the current iteration and the score obtained in the previous iteration, as the score difference of the node;
if every calculated score difference is less than a preset value, determining that the preset convergence condition is met.
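The iteration and the convergence test of claims 3 and 4 can be sketched as follows. The update rule is the standard weighted TextRank rule, which is assumed here because the patent's formula is not reproduced in this text; the function name `second_weights`, the default values of `d`, `tol` and `max_iter`, and the undirected treatment of the graph are likewise assumptions.

```python
def second_weights(cooccurrence, d=0.85, tol=1e-6, max_iter=200):
    """Iterate weighted TextRank scores until every score change is below tol.

    cooccurrence: dict mapping each word to {neighbour: co-occurrence count};
    the graph is treated as undirected, so In(v) and Out(v) coincide and
    every neighbour must also appear as a key.
    """
    scores = {word: 1.0 for word in cooccurrence}
    # Total outgoing weight of each node, used to normalise its contribution.
    out_sum = {word: sum(nbrs.values()) for word, nbrs in cooccurrence.items()}
    for _ in range(max_iter):
        updated = {}
        for word, nbrs in cooccurrence.items():
            rank = sum(count / out_sum[other] * scores[other]
                       for other, count in nbrs.items() if out_sum[other] > 0)
            updated[word] = (1 - d) + d * rank
        # Claim 4: converged when every per-node score difference is small.
        converged = all(abs(updated[w] - scores[w]) < tol for w in scores)
        scores = updated
        if converged:
            break
    return scores
```

The `max_iter` cap is a safeguard that the claims do not mention; it only bounds the loop if the preset value is never reached.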
5. The method according to claim 1, characterized in that determining the keywords of the text to be processed from the candidate words based on the first weights and the second weights comprises:
for each candidate word, calculating the target weight of the candidate word according to the first weight of the candidate word, the second weight of the candidate word, and a first preset formula, wherein the first preset formula is:
W = α × P + β × S
where W denotes the target weight of the candidate word, P denotes the first weight of the candidate word, α denotes a first coefficient, S denotes the second weight of the candidate word, and β denotes a second coefficient;
selecting a preset number of candidate words from the candidate words according to the magnitude of the calculated target weights, as the keywords of the text to be processed.
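The weighted combination and top-N selection of claim 5 can be sketched in a few lines. The function name `select_keywords` and the default values of `alpha`, `beta` and `top_n` are assumptions; the claim does not fix the coefficients or the number of keywords.

```python
def select_keywords(first, second, alpha=0.5, beta=0.5, top_n=3):
    """Rank candidates by W = alpha * P + beta * S and keep the top_n words.

    first, second: dicts mapping each candidate word to its first weight P
    and second weight S respectively.
    """
    target = {word: alpha * first[word] + beta * second[word] for word in first}
    ranked = sorted(target, key=lambda word: target[word], reverse=True)
    return ranked[:top_n]
```

The claim does not say whether P and S are normalized before they are combined. If their scales differ widely, the choice of α and β effectively decides which weight dominates.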
6. A keyword extraction device, characterized in that the device comprises:
an obtaining module, configured to obtain a target text library corresponding to the text type of a text to be processed, wherein the texts included in the target text library have the same text type as the text to be processed;
a first processing module, configured to calculate, based on the target text library, a first weight for each candidate word of the text to be processed, wherein the first weight is determined according to the term frequency of the candidate word in the text to be processed and its inverse document frequency in the target text library;
a second processing module, configured to calculate a second weight for each candidate word based on the co-occurrence count of the candidate words corresponding to every two nodes in a candidate word graph, wherein the nodes in the candidate word graph correspond one-to-one to the candidate words;
a determining module, configured to determine the keywords of the text to be processed from the candidate words based on the first weights and the second weights.
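The second processing module relies on co-occurrence counts between candidate words, but the claims do not state how these are counted. A common choice, assumed here, is a sliding window over the candidate word sequence; the function name `build_cooccurrence` and the `window` parameter are hypothetical.

```python
from collections import defaultdict


def build_cooccurrence(candidates, window=3):
    """Count how often two distinct candidate words fall within one window.

    candidates: candidate words in their order of appearance in the text.
    window: number of consecutive words treated as co-occurring (assumed).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(candidates):
        # Pair the word with the words that follow it inside the window.
        for other in candidates[i + 1:i + window]:
            if other != word:
                # Store both directions so the graph is symmetric.
                graph[word][other] += 1
                graph[other][word] += 1
    return {word: dict(nbrs) for word, nbrs in graph.items()}
```

Each key of the returned dictionary corresponds to one node of the candidate word graph, and each nested count is the weight of the edge between two nodes. A word that never co-occurs with a different word does not appear in the result.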
7. The device according to claim 6, characterized in that the device further comprises:
a third processing module, configured to obtain the word vector corresponding to each candidate word as a candidate word vector;
and to determine the text type of the text to be processed according to the candidate word vectors and a pre-trained type prediction network model;
wherein the type prediction network model is trained on a preset training set, the preset training set includes a plurality of training samples, each training sample includes the word vectors corresponding to the candidate words of one sample text and the type distribution vector corresponding to that sample text, and the type distribution vector indicates the probability that the text type of the sample text is each of the preset text types.
8. The device according to claim 6, characterized in that the second processing module is specifically configured to calculate the score of each node in the candidate word graph according to an iterative formula, wherein the iterative formula is:
(formula not reproduced in this text)
where v_i denotes the i-th node in the candidate word graph, S(v_i) denotes the score of node v_i, d denotes the damping coefficient, In(v_i) denotes the set of nodes in the candidate word graph that point to node v_i, Out(v_i) denotes the set of nodes in the candidate word graph that node v_i points to, v_j denotes the j-th node in the candidate word graph, W_ij denotes the co-occurrence count of the candidate word corresponding to node v_i and the candidate word corresponding to node v_j, S(v_j) denotes the score of node v_j, v_k denotes the k-th node in Out(v_i), and W_jk denotes the co-occurrence count of the candidate word corresponding to node v_j and the candidate word corresponding to node v_k;
and, when a preset convergence condition is met, to take the score of each node as the second weight of the candidate word corresponding to that node.
9. The device according to claim 8, characterized in that the second processing module is further configured to calculate, for each node, the absolute value of the difference between the score obtained for the node in the current iteration and the score obtained in the previous iteration, as the score difference of the node;
and, if every calculated score difference is less than a preset value, to determine that the preset convergence condition is met.
10. The device according to claim 6, characterized in that the determining module is specifically configured to calculate, for each candidate word, the target weight of the candidate word according to the first weight of the candidate word, the second weight of the candidate word, and a first preset formula, wherein the first preset formula is:
W = α × P + β × S
where W denotes the target weight of the candidate word, P denotes the first weight of the candidate word, α denotes a first coefficient, S denotes the second weight of the candidate word, and β denotes a second coefficient;
and to select a preset number of candidate words from the candidate words according to the magnitude of the calculated target weights, as the keywords of the text to be processed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910789844.1A CN110489757A (en) | 2019-08-26 | 2019-08-26 | A kind of keyword extracting method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110489757A (en) | 2019-11-22 |
Family
ID=68554062
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910789844.1A Pending CN110489757A (en) | 2019-08-26 | 2019-08-26 | A kind of keyword extracting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110489757A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126060A (en) * | 2019-12-24 | 2020-05-08 | 东软集团股份有限公司 | Method, device and equipment for extracting subject term and storage medium |
CN111597310A (en) * | 2020-05-26 | 2020-08-28 | 成都卫士通信息产业股份有限公司 | Sensitive content detection method, device, equipment and medium |
CN111737553A (en) * | 2020-06-16 | 2020-10-02 | 苏州朗动网络科技有限公司 | Method and device for selecting enterprise associated words and storage medium |
CN112347790A (en) * | 2020-11-06 | 2021-02-09 | 北京乐学帮网络技术有限公司 | Text processing method and device, computer equipment and storage medium |
CN113051890A (en) * | 2019-12-27 | 2021-06-29 | 北京国双科技有限公司 | Method for processing domain feature keywords and related device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170139899A1 (en) * | 2015-11-18 | 2017-05-18 | Le Holdings (Beijing) Co., Ltd. | Keyword extraction method and electronic device |
CN108228566A (en) * | 2018-01-12 | 2018-06-29 | 中译语通科技股份有限公司 | More document keyword Automatic method and system, computer program |
CN109710916A (en) * | 2018-11-02 | 2019-05-03 | 武汉斗鱼网络科技有限公司 | A kind of tag extraction method, apparatus, electronic equipment and storage medium |
CN109918660A (en) * | 2019-03-04 | 2019-06-21 | 北京邮电大学 | A kind of keyword extracting method and device based on TextRank |
CN110008401A (en) * | 2019-02-21 | 2019-07-12 | 北京达佳互联信息技术有限公司 | Keyword extracting method, keyword extracting device and computer readable storage medium |
CN110083835A (en) * | 2019-04-24 | 2019-08-02 | 北京邮电大学 | A kind of keyword extracting method and device based on figure and words and phrases collaboration |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126060A (en) * | 2019-12-24 | 2020-05-08 | 东软集团股份有限公司 | Method, device and equipment for extracting subject term and storage medium |
CN113051890A (en) * | 2019-12-27 | 2021-06-29 | 北京国双科技有限公司 | Method for processing domain feature keywords and related device |
CN111597310A (en) * | 2020-05-26 | 2020-08-28 | 成都卫士通信息产业股份有限公司 | Sensitive content detection method, device, equipment and medium |
CN111597310B (en) * | 2020-05-26 | 2023-10-20 | 成都卫士通信息产业股份有限公司 | Sensitive content detection method, device, equipment and medium |
CN111737553A (en) * | 2020-06-16 | 2020-10-02 | 苏州朗动网络科技有限公司 | Method and device for selecting enterprise associated words and storage medium |
CN112347790A (en) * | 2020-11-06 | 2021-02-09 | 北京乐学帮网络技术有限公司 | Text processing method and device, computer equipment and storage medium |
CN112347790B (en) * | 2020-11-06 | 2024-01-16 | 北京乐学帮网络技术有限公司 | Text processing method, device, computer equipment and storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110489757A (en) | A kind of keyword extracting method and device | |
US11227118B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
WO2020042925A1 (en) | Man-machine conversation method and apparatus, electronic device, and computer readable medium | |
CN105022754B (en) | Object classification method and device based on social network | |
CN110121705A (en) | Pragmatics principle is applied to the system and method interacted with visual analysis | |
CN103870001B (en) | A kind of method and electronic device for generating candidates of input method | |
CN109739978A (en) | A kind of Text Clustering Method, text cluster device and terminal device | |
US20230076387A1 (en) | Systems and methods for providing a comment-centered news reader | |
WO2014126657A1 (en) | Latent semantic analysis for application in a question answer system | |
CN108304375A (en) | A kind of information identifying method and its equipment, storage medium, terminal | |
CN112860866A (en) | Semantic retrieval method, device, equipment and storage medium | |
US10169452B2 (en) | Natural language interpretation of hierarchical data | |
CN107992477A (en) | Text subject determines method, apparatus and electronic equipment | |
CN110321561B (en) | Keyword extraction method and device | |
CN109726289A (en) | Event detecting method and device | |
CN107341233A (en) | A kind of position recommends method and computing device | |
Chatterjee et al. | Single document extractive text summarization using genetic algorithms | |
CN107220384A (en) | A kind of search word treatment method, device and computing device based on correlation | |
JP2022068120A (en) | Method for processing chat channel communications, chat channel processing system, and program (intelligent chat channel processor) | |
US10198497B2 (en) | Search term clustering | |
CN110222194A (en) | Data drawing list generation method and relevant apparatus based on natural language processing | |
CN103955480B (en) | A kind of method and apparatus for determining the target object information corresponding to user | |
CN106663123B (en) | Comment-centric news reader | |
KR101494795B1 (en) | Method for representing document as matrix | |
KR101931624B1 (en) | Trend Analyzing Method for Fassion Field and Storage Medium Having the Same |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191122 |