CN102222119B - Automatic personalized abstracting method in digital library system
- Publication number: CN102222119B (application CN201110213750A)
- Authority: CN (China)
- Priority/filing date: 2011-07-28; published as CN102222119A on 2011-10-19; granted as CN102222119B on 2013-04-17
- Prior art keywords: document, correlation, model, keyword, probability
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses an automatic personalized abstracting method for a digital library system, in the technical field of information processing. The method comprises the following steps: a. input query information; b. build a relevance model and an irrelevance model from the input query information; c. for each word in the document whose summary is to be extracted, compute the probability that the word is generated under the relevance model and under the irrelevance model; d. save the relevance score of each keyword into a queue; e. select a run of consecutive keywords in the queue, sum their relevance scores, and take the document fragment with the highest total as one piece of the document summary; f. judge from the size of a threshold whether to continue looking for the next summary fragment; g. if so, repeat step e; otherwise return all documents in the summary data set as the summary information. Compared with conventional summarization algorithms, the method extracts article summaries with higher accuracy; moreover, under conditions simulating real data, the method is highly robust to interference.
Description
Technical field
The present invention relates to the technical field of information processing, and more particularly to an automatic personalized abstracting method in a digital library system.
Background art
Query-based automatic summarization returns, for a given document, one or more summary fragments associated with the query; after a text collection is created or updated, each document is automatically divided into a number of discrete summary fragments.
In current automatic summarization processing, one approach estimates the summary length in advance from documents related to the current one; with this approximate summary size fixed, it searches for the fragment of the designated length that best matches the query and returns it as the article summary.
Another approach first cuts the document into one or more semantic information blocks in a preprocessing step. Once the semantic blocks are determined, the relevance between the query statement and each semantic block is computed, and the block with the highest relevance to the query statement, i.e. the one best covering the document's main information, is selected as the document summary.
However, in the first approach the length of the summary information is difficult to determine in advance. In the second approach, the preprocessing fixes the start and end positions of the summary fragments, so if the document's main information is spread over several different segments, the extracted summary covers the main information poorly. For example, a document may be cut into several non-overlapping fragments, but such cutting has a latent problem: when the best document summary needs to cover the content of two adjacent fragments, the preprocessing has already separated those fragments, and the automatically extracted summary is incomplete.
For example, Chinese patent publication CN 101231634, published on July 30, 2008, discloses a method that uses graph partitioning to extract multi-document summaries automatically, comprising the following steps: cut sentence boundaries and represent the document by the extracted sentences; express each sentence as a vector, compute the pairwise similarity between sentences to build a sentence incidence matrix, reduce the matrix by a specified threshold, and normalize it; introduce mining of the latent logical topic structure into multi-document summarization, dividing the document set into different latent sub-topics by theme, so that the summarization task becomes selection and extraction over sub-topics; the graph-partition method guarantees, from the global perspective, the importance of the sub-topic containing each sentence, and, from the local perspective, low content redundancy between different sub-topics, thereby effectively improving summary quality.
However, the prior art represented by the above patent document still has the following technical problem: CN 101231634 determines weight vectors per sentence, so the summary information is segmented by sentence, and the extracted summary again covers the document's main information poorly.
Summary of the invention
To solve the above technical problem, the present invention proposes an automatic personalized abstracting method for a digital library system. The method solves the technical problem in the above prior art that "the extracted summary covers the document's main information poorly". Moreover, it does not fix the length of the summary information, so summaries are obtained flexibly; when extracting a document summary it judges well the relevance between a document fragment and the query; the extracted summary is robust to interference; and the article summary obtained with this method is more accurate than the article summary obtained with traditional summarization algorithms.
The present invention is realized by adopting the following technical solution:
An automatic personalized abstracting method in a digital library system, characterized by comprising the steps:
a. Input query information, the query information comprising keywords and the user's personalization information;
b. Build a relevance model and an irrelevance model from the input query information. The relevance model is the probability distribution function of the natural language model of the query statement: the keywords are used to query the digital library system, and the top 5-50 returned documents are taken;
The irrelevance model is the complementary probability distribution function of the relevance model, built over all document collections in the digital library system;
Because in a language model built from the whole document set the query-relevant documents carry only a very small weight and query-irrelevant content is the dominant factor, the whole document collection can be used to build the irrelevance model.
c. For each word in the document whose summary is to be extracted, compute the probability that the word is generated under the relevance model and under the irrelevance model, and take the relevance-model probability minus the irrelevance-model probability as the relevance score of the word with respect to the query information;
d. Save the relevance score of each keyword into a queue, and smooth the queue;
e. Choose a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total as one document summary, put this fragment into the summary data set, and delete it from the queue;
f. Judge from the size of the threshold whether to continue looking for the next summary;
g. If so, repeat step e; otherwise return all documents in the summary data set as the summary information.
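For orientation only, a minimal Python sketch of the a-g loop described above follows. The helper names query_models, smooth and best_interval are hypothetical stand-ins for steps b-e (query_models would wrap retrieval plus the model estimation detailed later in the description); this is a sketch of the method's flow, not a definitive implementation.

```python
def personalized_summaries(query, profile, doc_words, theta):
    """Hedged sketch of steps a-g; helper names are hypothetical."""
    p_rel, p_irr = query_models(query, profile)                  # step b
    scores = smooth([p_rel(w) - p_irr(w) for w in doc_words])    # steps c-d
    summaries, h_prev = [], None
    while True:
        start, end, h_cur = best_interval(scores)                # step e
        if h_cur <= 0:                                           # no relevant run left
            break
        if h_prev is not None and h_prev / h_cur >= theta:       # step f: abandon current
            break
        summaries.append(doc_words[start:end + 1])               # keep the fragment
        scores[start:end + 1] = [float("-inf")] * (end - start + 1)  # delete from queue
        h_prev = h_cur
    return summaries                                             # step g
```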
In step c, computing the probability that the word is generated under the relevance model and the irrelevance model specifically comprises the following. The probability that the word is generated under the irrelevance model is computed as: for a given keyword $w$ and the whole document collection $C$, denote by $c(w, C)$ the number of times the keyword appears in the documents and by $|C|$ the total number of words in the whole collection; the probability that the irrelevance model $M_{ir}$ generates the keyword is then

$$P(w \mid M_{ir}) = P(w \mid M_C) = \frac{c(w, C)}{|C|}$$

where $M_C$ is the language model built from the whole collection.

The probability that the word is generated under the relevance model is computed in the following steps:

1) Use the query $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$. Each document $d \in R$ has $P(d \mid q)$, the probability that $d$ is retrieved under the condition of $q$. Compute the keyword probability $P(w \mid d)$ for the given document $d$, smoothed with the language model $M_C$ built from the whole collection, where $P(w \mid d)$ is computed as

$$P(w \mid d) = \lambda \, \frac{c(w, d)}{|d|} + (1 - \lambda) \, P(w \mid M_C)$$

Here $c(w, d)$ is the number of times $w$ appears in document $d$, $|d|$ is the length of the chosen relevant document, and the parameter $\lambda$ controls the influence of the word frequency on this probability; this is a common treatment in natural language models.
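For illustration, a minimal Python sketch of these two estimates follows; the interpolated (Jelinek-Mercer-style) form of the smoothing is an assumption, since the text only states that a parameter $\lambda$ controls the influence of the word frequency.

```python
from collections import Counter

def p_irrelevance(w, collection_tokens):
    """P(w|M_ir): collection frequency of w over the total collection length."""
    counts = Counter(collection_tokens)   # in practice, precompute once
    return counts[w] / len(collection_tokens)

def p_word_given_doc(w, doc_tokens, collection_tokens, lam=0.9):
    """Smoothed P(w|d): interpolate the in-document frequency with the
    collection model. The interpolation form is an assumption."""
    p_doc = doc_tokens.count(w) / len(doc_tokens)
    return lam * p_doc + (1 - lam) * p_irrelevance(w, collection_tokens)
```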
In step d, smoothing the queue specifically means: when computing the relevance score, with respect to the query information, of each word in the document whose summary is to be extracted, if the score of a word is much higher or much lower than the scores of the ten words before and after it, the current word is considered to fluctuate too strongly and is removed before the computation.
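A small sketch of this neighbourhood test follows, purely for illustration. The concrete outlier rule (distance from the local mean beyond a fixed multiple of the local spread, with the outlier replaced by that mean) is an assumption; the text only says that scores that are much too high or too low are removed.

```python
def smooth(scores, window=10, factor=3.0):
    """Damp keywords whose relevance fluctuates too strongly versus the
    ten words before and after them. The outlier test and the replacement
    by the local mean are assumptions, not prescribed by the text."""
    smoothed = list(scores)
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        neighbours = scores[lo:i] + scores[i + 1:hi]
        if not neighbours:
            continue
        mean = sum(neighbours) / len(neighbours)
        spread = max(abs(x - mean) for x in neighbours) or 1e-9
        if abs(scores[i] - mean) > factor * spread:
            smoothed[i] = mean   # set the fluctuating score aside
    return smoothed
```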
In step f, judging from the size of the threshold whether to continue looking for the next summary specifically means: preset a threshold value; if the relevance total of the previously extracted summary fragment divided by the relevance total of the currently extracted fragment is less than the preset threshold, keep the current summary information and repeat step e; if it is greater than the preset threshold, abandon the current summary information, end the extraction algorithm, and return all documents in the summary data set as the summary information.
In step c, the relevance-model probability minus the irrelevance-model probability is taken as the relevance score of the word with respect to the query information; the relevance scores are distributed in [-1, 1].
In step a, the user's personalization information means: the user's historical browsing data, or the personal preference information the user has previously used in the digital library system.
Compared with the prior art, the technical effects achieved by the present invention are as follows:
The technical solution formed by steps a-g solves the technical problem in the above prior art that "the extracted summary covers the document's main information poorly". Moreover, it does not fix the length of the summary, so summaries are obtained flexibly; when extracting a document summary it judges well the relevance between document fragments and the query; the extracted summary is robust to interference; and the article summary obtained with this method is more accurate than that obtained with traditional summarization algorithms. In particular:
When personalization information is input in step a, personalized automatic summarization is realized.
In step b, the relevance model takes the top 5-50 documents; the experiments (see the embodiment below) show extremely strong robustness to interference relative to the prior art.
In step c, taking the relevance-model probability minus the irrelevance-model probability as the relevance score of the word with respect to the query information eliminates error.
In step d, after the queue is smoothed, that is, when computing the relevance to the query of each word in the document whose summary is to be extracted, a word whose score is much higher or much lower than the scores of the ten words before and after it is considered to fluctuate too strongly and is removed before the computation. Smoothing each relevance score in this way reduces the impact on summary accuracy of words that are query-irrelevant but used frequently, such as common auxiliary words and articles.
In step e, choosing the combination of consecutive keywords with the highest total relevance in the queue means the algorithm has found the passage of the document most closely related to the query statement, so the extracted summary information is tied as tightly as possible to the query information.
In step f, setting the size of the threshold controls how much summary information is extracted.
Description of drawings
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments, in which:
Fig. 1 is the processing flowchart of the automatic summarization;
Fig. 2 is a schematic diagram of the relevance between each word w in document d and the query;
Fig. 2a is a schematic diagram of the document summary extraction algorithm;
Fig. 3a is a schematic diagram of the accuracy of three automatic summary generation methods executed under noisy conditions;
Fig. 3b is a schematic diagram of the accuracy of the three automatic summary generation methods executed without noise;
Fig. 4 is a graph of the relation between the size of the relevant document set and the F-measure.
Embodiment
Embodiment 1
The basic embodiment of the present invention comprises the following steps:
a. Input query information, the query information comprising keywords and the user's personalization information;
b. Build a relevance model and an irrelevance model from the input query information. The relevance model is the probability distribution function of the natural language model of the query statement: the keywords are used to query the digital library system, and the top 5-50 returned documents are taken;
The irrelevance model is the complementary probability distribution function of the relevance model, built over all document collections in the digital library system;
Because in a language model built from the whole document set the query-relevant documents carry only a very small weight and query-irrelevant content is the dominant factor, the whole document collection can be used to build the irrelevance model.
c. For each word in the document whose summary is to be extracted, compute the probability that the word is generated under the relevance model and under the irrelevance model, and take the relevance-model probability minus the irrelevance-model probability as the relevance score of the word with respect to the query information;
d. Save the relevance score of each keyword into a queue, and smooth the queue;
e. Choose a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total as one document summary, put this fragment into the summary data set, and delete it from the queue;
f. Judge from the size of the threshold whether to continue looking for the next summary;
g. If so, repeat step e; otherwise return all documents in the summary data set as the summary information.
With the technical solution formed by the above steps, on a standard data set the accuracy of the extracted article summary is higher than that of the article summary obtained with traditional summarization algorithms, and when simulating real data conditions the proposed algorithm is highly robust to interference.
The best mode of carrying out the invention is described in detail below:
The automatic summarization system introduced here adopts language-modeling techniques in the processing and weighting of documents, and weights sentences with word frequency statistics. The processing flow of the automatic summarization is shown in Fig. 1: after the user inputs query information, a relevance model and an irrelevance model are built from the user's input, and a personalized summary is produced through the sentence analysis model (word sequence analysis, WSA for short).
The process of summary extraction with the relevance model and the irrelevance model:
Summary extraction here is based on statistical language models. In this work, two language models are constructed: one is the relevance model, defined as $M_R$; the other is the irrelevance model, defined as $M_{ir}$. The relevance model $M_R$ is the probability distribution function of the natural language model of the query statement; conversely, the irrelevance model $M_{ir}$ is the complementary probability distribution function of that language model. Using conditional probability as in natural language processing, for each keyword $w$ the probability $P(w \mid M_R)$ that $M_R$ generates $w$ is computed, and in addition the probability $P(w \mid M_{ir})$ that $M_{ir}$ generates $w$ is computed; the definitions of the two formulas are set out in detail in a later part.
A document d is represented by a sequence of keywords $w_1, w_2, \ldots, w_n$, where n is the number of keywords in d. In Fig. 2 the x axis represents the keyword queue and the y axis shows how the relevance between each keyword and the query varies. The best summary information of the document is then the combination of consecutive keywords with the highest relevance, as in Fig. 2. The relevance of each keyword w in document d to the query is quantified as

$$r(w) = P(w \mid M_R) - P(w \mid M_{ir})$$

Subtracting the probability that the keyword is generated under the irrelevance model from the probability that it is generated under the relevance model serves to eliminate error; of course this step could be handled in other forms, but the experiments in this work show the treatment is very effective, and with it the values fall in [-1, 1]. It can be seen intuitively from the figure that when w is relevant to the query its quantified value is positive, and when irrelevant the value is negative. Therefore the summary information is best embodied by the interval [s, t] on the x axis that satisfies

$$[s, t] = \arg\max_{[s, t]} \sum_{i=s}^{t} r(w_i)$$

and the corresponding document summary is $w_s, w_{s+1}, \ldots, w_t$.
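The maximizing interval can be found in linear time with the classic maximum-subarray scan; a small sketch follows (this concrete algorithm choice is ours for illustration, the text only requires the interval with the maximum score sum).

```python
def best_interval(scores):
    """Return (start, end, total) of the contiguous keyword run with the
    maximum summed relevance score (Kadane's maximum-subarray scan)."""
    best = (0, 0, scores[0])
    cur_start, cur_sum = 0, 0.0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_start, cur_sum = i, s    # restart the candidate run at i
        else:
            cur_sum += s
        if cur_sum > best[2]:
            best = (cur_start, i, cur_sum)
    return best
```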
Document summary extraction algorithm:
Input: document d, query q, relevance threshold θ
Output: relevant snippet set S
Step 1: score d as a word sequence and store the scores in a queue
1.1 a = null; // a is a queue storing the document's keywords and their scores
1.2 for each word w ∈ d
do
{
p_rel = getRMProb(w, q); // probability of keyword w under the relevance model, P(w|M_R)
p_irr = getIRMProb(w, q); // probability of keyword w under the irrelevance model, P(w|M_ir)
a.append(p_rel - p_irr); // relevance score of w with respect to query q
}
Step 2: identification of relevant snippets
2.1 S = null; Scur = null; Sprev = null // S stores the set of extracted summary fragments; Scur and Sprev temporarily hold one summary each
2.2 a.smooth(); // smoothing
2.3 Scur = a.getMaxInterval(); // store in Scur, as a summary, the document fragment with the highest total keyword relevance
2.4 Hcur = Scur.sum(); // save the relevance total of Scur in Hcur
2.5 do { // according to the threshold θ, decide how many summaries to extract; each kept summary is saved in S
a.maskSubqueue(Scur); // delete the current fragment from the queue
S.add(Scur);
Sprev = Scur; Hprev = Hcur;
Scur = a.getMaxInterval();
Hcur = Scur.sum();
} while (Hprev / Hcur < θ);
2.6 Return S
Algorithm description:
d is the document to be queried, q is the query statement (or the user's personalization information), and the threshold θ controls how many summaries are extracted. According to the computed relevance between each keyword and the query, the algorithm automatically extracts summaries that express the document's information, and judges the number of summaries from the size of the threshold.
Step 1: for each word w, compute c = P(w|M_R) − P(w|M_ir), and deposit the values into the queue a.
Step 2: after smoothing a, find the document fragment Scur with the maximum sum of c values in the queue a and save it into the set S, at the same time saving the sum of the c values of Scur in Hcur. Then do the same processing on the words remaining in a: find the maximal Scur and Hcur among the remaining document fragments and save each Scur into S, until the previous total Hprev divided by the current total Hcur reaches the threshold θ.
A schematic of the algorithm's execution is shown in Fig. 2a.
Two points in the pseudocode deserve note. First, before computing the maximal sequence, larger fluctuations are first eliminated by smoothing. Concretely, the program compares the relevance to the query of each keyword in the document with that of the ten keywords before and after it; if the current keyword's relevance is much higher or much lower than that of its ten neighbours on either side, the current keyword is considered to fluctuate too strongly and is removed before the computation. This step is the call a.smooth() in the algorithm.
Second, the algorithm can extract several relevant text snippets. To realize this, it repeatedly extracts a document summary and revises the remaining text fragments, recording the relevance of each extracted summary, and extracts snippets repeatedly until the ratio Hprev/Hcur of the relevance totals of the two most recently extracted summaries reaches the threshold θ.
Implementation of the probability computation in the relevance model and the irrelevance model:
The probabilities with which the query relevance model $M_R$ and the query irrelevance model $M_{ir}$ generate keywords are set out concretely below. In the irrelevance model $M_{ir}$, the probability of a keyword $w$ can be obtained by the rule of the collection language model $M_C$, because inside a language model built from the whole document set, the query-relevant documents carry only a very small weight and query-irrelevant content is the dominant factor.
For a given keyword $w$ and the whole document set, denote by $c(w, C)$ the number of times $w$ occurs in the documents and by $|C|$ the total number of words in the whole collection (counting repetitions). This work therefore estimates the probability that the irrelevance model generates the keyword as

$$P(w \mid M_{ir}) = P(w \mid M_C) = \frac{c(w, C)}{|C|}$$

To estimate the parameters of the query relevance model $M_R$, on the other hand, a sample set highly relevant to the query is needed, and such sample information may not be easy to find. To address this problem, we assume that the query relevance model can be built from the retrieved documents.
Concretely, this is implemented in the following steps:
Use the query $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$ (the retrieval here can be realized with software with a search function, such as Lucene); every document produced by this step contains all or most of the query keywords. In addition, each document $d$ in $R$ has $P(d \mid q)$, representing the probability that $d$ is retrieved under the condition of $q$.
Compute the keyword probability $P(w \mid d)$ for the given document $d$, smoothed with the language model $M_C$ built from the whole collection:

$$P(w \mid d) = \lambda \, \frac{c(w, d)}{|d|} + (1 - \lambda) \, P(w \mid M_C)$$

where $c(w, d)$ is the number of times $w$ appears in document $d$, $|d|$ is the length of the chosen relevant document, and the parameter $\lambda$ controls the influence of the word frequency on this probability, a common treatment in natural language models. The probability of the relevance model is then approximated by computing, over the retrieved documents,

$$P(w \mid M_R) \approx \sum_{d \in R} P(w \mid d) \, P(d \mid q)$$

Once the probabilities of generating keyword $w$ under $M_R$ and $M_{ir}$ have been computed, the preceding algorithm can output the query-relevant text snippets.
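A compact runnable sketch of these estimates, for illustration only; the Jelinek-Mercer-style interpolation and the uniform retrieval probabilities $P(d \mid q)$ are assumptions where the text leaves the exact forms open.

```python
from collections import Counter

def build_models(retrieved_docs, collection_tokens, lam=0.9):
    """Estimate P(w|M_R) and P(w|M_ir) as sketched above.
    retrieved_docs: list of token lists for the top-ranked documents R;
    collection_tokens: all tokens in the library (the collection C).
    Uniform P(d|q) over R and the interpolation form are assumptions."""
    coll_counts = Counter(collection_tokens)
    coll_len = len(collection_tokens)

    def p_irr(w):                          # P(w|M_ir) = c(w, C) / |C|
        return coll_counts[w] / coll_len

    doc_counts = [Counter(d) for d in retrieved_docs]
    p_dq = 1.0 / len(retrieved_docs)       # assumed uniform retrieval probability

    def p_rel(w):                          # P(w|M_R) ≈ Σ_d P(w|d) P(d|q)
        total = 0.0
        for d, counts in zip(retrieved_docs, doc_counts):
            p_wd = lam * counts[w] / len(d) + (1 - lam) * p_irr(w)
            total += p_wd * p_dq
        return total

    return p_rel, p_irr
```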
Advantages shown by the technique in concrete implementation:
To better set out the advantages of the proposed automatic summarization technique, the data used in the test phase come from TREC DOE (U.S. Department of Energy), an industry-standard test data set. Testing our algorithm and related algorithms on this standard data set, the proposed algorithm is clearly better than the other algorithms both in summary accuracy and in robustness to interference.
The data set contains 226,087 documents and 80 benchmark query statements. Each document is a publication abstract submitted to DOE, with an average length of 149 words; among the 226,087 abstract documents, 2,352 summaries are relevant to the 80 query statements.
Before processing, these documents are first combined with one another to form a number of relevant and irrelevant long documents, and the system is tested with these synthetic documents. The documents are classified as follows:
(a) the simple relevant data fragment set;
(b) the compound relevant data fragment set;
(c) the simple relevant data fragment set containing noise (the set with interfering data).
Simple relevant data fragment set (S1): each document in this set is a new document synthesized from one relevant document summary and several randomly chosen independent summaries.
Compound relevant data fragment set (S2): each document in this set is generated by combining two to five relevant documents with some randomly chosen irrelevant documents, where the two to five relevant documents are all relevant to the same query. Of the 80 benchmark queries, only 54 have such inter-related documents, so only these 54 are selected in the test.
Simple relevant data fragment set containing interference (S3): this data set adds interfering information to the first data set. Into each document of the S1 set, partial phrases of the query statement are inserted into the irrelevant document sections as interference. This process simulates the real situation better.
For the identified document fragments, precision and recall are computed. The concrete method is as follows: define $L_{true}$ as the true length of the original summary, $L_{ext}$ as the length of the extracted document fragment, and $L_{match}$ as the length of the overlap between the true summary and the extracted fragment. Then

$$Prec = \frac{L_{match}}{L_{ext}}, \qquad Rec = \frac{L_{match}}{L_{true}}$$

In addition, this work uses a combining measure, the F-measure

$$F = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec}$$

which is high only when both the precision $Prec$ and the recall $Rec$ are high.
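These length-based measures transcribe directly into code; a minimal sketch:

```python
def length_based_scores(l_true, l_ext, l_match):
    """Precision, recall and F-measure over summary lengths, as defined above."""
    precision = l_match / l_ext
    recall = l_match / l_true
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```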
Besides demonstrating the strength of this algorithm, two other summary extraction algorithms were contrasted under the same test data. The first is the FL-Win algorithm: given an estimated summary size k, it checks the correlation between the query and each text fragment of k words, and chooses the most relevant as the summary information. In this experiment the window size is 149, i.e. the true average length of the text fragments. When the optimal summary length is not given, this algorithm must first estimate a summary length and assume it to be the optimal one; when a good average summary length is available the method works well, but in real deployments the best summary length is hard to obtain in advance.
The second is the Cos-Win method, which can extract data fragments of varying length as summaries according to different queries: by comparing the cosine similarity between text fragments and the query, the fragment with the highest similarity is elected as the summary information.
Following these two basic methods, the text separation algorithms compute at every 25 words to reduce the computational complexity. In the Cos-Win method, under the same query condition, a representative text fragment is selected as the initial summary information from candidate text fragments in the range of 50 to 600 words. After computing the cosine similarity of each initial fragment, the window is grown by 25 words and the cosine similarity of the new document fragment is computed again, searching for the text fragment with the maximum cosine similarity value.
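A rough sketch of the Cos-Win baseline under these settings (25-word stride, 50-600-word windows) follows; the bag-of-words vectorization is an assumption, since the text does not specify it.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cos_win(doc_words, query_words, stride=25, min_len=50, max_len=600):
    """Return the (start, end) window of doc_words most similar to the query."""
    q = Counter(query_words)
    best, best_sim = None, -1.0
    for start in range(0, max(1, len(doc_words) - min_len + 1), stride):
        for length in range(min_len, max_len + 1, stride):
            window = doc_words[start:start + length]
            if len(window) < min_len:
                break
            sim = cosine(Counter(window), q)
            if sim > best_sim:
                best, best_sim = (start, start + len(window)), sim
    return best, best_sim
```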
In the experimental comparison stage, the proposed algorithm is first compared with the two basic summary extraction algorithms on the same data (data set S1). The experimental results are shown in Table 1. In the table, WSA is best on recall and on the F-measure, and on most precision figures WSA is also higher, the exception being the comparison with FL-Win: because WSA is designed to output longer document fragments, its precision declines somewhat, but its recall is markedly improved. Table 1 also shows that the F-measure of WSA is generally around 0.799, reflecting the good execution results of WSA.
Table 1. Comparison of the three summary extraction algorithms
Likewise, the experimental data of Table 1 show that WSA also achieves good execution results on the S2 data set.
In the interference-resistance test, the experimental data we adopt include documents both relevant and irrelevant to the query, again to better simulate the summary generation process in real situations. In document set S3, only one fragment per document is a truly relevant summary; the rest are interfering data, and each irrelevant document fragment used as interference contains partial text of the query statement. Fig. 3a shows the execution of the three methods on data fragments of different lengths; in the figure, the test text fragments are divided into groups by length, namely 0-50, 51-100, 101-200, 201-300, 301-400 and 401-500, with the x axis representing fragment length. Comparing with Fig. 3b, one observes that WSA's results with interfering data are roughly consistent with its results without interference, whereas the F-measure of FL-Win and Cos-Win drops clearly because of the interfering data. The figures show that WSA not only extracts accurate summary information in the general case but also performs well when interfering information is included, because in noisy documents WSA can screen out text that is superficially query-relevant but actually irrelevant.
Parameters used in the experiments:
The language model composing WSA has two main parameters: the first is the size of the set $R$ of retrieved relevant documents, the second is the smoothing parameter $\lambda$. To determine the size of $R$, the F-measure was tested with different numbers of relevant documents (see Fig. 4). As the figure shows, when the relevant document set contains between 5 and 40 documents the execution results tend to be stable. Using fewer than 5 documents does not provide enough document information, while using more than 50 documents introduces too much interfering information. In the experiments, unless specially indicated, 15 documents are used to compute the language relevance model.
In the parameter estimation of a language model, smoothing is very important: it guarantees that no keyword appears with zero probability. In the experiments of this work, the maximum F-measure is observed when $\lambda$ is set to 0.9, i.e. when hardly any smoothing is carried out; this is the more ideal operating point of WSA. The reason is that the query-relevant language model in these experiments is built from the query-relevant documents, so for most keywords no zero-probability situation arises, and smoothing is naturally not needed to eliminate zero probabilities.
Claims (5)
1. An automatic personalized abstracting method in a digital library system, characterized by comprising the steps:
a. Input query information, the query information comprising keywords and the user's personalization information;
b. Build a relevance model and an irrelevance model from the input query information, the relevance model meaning the probability distribution function of the natural language model of the query information over the relevant documents, the relevant documents meaning the top 5-50 documents obtained by querying the digital library system with the keywords;
the irrelevance model being the complementary probability distribution function of the relevance model, namely the probability distribution function of the natural language model of the query information over the irrelevant documents, the irrelevant documents meaning all document collections in the digital library system;
c. For each keyword in the document whose summary is to be extracted, compute the probability that the keyword is generated under the relevance model and under the irrelevance model, and take the relevance-model probability minus the irrelevance-model probability as the relevance score of the keyword with respect to the query information;
d. Save the relevance score of each keyword into a queue, and smooth the queue;
e. Choose a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total as one document summary, put this fragment into the summary data set, and delete it from the queue;
f. Judge from the size of the threshold whether to continue looking for the next document summary;
g. If so, repeat step e; otherwise return all document fragments in the summary data set as the summary information.
2. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that: in step c, computing the probability that the keyword is generated under the relevance model and the irrelevance model specifically comprises the following. The probability that the keyword is generated under the irrelevance model is computed as: for a given keyword $w$ and the whole document collection $C$, the number of times the keyword appears in the documents is denoted $c(w, C)$ and the total number of words in the whole collection is denoted $|C|$; the probability that the irrelevance model generates the keyword is then

$$P(w \mid M_{ir}) = P(w \mid M_C) = \frac{c(w, C)}{|C|}$$

where $M_C$ represents the language model built from all documents in the whole digital library system;

the probability that the keyword is generated under the relevance model is computed in the following steps:

1) use the query $q$ to retrieve documents and define the set of retrieved query-relevant documents as $R$; each document $d \in R$ has $P(d \mid q)$, representing the probability that $d$ is retrieved under the condition of $q$; compute the keyword probability $P(w \mid d)$ for the given document $d$, smoothed with the language model $M_C$ built from the whole collection, where $P(w \mid d)$ is computed as

$$P(w \mid d) = \lambda \, \frac{c(w, d)}{|d|} + (1 - \lambda) \, P(w \mid M_C)$$
3. The automatic personalized abstracting method in a digital library system according to claim 1 or 2, characterized in that: in step d, smoothing the queue specifically means: when computing the relevance score, with respect to the query information, of each keyword in the document whose summary is to be extracted, if the score of a keyword is much higher or much lower than the scores of the ten keywords before and after it, the current keyword is considered to fluctuate too strongly and is removed before the computation.
4. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that: in step c, the relevance-model probability minus the irrelevance-model probability is taken as the relevance score of the keyword with respect to the query information, the relevance scores being distributed in [-1, 1].
5. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that: in step a, the user's personalization information means: the user's historical browsing data, or the personal preference information the user has previously used in the digital library system.