
CN102222119B - Automatic personalized abstracting method in digital library system - Google Patents


Info

Publication number
CN102222119B
CN102222119B · Application CN201110213750A
Authority
CN
China
Prior art keywords
document
correlation
model
key word
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110213750
Other languages
Chinese (zh)
Other versions
CN102222119A (en)
Inventor
李庆 (Li Qing)
刘家芬 (Liu Jiafen)
罗旭斌 (Luo Xubin)
张晨 (Zhang Chen)
胡川 (Hu Chuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD
Original Assignee
CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD filed Critical CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD
Priority to CN 201110213750 priority Critical patent/CN102222119B/en
Publication of CN102222119A publication Critical patent/CN102222119A/en
Application granted granted Critical
Publication of CN102222119B publication Critical patent/CN102222119B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic personalized abstracting method in a digital library system, in the technical field of information processing. The method comprises the following steps: a, inputting query information; b, building a relevant model and an irrelevant model from the input query information; c, for each word in the document whose summary information is to be obtained, computing the probability of the word being generated by the relevant and irrelevant models; d, saving the relevance of each keyword into a queue; e, selecting a series of consecutive keywords in the queue and summing their relevance, the document fragment with the highest relevance serving as one piece of the document abstract; f, deciding from a threshold whether to continue looking for the next abstract fragment; g, if so, repeating step e; if not, returning all documents in the abstract data set as the summary information. Compared with conventional summarization algorithms, the article abstracts obtained by this method are more accurate, and under conditions simulating real data the method has strong anti-interference capability.

Description

Automatic personalized abstracting method in a digital library system
Technical field
The present invention relates to the technical field of information processing, and specifically to an automatic personalized abstracting method in a digital library system.
Background technology
Query-based automatic summarization returns, for a given document, one or more summary fragments associated with the query: after a text collection is built or updated, each document is automatically divided into a number of discrete summary fragments.
In current automatic summarization, one method estimates the summary length in advance from documents related to the current one; once the approximate size of the document summary is known, it searches for the fragment of the designated length that best matches the query and returns it as the article abstract.
Another method first cuts the document into one or more semantic information blocks by preprocessing. Once the semantic blocks are fixed, the relevance between the query statement and each block is computed, and the block with the highest relevance to the query that can cover the document's main information is selected as the document summary.
However, in the first method the length of the summary is difficult to determine in advance. In the second method, preprocessing fixes the start and end positions of the summary; if the document's main information is spread over several different segments after preprocessing, the extracted summary covers the document's main information poorly. For example, a document can be cut into non-overlapping fragments, but such cutting has a latent problem: when the best document summary needs to cover the content of two adjacent segments, the preprocessing has already separated those fragments, and the automatically extracted summary is incomplete.
The Chinese patent document with publication number CN 101231634, published on July 30, 2008, discloses a method of automatically extracting multi-document abstracts using graph partitioning, comprising the following steps: performing sentence boundary segmentation and representing each document by the segmented sentences; converting the sentences into vectors, computing the pairwise similarity between sentences to form a sentence incidence matrix, reducing the matrix by a specified threshold and normalizing it; introducing latent topic structure mining into multi-document summarization, dividing the document set into different latent sub-topics so that the summarization task becomes the selection and extraction of sub-topics; the graph partitioning method guarantees the importance of the sub-topic containing each sentence from the global perspective while guaranteeing low content redundancy between different sub-topics from the local perspective, thereby effectively improving summary quality.
However, the prior art represented by the above patent document still has the following technical problem: the CN 101231634 patent determines weight vectors by sentence, so the summary is segmented by sentence, and the summary extracted in this way covers the document's main information poorly.
Summary of the invention
To solve the above technical problems, the present invention proposes an automatic personalized abstracting method in a digital library system. This method solves the technical problem in the above prior art that "the extracted summary covers the document's main information poorly". Furthermore, it does not fix the length of the summary, so summary information can be obtained flexibly; when extracting a document summary it can judge well the relevance between a document fragment and the query; the extracted summary has strong anti-interference capability; and the article abstracts obtained with this method are more accurate than those obtained with traditional summarization algorithms.
The present invention is realized by the following technical scheme:
An automatic personalized abstracting method in a digital library system, characterized by comprising the following steps:
a. Input query information; the query information comprises keywords and the user's personalized information;
b. Build a relevant model and an irrelevant model from the input query information. The relevant model is the probability distribution function of the natural language model of the query statement; the digital library system is queried with the keywords and the top 5-50 documents are retrieved;
The irrelevant model is the complementary probability distribution function of the relevant model, built over the whole document collection in the digital library system;
Because, in a language model built from the whole document collection, the documents relevant to the query carry only a very small weight while query-irrelevant content dominates, the whole collection can be used to build the irrelevant model.
c. For each word in the document whose summary information is to be obtained, compute the probability of the word being generated under the relevant and irrelevant models, and take the relevant-model probability minus the irrelevant-model probability as the relevance of the word to the query information;
d. Save the relevance of each keyword into a queue and smooth the queue;
e. Select a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total relevance as one document abstract, put it into the abstract data set, and delete that fragment from the queue;
f. Decide from a threshold whether to continue searching for the next abstract fragment;
g. If so, repeat step e; otherwise return all documents in the abstract data set as the summary information. The overall flow of these steps is sketched below.
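For illustration only, the flow of steps a-g can be sketched in Python as follows. This is a minimal sketch, not the invention's implementation: build_models, prob, smooth, extract_max_fragment and mask are hypothetical helpers standing in for the operations the steps describe.

def personalized_abstract(document, query, theta, k_top=15):
    # Steps a-b: relevant model from the top documents retrieved for the
    # query, irrelevant model from the whole collection (hypothetical helper).
    m_rel, m_irrel = build_models(query, k_top)
    # Steps c-d: relevance of each word is P(w|Mr) - P(w|M~r), held in a queue.
    scores = [prob(w, m_rel) - prob(w, m_irrel) for w in document]
    scores = smooth(scores)                      # remove large fluctuations
    # Steps e-g: repeatedly take the best fragment until the ratio test fails.
    summary, h_prev = [], None
    while True:
        start, end, h_cur = extract_max_fragment(scores)
        if h_prev is not None and h_prev / h_cur >= theta:
            break                                # step f: discard and stop
        summary.append(document[start:end + 1])  # step e: keep the fragment
        mask(scores, start, end)                 # delete it from the queue
        h_prev = h_cur
    return summary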
In step c, computing the probability of the word being generated under the relevant and irrelevant models specifically comprises the following. The method for the probability of the word under the irrelevant model is: given the keyword $w$ and the whole document collection $C$, let $tf(w, C)$ denote the number of times the keyword $w$ appears in the documents of the collection and $|C|$ the total number of words in the whole collection; the probability of the keyword $w$ being generated by the irrelevant model is then

$P(w \mid M_{\bar r}) = \frac{tf(w, C)}{|C|}$

The steps of the method for the probability of the word under the relevant model are:
1) Query with $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$; each document $d$ in $R$ has a probability $P(q \mid d)$ of being retrieved under the condition of $d$. Compute the probability $P(w \mid d)$ of the keyword given the document $d$, smoothed with the language model $M_C$ built from the whole collection, where $P(w \mid d)$ is computed by the following formula:

$P(w \mid d) = \lambda \frac{tf(w, d)}{|d|} + (1 - \lambda) P(w \mid M_C)$

2) Approximate $P(w \mid M_r)$ by computing $\sum_{d \in R} P(w \mid d) P(q \mid d)$; this probability is

$P(w \mid M_r) \approx \sum_{d \in R} P(w \mid d)\, P(q \mid d)$

where $tf(w, d)$ is the number of times $w$ appears in the document $d$, $|d|$ is the length of the chosen relevant document $d$, and the parameter $\lambda$ controls the influence of term frequency on this probability, a common device in natural language models.
In step d, smoothing the queue specifically means: after computing the relevance to the query information of each word in the document to be summarized, if a word's relevance is much higher or much lower than that of the ten words before and after it, the current word is regarded as a large fluctuation and is removed before the computation.
In step f, deciding from the threshold whether to continue searching for the next abstract fragment specifically means: a threshold value is preset; if the total relevance of the previously extracted summary fragment divided by the total relevance of the currently extracted fragment is less than the set threshold, the current fragment is kept and step e is repeated; if it is greater than the set threshold, the current fragment is discarded, the extraction algorithm ends, and all documents in the abstract data set are returned as the summary information.
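Expressed as a minimal check (the variable names are illustrative):

def keep_extracting(h_prev, h_cur, theta):
    # Step f: continue while Hprev / Hcur < theta. For example, with
    # theta = 2.0, extraction stops as soon as a newly found fragment
    # scores less than half the previously kept one.
    return h_prev / h_cur < theta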
In step c, the relevant-model probability minus the irrelevant-model probability is taken as the relevance of the word to the query information; the relevance is distributed in [-1, 1].
In step a, the user's personalized information refers to the user's historical browsing data or the personal preference information the user has previously used in the digital library system.
Compared with the prior art, the technical effects achieved by the present invention are as follows:
The technical scheme formed by steps a-g solves the technical problem in the above prior art that "the extracted summary covers the document's main information poorly". Moreover, it does not fix the length of the summary, so summary information can be obtained flexibly; when extracting a document summary it judges well the relevance between document fragments and the query; the extracted summary resists interference strongly; and the article abstracts obtained with this method are more accurate than those obtained with traditional summarization algorithms. In particular:
When personalized information is input in step a, personalized automatic summarization is realized.
In step b, the relevant model is built from the top 5-50 retrieved documents; experiments (see the embodiment section below) show extremely strong anti-interference capability relative to the prior art.
In step c, subtracting the irrelevant-model probability from the relevant-model probability as the word's relevance to the query information eliminates error.
In step d, the queue is smoothed: after computing the relevance to the query of each word in the document to be summarized, any word whose relevance is much higher or much lower than that of the ten words before and after it is treated as a large fluctuation and removed before the computation. Smoothing each relevance score in this way reduces the influence on summary accuracy of auxiliary words, articles and other words that are unrelated to the query but occur frequently.
In step e, the consecutive keyword run with the highest combined relevance is chosen from the queue; that is, the algorithm finds the passage of the document most closely related to the query statement, so the extracted summary information is tied as tightly as possible to the query information.
In step f, the threshold setting controls how much summary information is extracted.
Description of drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments, in which:
Fig. 1 is the processing flow chart of automatic summarization;
Fig. 2 is a schematic diagram of the relevance between the words w of a document d and the query;
Fig. 2a is a schematic diagram of the document abstract extraction algorithm;
Fig. 3a shows the accuracy of three automatic summary generation methods executed under noisy conditions;
Fig. 3b shows the accuracy of the three methods executed without noise;
Fig. 4 is a graph of F-measure against the size of the relevant document set.
Embodiment
Embodiment 1
The basic embodiment of the present invention comprises the following steps:
a. Input query information; the query information comprises keywords and the user's personalized information;
b. Build a relevant model and an irrelevant model from the input query information. The relevant model is the probability distribution function of the natural language model of the query statement; the digital library system is queried with the keywords and the top 5-50 documents are retrieved;
The irrelevant model is the complementary probability distribution function of the relevant model, built over the whole document collection in the digital library system;
Because, in a language model built from the whole document collection, the documents relevant to the query carry only a very small weight while query-irrelevant content dominates, the whole collection can be used to build the irrelevant model.
c. For each word in the document whose summary information is to be obtained, compute the probability of the word being generated under the relevant and irrelevant models, and take the relevant-model probability minus the irrelevant-model probability as the relevance of the word to the query information;
d. Save the relevance of each keyword into a queue and smooth the queue;
e. Select a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total relevance as one document abstract, put it into the abstract data set, and delete that fragment from the queue;
f. Decide from the threshold whether to continue searching for the next abstract fragment;
g. If so, repeat step e; otherwise return all documents in the abstract data set as the summary information.
With the technical scheme formed by the above steps, the accuracy of the extracted abstracts on a standard data set is higher than that of abstracts obtained with traditional summarization algorithms, and under simulated real data the proposed algorithm has very strong anti-interference capability.
Embodiment 2
The preferred embodiment of the present invention is described in detail below:
The automatic summarization system we introduce applies language-modeling techniques to the processing and weighting of documents, and weights sentences by word-frequency statistics. The processing flow of the automatic summarization is shown in Fig. 1: after the user inputs query information, the relevant and irrelevant models are built from the input query, and a personalized summary is produced through the sentence analysis model (Word Sequence Analysis, WSA for short).
The abstract extraction process using the relevant and irrelevant models:
Our summary extraction is based on statistical language models. In our research we construct two language models: a relevant model, defined as $M_r$, and an irrelevant model, defined as $M_{\bar r}$. The relevant model $M_r$ is the probability distribution function of the natural language model of the query statement; conversely, the irrelevant model $M_{\bar r}$ is the complementary probability distribution function of that language model. Using conditional probability in natural language processing, for each keyword $w$ we compute the probability $P(w \mid M_r)$ that it is generated by $M_r$, and likewise the probability $P(w \mid M_{\bar r})$ that it is generated by $M_{\bar r}$; the precise definitions of both quantities are set out later.
A document $d$ is represented by a series of keywords $w_1, w_2, \ldots, w_n$, where $n$ is the number of keywords in $d$. In Fig. 2 the x-axis represents the keyword queue and the y-axis the variation of each keyword's relevance to the query. The best summary information of the document is then the combination of consecutive keywords with the highest relevance, as in Fig. 2. The relevance of each keyword $w$ of document $d$ to the query is quantified as

$c(w) = P(w \mid M_r) - P(w \mid M_{\bar r})$  (Formula 1)

Subtracting the probability of the keyword under the irrelevant model from its probability under the relevant model serves to eliminate error; this step could of course be handled in other forms, but the experiments in this work show this treatment to be very effective, and with it the distribution of the quantity falls within $[-1, 1]$.
As can be seen intuitively from the figure, when $w$ is relevant to the query its quantified value is positive, and when it is irrelevant the value is negative. Therefore, the interval $[s, t]$ on the x-axis that best embodies the document's summary information is the one satisfying

$[s, t] = \arg\max_{s,t} \sum_{i=s}^{t} c(w_i)$  (Formula 2)

and the corresponding document summary is $w_s, \ldots, w_t$.
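Because word relevances can be positive or negative, finding the interval [s, t] that maximizes the sum in formula 2 is the classic maximum-subarray problem. A sketch using Kadane's algorithm follows; the patent does not name the algorithm it uses for this step, so this is only one plausible realization of getMaxInterval below:

def get_max_interval(c):
    # Return (s, t, best_sum): the contiguous interval of the relevance
    # queue c with the maximal sum (formula 2).
    best_sum, best_s, best_t = float("-inf"), 0, 0
    cur_sum, cur_s = 0.0, 0
    for i, v in enumerate(c):
        if cur_sum <= 0:
            cur_sum, cur_s = v, i          # start a new interval at i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum, best_s, best_t = cur_sum, cur_s, i
    return best_s, best_t, best_sum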
The extraction algorithm of the document summary:
Input: document d, query q, relevance threshold θ
Output: relevant snippet set S
Step 1: represent d as a word sequence and store the relevance scores in a queue
1.1 a = null;                        // a is a queue storing the relevance of the document's keywords
1.2 for each word w ∈ d do {
        P(w|Mr)  = getRMProb(w, q);  // probability of keyword w under the relevant model for query q
        P(w|M~r) = getIRMProb(w, q); // probability of keyword w under the irrelevant model
        c = P(w|Mr) - P(w|M~r);      // relevance of the keyword to the query
        a.add(c);
    }
Step 2: identification of relevant snippets
2.1 S = null; Scur = null; Sprev = null; // S stores the set of extracted summary fragments; Scur, Sprev each hold one summary temporarily
2.2 a.smooth();                      // smoothing
2.3 Scur = a.getMaxInterval();       // the fragment with the highest combined keyword relevance, stored in Scur as a summary
2.4 Hcur = Scur.sum();               // the total relevance of Scur, saved in Hcur
2.5 do {                             // the threshold θ decides how many summaries are extracted and saved in S
        S.add(Scur);                 // keep the current summary
        a.maskSubqueue(Scur);        // delete it from the queue
        Sprev = Scur; Hprev = Hcur;
        Scur = a.getMaxInterval();   // best fragment among the remaining words
        Hcur = Scur.sum();
    } while (Hprev / Hcur < θ);      // stop once the relevance ratio reaches θ
2.6 return S;
Algorithm description: d is the document collection to be queried, q is the query statement (or the user's personalized information), and the threshold θ controls how many summary fragments are extracted. From the computed relevance of each keyword to the query, the algorithm automatically extracts the summaries that can express the document's information, and judges from the size of the threshold how many summaries to return.
Step 1: for each word w, compute P(w|Mr) - P(w|M~r) = c and deposit the values in the queue a.
Step 2: after smoothing a, find the document fragment Scur with the maximal summed c-value in the queue a and save it into the set S, at the same time saving the summed c-value of Scur in Hcur. Then apply the same processing to the remaining words w in a, finding the maximal Scur and Hcur among the remaining document fragments, and keep saving each Scur into S until the previous Hprev divided by the current Hcur reaches the threshold θ.
The execution of the algorithm is illustrated in Fig. 2a.
Two points in the whole algorithm described in pseudocode deserve attention. First, before the maximal interval is computed, some larger fluctuations are eliminated by smoothing. Concretely, the program compares the query relevance of each keyword in the document with that of the ten keywords before and after it; if the current keyword's relevance to the query is much higher or much lower than that of those neighbours, the current keyword is regarded as a large fluctuation and is removed before the computation. This step is the a.smooth() call in the algorithm.
Second, this algorithm can extract several relevant text snippets. To realize this, the algorithm repeatedly extracts a document summary and masks the corresponding remaining text fragment, recording the relevance of each extracted summary, and keeps extracting text snippets as long as the ratio of the relevance scores of the two most recently extracted summaries stays below the threshold θ.
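A sketch of that smoothing step: the ±10-word window comes from the patent, while the deviation criterion and the replacement of an outlier by its local mean are illustrative assumptions:

def smooth(scores, window=10, ratio_limit=3.0):
    # Neutralize words whose relevance deviates strongly from the mean of
    # the ten words before and after them.
    out = list(scores)
    for i, v in enumerate(scores):
        neighbours = scores[max(0, i - window):i] + scores[i + 1:i + 1 + window]
        if not neighbours:
            continue
        mean = sum(neighbours) / len(neighbours)
        if abs(v - mean) > ratio_limit * max(abs(mean), 1e-9):
            out[i] = mean                   # treat as a fluctuation
    return out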
How the probabilities are computed in the relevant and irrelevant models:
The following sets out concretely how the query-relevant model $M_r$ and the query-irrelevant model $M_{\bar r}$ generate the probability of a keyword. In the irrelevant model $M_{\bar r}$, the probability of the keyword $w$ can be obtained from the collection language model $M_C$: in a language model built from the whole document collection, the documents relevant to the query carry only a very small weight, while query-irrelevant content occupies the principal part.
Given the keyword $w$ and the whole document collection $C$, let $tf(w, C)$ denote the number of times $w$ appears in the documents of the collection and $|C|$ the total number of words in the whole collection (counting repetitions). This work therefore estimates the probability of the keyword $w$ being generated by the irrelevant model as

$P(w \mid M_{\bar r}) = \frac{tf(w, C)}{|C|}$  (Formula 3)

On the other hand, estimating the parameters of the query-relevant model $M_r$ properly would require a sample set highly relevant to the query, and such sample information may be hard to find. To address this problem, we assume that the query-relevant model $M_r$ can be built from the documents retrieved for the query.
Concretely, this is implemented in the following steps:
Query with $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$ (the retrieval here can be realized with software having search functionality, such as Lucene); every document produced by this step contains all or most of the query keywords. In addition, each document $d$ in $R$ has a probability $P(q \mid d)$ of being retrieved under the condition of $d$ and $q$.
Compute the probability $P(w \mid d)$ of the keyword given the document $d$, smoothed with the language model $M_C$ built from the whole collection:

$P(w \mid d) = \lambda \frac{tf(w, d)}{|d|} + (1 - \lambda) P(w \mid M_C)$  (Formula 4)

where $tf(w, d)$ is the number of times $w$ appears in the document $d$, $|d|$ is the length of the chosen relevant document $d$, and the parameter $\lambda$ controls the influence of term frequency on this probability, a common device in natural language models. $P(w \mid M_r)$ is then approximated by accumulating $P(w \mid d)\,P(q \mid d)$ over the retrieved documents:

$P(w \mid M_r) \approx \sum_{d \in R} P(w \mid d)\, P(q \mid d)$  (Formula 5)

Once the probabilities of generating the keyword $w$ under $M_r$ and $M_{\bar r}$ have been computed, the preceding algorithm can output the text snippets relevant to the query.
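A sketch of formula 5, assuming the retrieval probabilities P(q|d) are available from the search step (here passed in as (doc_tokens, score) pairs; any normalization of the scores is left to the caller):

def relevant_model_prob(w, retrieved, collection_prob, lam=0.9):
    # Formula 5: P(w | Mr) ~= sum over retrieved d of P(w | d) * P(q | d),
    # with P(w | d) given by the smoothed estimate of formula 4.
    total = 0.0
    for doc_tokens, p_q_given_d in retrieved:
        p_w_given_d = (lam * doc_tokens.count(w) / len(doc_tokens)
                       + (1 - lam) * collection_prob)
        total += p_w_given_d * p_q_given_d
    return total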
Advantages shown by the technique in a concrete implementation:
To better demonstrate the advantages of the automatic summarization technique we propose, the data used in the test phase come from TREC DOE (U.S. Department of Energy), an industry-standard test collection. Testing our algorithm and related algorithms on this standard data set shows that ours is clearly better than the others in both summary accuracy and anti-interference capability.
The data set contains 226,087 documents and 80 benchmark query statements. Each document is a publication abstract submitted to DOE, with an average length of 149 words; among these 226,087 abstract documents, 2,352 summaries are relevant to the 80 query statements.
Before processing, the documents are first combined with one another to form longer documents with relevant and irrelevant parts; these synthetic documents are used to test the system. They are classified as follows:
(a) the simple relevant data fragment set;
(b) the compound relevant data fragment set;
(c) the simple relevant data fragment set containing noise (the set with interfering data).
Simple data fragment set (S1): each document in this set is a new document synthesized from one relevant document summary and several randomly chosen independent summaries.
Compound data fragment set (S2): each document in this set is generated by combining two to five relevant documents with several random irrelevant documents, the two to five relevant documents all being relevant to the same query. Of the 80 benchmark queries only 54 are inter-related in this way, so only those 54 are used in this test.
Simple relevant data fragment set with interference (S3): this data set adds interfering information to the first set. For every document in the S1 set, partial phrases of the query statement are inserted into the irrelevant document sections as interference. This process better simulates real conditions.
For the identified document fragments, precision $p$ and recall $r$ are computed as follows: define $l_{true}$ as the length of the true summary, $l_{ext}$ as the length of the extracted document fragment, and $l_{overlap}$ as the length of the overlap between the true summary and the extracted fragment. Then

$p = \frac{l_{overlap}}{l_{ext}}, \qquad r = \frac{l_{overlap}}{l_{true}}$

In addition, this work uses the harmonic measure

$F = \frac{2pr}{p + r}$

when this value is high, both the precision $p$ and the recall $r$ are high.
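Expressed directly in code (a minimal sketch, with the true and extracted fragments given as (start, end) offset pairs, end-exclusive):

def prf(true_span, ext_span):
    # Overlap length drives all three scores.
    overlap = max(0, min(true_span[1], ext_span[1])
                     - max(true_span[0], ext_span[0]))
    p = overlap / (ext_span[1] - ext_span[0])    # precision
    r = overlap / (true_span[1] - true_span[0])  # recall
    f = 2 * p * r / (p + r) if p + r else 0.0    # F-measure
    return p, r, f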
Besides demonstrating the strength of our algorithm, two other abstract extraction algorithms were compared under the same test data. The first is the FL-Win algorithm: given an estimated summary size k, it examines groups of k consecutive words, measures each group's relevance to the query, and chooses the most relevant one as the summary information. In this experiment the window size is 149, the average true length of the text fragments. When the optimal summary length is not given, this algorithm must first estimate the summary length and assume that the estimate is optimal; when a good average summary length is available the method performs well, but under real implementation conditions the best summary length is hard to obtain in advance.
The second is the Cos-Win method, which can extract data fragments of varying length as summaries according to different queries. It compares the similarity between each text fragment and the query, and elects the fragment with the highest similarity as the summary information.
On top of these two basic methods, the text segmentation algorithms compute at every 25th word to reduce computational complexity. In the Cos-Win method, under the same query conditions, a representative text fragment of between 50 and 600 words is selected from the candidate fragments as the initial summary information. After the similarity of each initial fragment is computed, the fragment is extended by 25 words and the similarity of the new document fragment is computed again, searching for the text fragment with the maximum similarity value.
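A sketch of that growing-window search. The similarity function is not spelled out in the text above, so the bag-of-words cosine used here is an assumption suggested by the method's name:

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cos_win(words, query_terms, lo=50, hi=600, step=25):
    # Grow candidate windows in 25-word steps, keeping the fragment most
    # similar to the query.
    q = Counter(query_terms)
    best, best_sim = None, -1.0
    for start in range(0, max(1, len(words) - lo + 1), step):
        for length in range(lo, min(hi, len(words) - start) + 1, step):
            frag = Counter(words[start:start + length])
            sim = cosine(frag, q)
            if sim > best_sim:
                best, best_sim = (start, start + length), sim
    return best, best_sim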
In the experimental comparison stage, the algorithm we propose and the two basic digest extraction algorithms were first compared on the same data (data set S1). The experimental results are shown in Table 1. In the table, WSA is best on recall and on the F-measure, and WSA is higher on most precision figures, the exception being the comparison with FL-Win: because WSA is designed to output longer document fragments, its precision declines somewhat, but its recall improves significantly. Table 1 also shows that WSA's F value generally reaches 0.799, reflecting good execution results of WSA.
Table 1: comparison of the three abstract extraction algorithms (rendered as an image in the original publication; the numeric values are not reproduced here).
Likewise, the experimental data of Table 1 show that WSA also has good execution results on the S2 data set.
In the anti-interference test, the experimental data we adopt contain documents both relevant and irrelevant to the query, again to better simulate summary generation under real conditions. In document set S3, only one fragment per document is a truly relevant summary; the rest is interference, and each irrelevant document fragment serving as interference contains part of the query statement's text. Fig. 3a illustrates how the three methods execute on data fragments of different lengths. In the figure, the tested text fragments are grouped by length: specifically, 0-50, 51-100, 101-200, 201-300, 301-400 and 401-500 words, with the x-axis representing the length of the text fragment. Comparing with Fig. 3b, one observes that WSA's results with interfering data are roughly consistent with its results without interference, whereas the F values of FL-Win and Cos-Win decline markedly because of the interfering data. The figures show that WSA not only extracts accurate summary information in the general case but also performs well when interfering information is present, because in noisy documents WSA can discount passages that superficially appear query-relevant but are in fact irrelevant.
Parameters used in the experiments:
The language model composing WSA has two main parameters: the first is the size $|R|$ of the retrieved relevant document set, and the second is the smoothing parameter $\lambda$. To determine the size of $|R|$, the F value was tested with different numbers of relevant documents (see Fig. 4); as can be seen from the figure, the execution result tends to be stable when the relevant document set contains between 5 and 40 documents. With fewer than 5 documents, not enough document information is provided, while with more than 50 documents too much interfering information is introduced. In the experiments, unless specially indicated, 15 documents are used to compute the relevance language model.
In the parameter estimation of a language model, smoothing is very important: it guarantees that no keyword with zero probability occurs. In the experiments of this work, the maximum F value is obtained when $\lambda$ is set to 0.9, i.e. when almost no smoothing is performed, giving WSA its most satisfactory execution effect. This is because in our experiments the query-relevant language model is built from the query-relevant documents, so for most keywords the zero-probability situation does not arise, and smoothing is naturally not needed to eliminate it.

Claims (5)

1. An automatic personalized abstracting method in a digital library system, characterized by comprising the following steps:
a. Input query information; the query information comprises keywords and the user's personalized information;
b. Build a relevant model and an irrelevant model from the input query information; the relevant model is the probability distribution function of the natural language model of the query information over the relevant documents, the relevant documents being the top 5-50 documents obtained by querying the digital library system with the keywords;
The irrelevant model is the complementary probability distribution function of the relevant model, i.e. the probability distribution function of the natural language model of the query information over the irrelevant documents, the irrelevant documents being the whole document collection in the digital library system;
c. For each keyword of the document whose summary information is to be obtained, compute the probability of the keyword being generated under the relevant and irrelevant models, and take the relevant-model probability minus the irrelevant-model probability as the relevance of the keyword to the query information;
d. Save the relevance of each keyword into a queue and smooth the queue;
e. Select a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total relevance as one document abstract, put it into the abstract data set, and delete that fragment from the queue;
f. Decide from a threshold whether to continue searching for the next document abstract;
g. If so, repeat step e; otherwise return all document fragments in the abstract data set as the summary information.
2. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that in step c, computing the probability of the keyword being generated under the relevant and irrelevant models specifically comprises the following. The method for the probability of the keyword under the irrelevant model is: given the keyword $w$ and the whole document collection $C$, let $tf(w, C)$ denote the number of times the keyword $w$ appears in the documents of the collection and $|C|$ the total number of words in the whole collection; the probability of the keyword $w$ being generated by the irrelevant model is then

$P(w \mid M_{\bar r}) = P(w \mid M_C) = \frac{tf(w, C)}{|C|}$

where $M_C$ denotes the language model built from all documents in the whole digital library system; $M_{\bar r}$ denotes the irrelevant model; $P(w \mid M_{\bar r})$ denotes the probability of the keyword $w$ being generated in the irrelevant model $M_{\bar r}$; and $P(w \mid M_C)$ denotes the probability of the keyword $w$ in the language model $M_C$ built over the whole document collection.
The steps of the method for the probability of the keyword under the relevant model comprise:
1) Query with $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$; each document $d$ in $R$ has a probability $P(q \mid d)$ of being retrieved under the condition of $d$ and $q$. Compute the probability $P(w \mid d)$ of the keyword given the document $d$, smoothed with the language model $M_C$ built from the whole collection, where $P(w \mid d)$ is computed by the following formula:

$P(w \mid d) = \lambda \frac{tf(w, d)}{|d|} + (1 - \lambda) P(w \mid M_C)$

where $P(w \mid d)$ denotes the probability of the keyword $w$ being generated in the given document $d$;
2) Approximate $P(w \mid M_r)$ by computing $\sum_{d \in R} P(w \mid d) P(q \mid d)$; this probability is

$P(w \mid M_r) \approx \sum_{d \in R} P(w \mid d)\, P(q \mid d)$

where $P(w \mid M_r)$ denotes the probability of the keyword $w$ being generated in the relevant model $M_r$, and $P(w \mid d)$ denotes the probability of the keyword $w$ being generated in the relevant document $d$; $tf(w, d)$ is the number of times $w$ appears in the document $d$, $|d|$ is the length of the chosen relevant document $d$, and the parameter $\lambda$ controls the influence of term frequency on this probability, a common device in natural language models.
3. The automatic personalized abstracting method in a digital library system according to claim 1 or 2, characterized in that in step d, smoothing the queue specifically means: after computing the relevance to the query information of each keyword in the document whose summary information is to be obtained, if a keyword's relevance is much higher or much lower than that of the ten keywords before and after it, the current keyword is regarded as a large fluctuation and is removed before the computation.
4. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that in step c, the relevant-model probability minus the irrelevant-model probability is taken as the relevance of the keyword to the query information, and the relevance is distributed in [-1, 1].
5. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that in step a, the user's personalized information refers to the user's historical browsing data or the personal preference information the user has previously used in the digital library system.
CN 201110213750 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system Expired - Fee Related CN102222119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110213750 CN102222119B (en) 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110213750 CN102222119B (en) 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system

Publications (2)

Publication Number Publication Date
CN102222119A CN102222119A (en) 2011-10-19
CN102222119B true CN102222119B (en) 2013-04-17

Family

ID=44778671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110213750 Expired - Fee Related CN102222119B (en) 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system

Country Status (1)

Country Link
CN (1) CN102222119B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9997157B2 (en) * 2014-05-16 2018-06-12 Microsoft Technology Licensing, Llc Knowledge source personalization to improve language models
CN105824915A (en) * 2016-03-16 2016-08-03 上海珍岛信息技术有限公司 Method and system for generating commenting digest of online shopped product
CN107766419B (en) * 2017-09-08 2021-08-31 广州汪汪信息技术有限公司 Threshold denoising-based TextRank document summarization method and device
CN115098667B (en) * 2022-08-25 2023-01-03 北京聆心智能科技有限公司 Abstract generation method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003012661A1 (en) * 2001-07-31 2003-02-13 Invention Machine Corporation Computer based summarization of natural language documents
JP4250024B2 (en) * 2003-05-23 2009-04-08 日本電信電話株式会社 Text summarization device and text summarization program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003012661A1 (en) * 2001-07-31 2003-02-13 Invention Machine Corporation Computer based summarization of natural language documents
JP4250024B2 (en) * 2003-05-23 2009-04-08 日本電信電話株式会社 Text summarization device and text summarization program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Hai et al. Query-oriented automatic summarization based on genetic algorithms. Microcomputer Information, 2009, Vol. 25, No. 28, pp. 23-25. *

Also Published As

Publication number Publication date
CN102222119A (en) 2011-10-19


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130417

Termination date: 20140728

EXPY Termination of patent right or utility model