
CN102222119B - Automatic personalized abstracting method in digital library system - Google Patents


Info

Publication number
CN102222119B
CN102222119B · Application CN201110213750A
Authority
CN
China
Prior art keywords
document
correlation
model
key word
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110213750
Other languages
Chinese (zh)
Other versions
CN102222119A (en)
Inventor
李庆 (Li Qing)
刘家芬 (Liu Jiafen)
罗旭斌 (Luo Xubin)
张晨 (Zhang Chen)
胡川 (Hu Chuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD
Original Assignee
CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD filed Critical CHENGDU XICHUANG ZHANGZHONG TECHNOLOGY CO LTD
Priority to CN 201110213750 priority Critical patent/CN102222119B/en
Publication of CN102222119A publication Critical patent/CN102222119A/en
Application granted granted Critical
Publication of CN102222119B publication Critical patent/CN102222119B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic personalized abstracting method in a digital library system, in the technical field of information processing. The method comprises the following steps: a, inputting query information; b, building a relevant model and an irrelevant model from the input query information; c, for each word in the document whose summary information is to be obtained, computing the probability of the word being generated by the relevant and irrelevant models; d, saving the relevance of each keyword into a queue; e, selecting a series of consecutive keywords in the queue and summing their relevance, the document fragment with the highest relevance serving as one piece of the document abstract; f, deciding from a threshold whether to continue looking for the next abstract fragment; g, if so, repeating step e; if not, returning all documents in the abstract data set as the summary information. Compared with conventional summarization algorithms, the article abstracts obtained by this method are more accurate, and under conditions simulating real data the method has strong anti-interference capability.

Description

Automatic personalized abstracting method in a digital library system
Technical field
The present invention relates to the technical field of information processing, and specifically to an automatic personalized abstracting method in a digital library system.
Background technology
Query-based automatic summarization returns, for a given document, one or more summary fragments associated with the query: after a text collection is built or updated, each document is automatically divided into a number of discrete summary fragments.
In current automatic summarization, one method estimates the summary length in advance from documents related to the current one; once the approximate size of the document summary is known, it searches for the fragment of the designated length that best matches the query and returns it as the article abstract.
Another method first cuts the document into one or more semantic information blocks by preprocessing. Once the semantic blocks are fixed, the relevance between the query statement and each block is computed, and the block with the highest relevance to the query that can cover the document's main information is selected as the document summary.
However, in the first method the length of the summary is difficult to determine in advance. In the second method, preprocessing fixes the start and end positions of the summary; if the document's main information is spread over several different segments after preprocessing, the extracted summary covers the document's main information poorly. For example, a document can be cut into non-overlapping fragments, but such cutting has a latent problem: when the best document summary needs to cover the content of two adjacent segments, the preprocessing has already separated those fragments, and the automatically extracted summary is incomplete.
The Chinese patent document with publication number CN 101231634, published on July 30, 2008, discloses a method of automatically extracting multi-document abstracts using graph partitioning, comprising the following steps: performing sentence boundary segmentation and representing each document by the segmented sentences; converting the sentences into vectors, computing the pairwise similarity between sentences to form a sentence incidence matrix, reducing the matrix by a specified threshold and normalizing it; introducing latent topic structure mining into multi-document summarization, dividing the document set into different latent sub-topics so that the summarization task becomes the selection and extraction of sub-topics; the graph partitioning method guarantees the importance of the sub-topic containing each sentence from the global perspective while guaranteeing low content redundancy between different sub-topics from the local perspective, thereby effectively improving summary quality.
However, the prior art represented by the above patent document still has the following technical problem: the CN 101231634 patent determines weight vectors by sentence, so the summary is segmented by sentence, and the summary extracted in this way covers the document's main information poorly.
Summary of the invention
To solve the above technical problems, the present invention proposes an automatic personalized abstracting method in a digital library system. This method solves the technical problem in the above prior art that "the extracted summary covers the document's main information poorly". Furthermore, it does not fix the length of the summary, so summary information can be obtained flexibly; when extracting a document summary it can judge well the relevance between a document fragment and the query; the extracted summary has strong anti-interference capability; and the article abstracts obtained with this method are more accurate than those obtained with traditional summarization algorithms.
The present invention is realized by the following technical scheme:
An automatic personalized abstracting method in a digital library system, characterized by comprising the following steps:
a. Input query information; the query information comprises keywords and the user's personalized information;
b. Build a relevant model and an irrelevant model from the input query information. The relevant model is the probability distribution function of the natural language model of the query statement; the digital library system is queried with the keywords and the top 5-50 documents are retrieved;
The irrelevant model is the complementary probability distribution function of the relevant model, built over the whole document collection in the digital library system;
Because, in a language model built from the whole document collection, the documents relevant to the query carry only a very small weight while query-irrelevant content dominates, the whole collection can be used to build the irrelevant model.
c. For each word in the document whose summary information is to be obtained, compute the probability of the word being generated under the relevant and irrelevant models, and take the relevant-model probability minus the irrelevant-model probability as the relevance of the word to the query information;
d. Save the relevance of each keyword into a queue and smooth the queue;
e. Select a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total relevance as one document abstract, put it into the abstract data set, and delete that fragment from the queue;
f. Decide from a threshold whether to continue searching for the next abstract fragment;
g. If so, repeat step e; otherwise return all documents in the abstract data set as the summary information. The overall flow of these steps is sketched below.
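For illustration only, the flow of steps a-g can be sketched in Python as follows. This is a minimal sketch, not the invention's implementation: build_models, prob, smooth, extract_max_fragment and mask are hypothetical helpers standing in for the operations the steps describe.

def personalized_abstract(document, query, theta, k_top=15):
    # Steps a-b: relevant model from the top documents retrieved for the
    # query, irrelevant model from the whole collection (hypothetical helper).
    m_rel, m_irrel = build_models(query, k_top)
    # Steps c-d: relevance of each word is P(w|Mr) - P(w|M~r), held in a queue.
    scores = [prob(w, m_rel) - prob(w, m_irrel) for w in document]
    scores = smooth(scores)                      # remove large fluctuations
    # Steps e-g: repeatedly take the best fragment until the ratio test fails.
    summary, h_prev = [], None
    while True:
        start, end, h_cur = extract_max_fragment(scores)
        if h_prev is not None and h_prev / h_cur >= theta:
            break                                # step f: discard and stop
        summary.append(document[start:end + 1])  # step e: keep the fragment
        mask(scores, start, end)                 # delete it from the queue
        h_prev = h_cur
    return summary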
In step c, computing the probability of the word being generated under the relevant and irrelevant models specifically comprises the following. The method for the probability of the word under the irrelevant model is: given the keyword $w$ and the whole document collection $C$, let $tf(w, C)$ denote the number of times the keyword $w$ appears in the documents of the collection and $|C|$ the total number of words in the whole collection; the probability of the keyword $w$ being generated by the irrelevant model is then

$P(w \mid M_{\bar r}) = \frac{tf(w, C)}{|C|}$

The steps of the method for the probability of the word under the relevant model are:
1) Query with $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$; each document $d$ in $R$ has a probability $P(q \mid d)$ of being retrieved under the condition of $d$. Compute the probability $P(w \mid d)$ of the keyword given the document $d$, smoothed with the language model $M_C$ built from the whole collection, where $P(w \mid d)$ is computed by the following formula:

$P(w \mid d) = \lambda \frac{tf(w, d)}{|d|} + (1 - \lambda) P(w \mid M_C)$

2) Approximate $P(w \mid M_r)$ by computing $\sum_{d \in R} P(w \mid d) P(q \mid d)$; this probability is

$P(w \mid M_r) \approx \sum_{d \in R} P(w \mid d)\, P(q \mid d)$

where $tf(w, d)$ is the number of times $w$ appears in the document $d$, $|d|$ is the length of the chosen relevant document $d$, and the parameter $\lambda$ controls the influence of term frequency on this probability, a common device in natural language models.
In step d, smoothing the queue specifically means: after computing the relevance to the query information of each word in the document to be summarized, if a word's relevance is much higher or much lower than that of the ten words before and after it, the current word is regarded as a large fluctuation and is removed before the computation.
In step f, deciding from the threshold whether to continue searching for the next abstract fragment specifically means: a threshold value is preset; if the total relevance of the previously extracted summary fragment divided by the total relevance of the currently extracted fragment is less than the set threshold, the current fragment is kept and step e is repeated; if it is greater than the set threshold, the current fragment is discarded, the extraction algorithm ends, and all documents in the abstract data set are returned as the summary information.
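Expressed as a minimal check (the variable names are illustrative):

def keep_extracting(h_prev, h_cur, theta):
    # Step f: continue while Hprev / Hcur < theta. For example, with
    # theta = 2.0, extraction stops as soon as a newly found fragment
    # scores less than half the previously kept one.
    return h_prev / h_cur < theta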
In step c, the relevant-model probability minus the irrelevant-model probability is taken as the relevance of the word to the query information; the relevance is distributed in [-1, 1].
In step a, the user's personalized information refers to the user's historical browsing data or the personal preference information the user has previously used in the digital library system.
Compared with the prior art, the technical effects achieved by the present invention are as follows:
The technical scheme formed by steps a-g solves the technical problem in the above prior art that "the extracted summary covers the document's main information poorly". Moreover, it does not fix the length of the summary, so summary information can be obtained flexibly; when extracting a document summary it judges well the relevance between document fragments and the query; the extracted summary resists interference strongly; and the article abstracts obtained with this method are more accurate than those obtained with traditional summarization algorithms. In particular:
When personalized information is input in step a, personalized automatic summarization is realized.
In step b, the relevant model is built from the top 5-50 retrieved documents; experiments (see the embodiment section below) show extremely strong anti-interference capability relative to the prior art.
In step c, subtracting the irrelevant-model probability from the relevant-model probability as the word's relevance to the query information eliminates error.
In step d, the queue is smoothed: after computing the relevance to the query of each word in the document to be summarized, any word whose relevance is much higher or much lower than that of the ten words before and after it is treated as a large fluctuation and removed before the computation. Smoothing each relevance score in this way reduces the influence on summary accuracy of auxiliary words, articles and other words that are unrelated to the query but occur frequently.
In step e, the consecutive keyword run with the highest combined relevance is chosen from the queue; that is, the algorithm finds the passage of the document most closely related to the query statement, so the extracted summary information is tied as tightly as possible to the query information.
In step f, the threshold setting controls how much summary information is extracted.
Description of drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments, in which:
Fig. 1 is the processing flow chart of automatic summarization;
Fig. 2 is a schematic diagram of the relevance between the words w of a document d and the query;
Fig. 2a is a schematic diagram of the document abstract extraction algorithm;
Fig. 3a shows the accuracy of three automatic summary generation methods executed under noisy conditions;
Fig. 3b shows the accuracy of the three methods executed without noise;
Fig. 4 is a graph of F-measure against the size of the relevant document set.
Embodiment
Embodiment 1
The basic embodiment of the present invention comprises the following steps:
a. Input query information; the query information comprises keywords and the user's personalized information;
b. Build a relevant model and an irrelevant model from the input query information. The relevant model is the probability distribution function of the natural language model of the query statement; the digital library system is queried with the keywords and the top 5-50 documents are retrieved;
The irrelevant model is the complementary probability distribution function of the relevant model, built over the whole document collection in the digital library system;
Because, in a language model built from the whole document collection, the documents relevant to the query carry only a very small weight while query-irrelevant content dominates, the whole collection can be used to build the irrelevant model.
c. For each word in the document whose summary information is to be obtained, compute the probability of the word being generated under the relevant and irrelevant models, and take the relevant-model probability minus the irrelevant-model probability as the relevance of the word to the query information;
d. Save the relevance of each keyword into a queue and smooth the queue;
e. Select a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total relevance as one document abstract, put it into the abstract data set, and delete that fragment from the queue;
f. Decide from the threshold whether to continue searching for the next abstract fragment;
g. If so, repeat step e; otherwise return all documents in the abstract data set as the summary information.
With the technical scheme formed by the above steps, the accuracy of the extracted abstracts on a standard data set is higher than that of abstracts obtained with traditional summarization algorithms, and under simulated real data the proposed algorithm has very strong anti-interference capability.
Embodiment 2
The preferred embodiment of the present invention is described in detail below:
The automatic summarization system we introduce applies language-modeling techniques to the processing and weighting of documents, and weights sentences by word-frequency statistics. The processing flow of the automatic summarization is shown in Fig. 1: after the user inputs query information, the relevant and irrelevant models are built from the input query, and a personalized summary is produced through the sentence analysis model (Word Sequence Analysis, WSA for short).
The abstract extraction process using the relevant and irrelevant models:
Our summary extraction is based on statistical language models. In our research we construct two language models: a relevant model, defined as $M_r$, and an irrelevant model, defined as $M_{\bar r}$. The relevant model $M_r$ is the probability distribution function of the natural language model of the query statement; conversely, the irrelevant model $M_{\bar r}$ is the complementary probability distribution function of that language model. Using conditional probability in natural language processing, for each keyword $w$ we compute the probability $P(w \mid M_r)$ that it is generated by $M_r$, and likewise the probability $P(w \mid M_{\bar r})$ that it is generated by $M_{\bar r}$; the precise definitions of both quantities are set out later.
A document $d$ is represented by a series of keywords $w_1, w_2, \ldots, w_n$, where $n$ is the number of keywords in $d$. In Fig. 2 the x-axis represents the keyword queue and the y-axis the variation of each keyword's relevance to the query. The best summary information of the document is then the combination of consecutive keywords with the highest relevance, as in Fig. 2. The relevance of each keyword $w$ of document $d$ to the query is quantified as

$c(w) = P(w \mid M_r) - P(w \mid M_{\bar r})$  (Formula 1)

Subtracting the probability of the keyword under the irrelevant model from its probability under the relevant model serves to eliminate error; this step could of course be handled in other forms, but the experiments in this work show this treatment to be very effective, and with it the distribution of the quantity falls within $[-1, 1]$.
As can be seen intuitively from the figure, when $w$ is relevant to the query its quantified value is positive, and when it is irrelevant the value is negative. Therefore, the interval $[s, t]$ on the x-axis that best embodies the document's summary information is the one satisfying

$[s, t] = \arg\max_{s,t} \sum_{i=s}^{t} c(w_i)$  (Formula 2)

and the corresponding document summary is $w_s, \ldots, w_t$.
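Because word relevances can be positive or negative, finding the interval [s, t] that maximizes the sum in formula 2 is the classic maximum-subarray problem. A sketch using Kadane's algorithm follows; the patent does not name the algorithm it uses for this step, so this is only one plausible realization of getMaxInterval below:

def get_max_interval(c):
    # Return (s, t, best_sum): the contiguous interval of the relevance
    # queue c with the maximal sum (formula 2).
    best_sum, best_s, best_t = float("-inf"), 0, 0
    cur_sum, cur_s = 0.0, 0
    for i, v in enumerate(c):
        if cur_sum <= 0:
            cur_sum, cur_s = v, i          # start a new interval at i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum, best_s, best_t = cur_sum, cur_s, i
    return best_s, best_t, best_sum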
The extraction algorithm of the document summary:
Input: document d, query q, relevance threshold θ
Output: relevant snippet set S
Step 1: represent d as a word sequence and store the relevance scores in a queue
1.1 a = null;                        // a is a queue storing the relevance of the document's keywords
1.2 for each word w ∈ d do {
        P(w|Mr)  = getRMProb(w, q);  // probability of keyword w under the relevant model for query q
        P(w|M~r) = getIRMProb(w, q); // probability of keyword w under the irrelevant model
        c = P(w|Mr) - P(w|M~r);      // relevance of the keyword to the query
        a.add(c);
    }
Step 2: identification of relevant snippets
2.1 S = null; Scur = null; Sprev = null; // S stores the set of extracted summary fragments; Scur, Sprev each hold one summary temporarily
2.2 a.smooth();                      // smoothing
2.3 Scur = a.getMaxInterval();       // the fragment with the highest combined keyword relevance, stored in Scur as a summary
2.4 Hcur = Scur.sum();               // the total relevance of Scur, saved in Hcur
2.5 do {                             // the threshold θ decides how many summaries are extracted and saved in S
        S.add(Scur);                 // keep the current summary
        a.maskSubqueue(Scur);        // delete it from the queue
        Sprev = Scur; Hprev = Hcur;
        Scur = a.getMaxInterval();   // best fragment among the remaining words
        Hcur = Scur.sum();
    } while (Hprev / Hcur < θ);      // stop once the relevance ratio reaches θ
2.6 return S;
Algorithm description: d is the document collection to be queried, q is the query statement (or the user's personalized information), and the threshold θ controls how many summary fragments are extracted. From the computed relevance of each keyword to the query, the algorithm automatically extracts the summaries that can express the document's information, and judges from the size of the threshold how many summaries to return.
Step 1: for each word w, compute P(w|Mr) - P(w|M~r) = c and deposit the values in the queue a.
Step 2: after smoothing a, find the document fragment Scur with the maximal summed c-value in the queue a and save it into the set S, at the same time saving the summed c-value of Scur in Hcur. Then apply the same processing to the remaining words w in a, finding the maximal Scur and Hcur among the remaining document fragments, and keep saving each Scur into S until the previous Hprev divided by the current Hcur reaches the threshold θ.
The execution of the algorithm is illustrated in Fig. 2a.
Two points in the whole algorithm described in pseudocode deserve attention. First, before the maximal interval is computed, some larger fluctuations are eliminated by smoothing. Concretely, the program compares the query relevance of each keyword in the document with that of the ten keywords before and after it; if the current keyword's relevance to the query is much higher or much lower than that of those neighbours, the current keyword is regarded as a large fluctuation and is removed before the computation. This step is the a.smooth() call in the algorithm.
Second, this algorithm can extract several relevant text snippets. To realize this, the algorithm repeatedly extracts a document summary and masks the corresponding remaining text fragment, recording the relevance of each extracted summary, and keeps extracting text snippets as long as the ratio of the relevance scores of the two most recently extracted summaries stays below the threshold θ.
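A sketch of that smoothing step: the ±10-word window comes from the patent, while the deviation criterion and the replacement of an outlier by its local mean are illustrative assumptions:

def smooth(scores, window=10, ratio_limit=3.0):
    # Neutralize words whose relevance deviates strongly from the mean of
    # the ten words before and after them.
    out = list(scores)
    for i, v in enumerate(scores):
        neighbours = scores[max(0, i - window):i] + scores[i + 1:i + 1 + window]
        if not neighbours:
            continue
        mean = sum(neighbours) / len(neighbours)
        if abs(v - mean) > ratio_limit * max(abs(mean), 1e-9):
            out[i] = mean                   # treat as a fluctuation
    return out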
How the probabilities are computed in the relevant and irrelevant models:
The following sets out concretely how the query-relevant model $M_r$ and the query-irrelevant model $M_{\bar r}$ generate the probability of a keyword. In the irrelevant model $M_{\bar r}$, the probability of the keyword $w$ can be obtained from the collection language model $M_C$: in a language model built from the whole document collection, the documents relevant to the query carry only a very small weight, while query-irrelevant content occupies the principal part.
Given the keyword $w$ and the whole document collection $C$, let $tf(w, C)$ denote the number of times $w$ appears in the documents of the collection and $|C|$ the total number of words in the whole collection (counting repetitions). This work therefore estimates the probability of the keyword $w$ being generated by the irrelevant model as

$P(w \mid M_{\bar r}) = \frac{tf(w, C)}{|C|}$  (Formula 3)

On the other hand, estimating the parameters of the query-relevant model $M_r$ properly would require a sample set highly relevant to the query, and such sample information may be hard to find. To address this problem, we assume that the query-relevant model $M_r$ can be built from the documents retrieved for the query.
Concretely, this is implemented in the following steps:
Query with $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$ (the retrieval here can be realized with software having search functionality, such as Lucene); every document produced by this step contains all or most of the query keywords. In addition, each document $d$ in $R$ has a probability $P(q \mid d)$ of being retrieved under the condition of $d$ and $q$.
Compute the probability $P(w \mid d)$ of the keyword given the document $d$, smoothed with the language model $M_C$ built from the whole collection:

$P(w \mid d) = \lambda \frac{tf(w, d)}{|d|} + (1 - \lambda) P(w \mid M_C)$  (Formula 4)

where $tf(w, d)$ is the number of times $w$ appears in the document $d$, $|d|$ is the length of the chosen relevant document $d$, and the parameter $\lambda$ controls the influence of term frequency on this probability, a common device in natural language models. $P(w \mid M_r)$ is then approximated by accumulating $P(w \mid d)\,P(q \mid d)$ over the retrieved documents:

$P(w \mid M_r) \approx \sum_{d \in R} P(w \mid d)\, P(q \mid d)$  (Formula 5)

Once the probabilities of generating the keyword $w$ under $M_r$ and $M_{\bar r}$ have been computed, the preceding algorithm can output the text snippets relevant to the query.
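A sketch of formula 5, assuming the retrieval probabilities P(q|d) are available from the search step (here passed in as (doc_tokens, score) pairs; any normalization of the scores is left to the caller):

def relevant_model_prob(w, retrieved, collection_prob, lam=0.9):
    # Formula 5: P(w | Mr) ~= sum over retrieved d of P(w | d) * P(q | d),
    # with P(w | d) given by the smoothed estimate of formula 4.
    total = 0.0
    for doc_tokens, p_q_given_d in retrieved:
        p_w_given_d = (lam * doc_tokens.count(w) / len(doc_tokens)
                       + (1 - lam) * collection_prob)
        total += p_w_given_d * p_q_given_d
    return total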
Advantages shown by the technique in a concrete implementation:
To better demonstrate the advantages of the automatic summarization technique we propose, the data used in the test phase come from TREC DOE (U.S. Department of Energy), an industry-standard test collection. Testing our algorithm and related algorithms on this standard data set shows that ours is clearly better than the others in both summary accuracy and anti-interference capability.
The data set contains 226,087 documents and 80 benchmark query statements. Each document is a publication abstract submitted to DOE, with an average length of 149 words; among these 226,087 abstract documents, 2,352 summaries are relevant to the 80 query statements.
Before processing, the documents are first combined with one another to form longer documents with relevant and irrelevant parts; these synthetic documents are used to test the system. They are classified as follows:
(a) the simple relevant data fragment set;
(b) the compound relevant data fragment set;
(c) the simple relevant data fragment set containing noise (the set with interfering data).
Simple data fragment set (S1): each document in this set is a new document synthesized from one relevant document summary and several randomly chosen independent summaries.
Compound data fragment set (S2): each document in this set is generated by combining two to five relevant documents with several random irrelevant documents, the two to five relevant documents all being relevant to the same query. Of the 80 benchmark queries only 54 are inter-related in this way, so only those 54 are used in this test.
Simple relevant data fragment set with interference (S3): this data set adds interfering information to the first set. For every document in the S1 set, partial phrases of the query statement are inserted into the irrelevant document sections as interference. This process better simulates real conditions.
For the identified document fragments, precision $p$ and recall $r$ are computed as follows: define $l_{true}$ as the length of the true summary, $l_{ext}$ as the length of the extracted document fragment, and $l_{overlap}$ as the length of the overlap between the true summary and the extracted fragment. Then

$p = \frac{l_{overlap}}{l_{ext}}, \qquad r = \frac{l_{overlap}}{l_{true}}$

In addition, this work uses the harmonic measure

$F = \frac{2pr}{p + r}$

when this value is high, both the precision $p$ and the recall $r$ are high.
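Expressed directly in code (a minimal sketch, with the true and extracted fragments given as (start, end) offset pairs, end-exclusive):

def prf(true_span, ext_span):
    # Overlap length drives all three scores.
    overlap = max(0, min(true_span[1], ext_span[1])
                     - max(true_span[0], ext_span[0]))
    p = overlap / (ext_span[1] - ext_span[0])    # precision
    r = overlap / (true_span[1] - true_span[0])  # recall
    f = 2 * p * r / (p + r) if p + r else 0.0    # F-measure
    return p, r, f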
Besides demonstrating the strength of our algorithm, two other abstract extraction algorithms were compared under the same test data. The first is the FL-Win algorithm: given an estimated summary size k, it examines groups of k consecutive words, measures each group's relevance to the query, and chooses the most relevant one as the summary information. In this experiment the window size is 149, the average true length of the text fragments. When the optimal summary length is not given, this algorithm must first estimate the summary length and assume that the estimate is optimal; when a good average summary length is available the method performs well, but under real implementation conditions the best summary length is hard to obtain in advance.
The second is the Cos-Win method, which can extract data fragments of varying length as summaries according to different queries. It compares the similarity between each text fragment and the query, and elects the fragment with the highest similarity as the summary information.
On top of these two basic methods, the text segmentation algorithms compute at every 25th word to reduce computational complexity. In the Cos-Win method, under the same query conditions, a representative text fragment of between 50 and 600 words is selected from the candidate fragments as the initial summary information. After the similarity of each initial fragment is computed, the fragment is extended by 25 words and the similarity of the new document fragment is computed again, searching for the text fragment with the maximum similarity value.
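A sketch of that growing-window search. The similarity function is not spelled out in the text above, so the bag-of-words cosine used here is an assumption suggested by the method's name:

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cos_win(words, query_terms, lo=50, hi=600, step=25):
    # Grow candidate windows in 25-word steps, keeping the fragment most
    # similar to the query.
    q = Counter(query_terms)
    best, best_sim = None, -1.0
    for start in range(0, max(1, len(words) - lo + 1), step):
        for length in range(lo, min(hi, len(words) - start) + 1, step):
            frag = Counter(words[start:start + length])
            sim = cosine(frag, q)
            if sim > best_sim:
                best, best_sim = (start, start + length), sim
    return best, best_sim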
In the experimental comparison stage, the algorithm we propose and the two basic digest extraction algorithms were first compared on the same data (data set S1). The experimental results are shown in Table 1. In the table, WSA is best on recall and on the F-measure, and WSA is higher on most precision figures, the exception being the comparison with FL-Win: because WSA is designed to output longer document fragments, its precision declines somewhat, but its recall improves significantly. Table 1 also shows that WSA's F value generally reaches 0.799, reflecting good execution results of WSA.
Table 1: comparison of the three abstract extraction algorithms (rendered as an image in the original publication; the numeric values are not reproduced here).
Likewise, the experimental data of Table 1 show that WSA also has good execution results on the S2 data set.
In the anti-interference test, the experimental data we adopt contain documents both relevant and irrelevant to the query, again to better simulate summary generation under real conditions. In document set S3, only one fragment per document is a truly relevant summary; the rest is interference, and each irrelevant document fragment serving as interference contains part of the query statement's text. Fig. 3a illustrates how the three methods execute on data fragments of different lengths. In the figure, the tested text fragments are grouped by length: specifically, 0-50, 51-100, 101-200, 201-300, 301-400 and 401-500 words, with the x-axis representing the length of the text fragment. Comparing with Fig. 3b, one observes that WSA's results with interfering data are roughly consistent with its results without interference, whereas the F values of FL-Win and Cos-Win decline markedly because of the interfering data. The figures show that WSA not only extracts accurate summary information in the general case but also performs well when interfering information is present, because in noisy documents WSA can discount passages that superficially appear query-relevant but are in fact irrelevant.
Parameters used in the experiments:
The language model composing WSA has two main parameters: the first is the size $|R|$ of the retrieved relevant document set, and the second is the smoothing parameter $\lambda$. To determine the size of $|R|$, the F value was tested with different numbers of relevant documents (see Fig. 4); as can be seen from the figure, the execution result tends to be stable when the relevant document set contains between 5 and 40 documents. With fewer than 5 documents, not enough document information is provided, while with more than 50 documents too much interfering information is introduced. In the experiments, unless specially indicated, 15 documents are used to compute the relevance language model.
In the parameter estimation of a language model, smoothing is very important: it guarantees that no keyword with zero probability occurs. In the experiments of this work, the maximum F value is obtained when $\lambda$ is set to 0.9, i.e. when almost no smoothing is performed, giving WSA its most satisfactory execution effect. This is because in our experiments the query-relevant language model is built from the query-relevant documents, so for most keywords the zero-probability situation does not arise, and smoothing is naturally not needed to eliminate it.

Claims (5)

1. An automatic personalized abstracting method in a digital library system, characterized by comprising the following steps:
a. Input query information; the query information comprises keywords and the user's personalized information;
b. Build a relevant model and an irrelevant model from the input query information; the relevant model is the probability distribution function of the natural language model of the query information over the relevant documents, the relevant documents being the top 5-50 documents obtained by querying the digital library system with the keywords;
The irrelevant model is the complementary probability distribution function of the relevant model, i.e. the probability distribution function of the natural language model of the query information over the irrelevant documents, the irrelevant documents being the whole document collection in the digital library system;
c. For each keyword of the document whose summary information is to be obtained, compute the probability of the keyword being generated under the relevant and irrelevant models, and take the relevant-model probability minus the irrelevant-model probability as the relevance of the keyword to the query information;
d. Save the relevance of each keyword into a queue and smooth the queue;
e. Select a run of consecutive keywords in the queue and sum their relevance scores; take the document fragment with the highest total relevance as one document abstract, put it into the abstract data set, and delete that fragment from the queue;
f. Decide from a threshold whether to continue searching for the next document abstract;
g. If so, repeat step e; otherwise return all document fragments in the abstract data set as the summary information.
2. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that in step c, computing the probability of the keyword being generated under the relevant and irrelevant models specifically comprises the following. The method for the probability of the keyword under the irrelevant model is: given the keyword $w$ and the whole document collection $C$, let $tf(w, C)$ denote the number of times the keyword $w$ appears in the documents of the collection and $|C|$ the total number of words in the whole collection; the probability of the keyword $w$ being generated by the irrelevant model is then

$P(w \mid M_{\bar r}) = P(w \mid M_C) = \frac{tf(w, C)}{|C|}$

where $M_C$ denotes the language model built from all documents in the whole digital library system; $M_{\bar r}$ denotes the irrelevant model; $P(w \mid M_{\bar r})$ denotes the probability of the keyword $w$ being generated in the irrelevant model $M_{\bar r}$; and $P(w \mid M_C)$ denotes the probability of the keyword $w$ in the language model $M_C$ built over the whole document collection.
The steps of the method for the probability of the keyword under the relevant model comprise:
1) Query with $q$ to retrieve documents, and define the set of retrieved query-relevant documents as $R$; each document $d$ in $R$ has a probability $P(q \mid d)$ of being retrieved under the condition of $d$ and $q$. Compute the probability $P(w \mid d)$ of the keyword given the document $d$, smoothed with the language model $M_C$ built from the whole collection, where $P(w \mid d)$ is computed by the following formula:

$P(w \mid d) = \lambda \frac{tf(w, d)}{|d|} + (1 - \lambda) P(w \mid M_C)$

where $P(w \mid d)$ denotes the probability of the keyword $w$ being generated in the given document $d$;
2) Approximate $P(w \mid M_r)$ by computing $\sum_{d \in R} P(w \mid d) P(q \mid d)$; this probability is

$P(w \mid M_r) \approx \sum_{d \in R} P(w \mid d)\, P(q \mid d)$

where $P(w \mid M_r)$ denotes the probability of the keyword $w$ being generated in the relevant model $M_r$, and $P(w \mid d)$ denotes the probability of the keyword $w$ being generated in the relevant document $d$; $tf(w, d)$ is the number of times $w$ appears in the document $d$, $|d|$ is the length of the chosen relevant document $d$, and the parameter $\lambda$ controls the influence of term frequency on this probability, a common device in natural language models.
3. The automatic personalized abstracting method in a digital library system according to claim 1 or 2, characterized in that in step d, smoothing the queue specifically means: after computing the relevance to the query information of each keyword in the document whose summary information is to be obtained, if a keyword's relevance is much higher or much lower than that of the ten keywords before and after it, the current keyword is regarded as a large fluctuation and is removed before the computation.
4. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that in step c, the relevant-model probability minus the irrelevant-model probability is taken as the relevance of the keyword to the query information, and the relevance is distributed in [-1, 1].
5. The automatic personalized abstracting method in a digital library system according to claim 1, characterized in that in step a, the user's personalized information refers to the user's historical browsing data or the personal preference information the user has previously used in the digital library system.
CN 201110213750 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system Expired - Fee Related CN102222119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110213750 CN102222119B (en) 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110213750 CN102222119B (en) 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system

Publications (2)

Publication Number Publication Date
CN102222119A CN102222119A (en) 2011-10-19
CN102222119B true CN102222119B (en) 2013-04-17

Family

ID=44778671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110213750 Expired - Fee Related CN102222119B (en) 2011-07-28 2011-07-28 Automatic personalized abstracting method in digital library system

Country Status (1)

Country Link
CN (1) CN102222119B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9997157B2 (en) * 2014-05-16 2018-06-12 Microsoft Technology Licensing, Llc Knowledge source personalization to improve language models
CN105824915A (en) * 2016-03-16 2016-08-03 上海珍岛信息技术有限公司 Method and system for generating commenting digest of online shopped product
CN107766419B (en) * 2017-09-08 2021-08-31 广州汪汪信息技术有限公司 Threshold denoising-based TextRank document summarization method and device
CN115098667B (en) * 2022-08-25 2023-01-03 北京聆心智能科技有限公司 Abstract generation method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003012661A1 (en) * 2001-07-31 2003-02-13 Invention Machine Corporation Computer based summarization of natural language documents
JP4250024B2 (en) * 2003-05-23 2009-04-08 日本電信電話株式会社 Text summarization device and text summarization program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003012661A1 (en) * 2001-07-31 2003-02-13 Invention Machine Corporation Computer based summarization of natural language documents
JP4250024B2 (en) * 2003-05-23 2009-04-08 日本電信電話株式会社 Text summarization device and text summarization program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Hai et al. Query-oriented automatic summarization based on genetic algorithms. Microcomputer Information, 2009, Vol. 25, No. 28, pp. 23-25. *

Also Published As

Publication number Publication date
CN102222119A (en) 2011-10-19


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130417

Termination date: 20140728

EXPY Termination of patent right or utility model