CN103778200B - Message information source extraction method and system

Info

Publication number: CN103778200B
Application number: CN201410010836.XA
Authority: CN
Inventors: 刘春阳; 程工; 张旭; 王卿; 程学旗; 吴琼; 徐学可
Original assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Priority date: 2014-01-09
Filing date: 2014-01-09
Publication date: 2017-08-08
Anticipated expiration: 2034-01-09
Also published as: CN103778200A

Abstract

The invention discloses a message information source extraction method and system. The method extracts the information sources in a message by matching keywords from an information source extraction rule base, and determines the type of each information source by matching rules from the same rule base. The method comprises a message parsing step and an information source extraction step. The message parsing step extracts the characters from the input text and segments them into clauses. The information source extraction step performs keyword matching on each clause according to the information source extraction rule base, extracts a useful element sequence from the clause, extracts the information source on the basis of that sequence, and determines the information source type by matching rules from the rule base.

Description

Message information source extraction method and system

Technical field

The present invention relates to the field of text mining, and more particularly to a message information source extraction method and system.

Background art

In recent years, with the development of Internet technology, all kinds of information have spread widely over the network. This information varies greatly in quality and credibility: it comes both from formal traditional news media and from emerging media with comparatively poor credibility, such as forums, blogs and microblogs. How to extract useful information sources has therefore become a research question of wide concern.

Information extraction (IE), as the name suggests, structures the information contained in text into a table-like organizational form. The input of an information extraction system is raw text, and the output is information points in a fixed format. Extracting information points from various documents and then integrating them in a unified form is the main task of information extraction.

Information extraction technology does not attempt to understand an entire document comprehensively. It analyzes only the parts of the document that contain relevant information; which information counts as relevant depends on the domain fixed when the system is designed.

Information extraction technology is very useful when specific facts need to be extracted from a large number of documents. The Internet is just such a document collection: information on the same subject is usually scattered across different websites and presented in different forms. Collecting this information and storing it in a structured form would be highly beneficial.

Summary of the invention

The technical problem to be solved by the present invention is to provide a message information source extraction method and system that overcome the low extraction efficiency and complicated operation of information extraction techniques in the prior art.

To achieve the above object, the invention provides a message information source extraction method, characterized in that the method extracts the information source in a message by matching keywords from an information source extraction rule base, and determines the information source type by matching rules from the information source extraction rule base, the method comprising:

Message parsing step: according to the input text, extracting the characters in the text, and segmenting the characters into different clauses;

Information source extraction step: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source on the basis of the useful element sequence, and determining the information source type by matching rules from the information source extraction rule base.

The above message information source extraction method, characterized in that the information source extraction rule base further comprises: a useful element library, true information source recognition rules, information source type recognition rules and character type recognition rules.

The above message information source extraction method, characterized in that, before the message parsing step, the method further comprises:

Message content adaptation step: shielding differences in message encoding or storage mode, and providing a unified iterative message character reading interface.
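
As an illustration of the adaptation step, the following minimal Python sketch hides the message encoding behind a single character iterator. The function name `iter_message_chars`, the `chunk_size` parameter and the sample text are assumptions made for this example; the patent does not specify an implementation.

```python
import codecs
import io
from typing import BinaryIO, Iterator


def iter_message_chars(stream: BinaryIO, encoding: str = "utf-8",
                       chunk_size: int = 4096) -> Iterator[str]:
    """Yield decoded characters one by one, whatever the byte encoding.

    Hypothetical helper: callers only see characters, never raw bytes.
    """
    # An incremental decoder copes with multi-byte characters that are
    # split across two chunks of the byte stream.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Flush whatever is still buffered inside the decoder.
            yield from decoder.decode(b"", final=True)
            return
        yield from decoder.decode(chunk)


# The same sample text stored in two encodings yields the same characters.
sample = "新华网报道"
chars_from_utf8 = list(iter_message_chars(io.BytesIO(sample.encode("utf-8")),
                                          "utf-8", chunk_size=2))
chars_from_gbk = list(iter_message_chars(io.BytesIO(sample.encode("gbk")),
                                         "gbk", chunk_size=3))
```

Upper layers that consume this iterator need no knowledge of whether the message was stored as UTF-8 or GBK, which is the kind of shielding the step describes.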

The above message information source extraction method, characterized in that the method further comprises:

Information source statistics step: collecting the extraction results for the extracted information sources, and calculating statistical information about the information sources.

The above message information source extraction method, characterized in that the message parsing step further comprises:

Message character reading step: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;

Character type judgment step: dividing the characters into different types according to the character type recognition rules;

Response events step: according to the type of each character, notifying the user to perform the extraction operation for that character type.

The above message information source extraction method, characterized in that the information source extraction step further comprises:

Index building step: building a TRIE keyword index from the useful element library;

Clause segmentation step: segmenting the characters from the response events step into different clauses;

Extraction processing step: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting the information source, judging the authenticity of the information source, and determining the information source type;

Output step: outputting the information source and the information source type.

The above message information source extraction method, characterized in that the extraction processing step further comprises:

Information source extraction step: performing information source extraction clause by clause, and extracting a candidate information source or a list of candidate information sources according to the TRIE keyword index built from the useful element library;

Useful element extraction step: according to the candidate information source or the list of candidate information sources, extracting from the clause the useful elements and the positions of the useful elements within the clause;

True information source judgment step: judging, by means of the predefined true information source recognition rules, whether the candidate information source is a true information source;

Information source type extraction step: determining the information source type by matching the predefined information source type recognition rules against the useful elements.

The above message information source extraction method, characterized in that the useful element library contains useful elements, the useful elements comprising: media name indicator words, date information, media report behavior words and media indicator words.

The above message information source extraction method, characterized in that the true information source recognition rules are heuristic rules formulated manually by observing messages, and rules can be added or modified.

The above message information source extraction method, characterized in that the true information source recognition rules comprise a heuristic rule: if a clause contains only one candidate information source and contains a media report behavior word, and in addition the candidate information source string ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a true information source.

The above message information source extraction method, characterized in that the information source types comprise: news media, forum, blog and microblog.

The above message information source extraction method, characterized in that, in the information source type extraction step, for an information source whose type is blog and/or microblog, the user name or the blog website information is further extracted.

The present invention also provides a message information source extraction system that uses the above message information source extraction method, characterized in that the system comprises:

Message parsing module: according to the input text, performing encoding parsing, extracting the characters in the text, and segmenting the characters into different clauses;

Information source extraction module: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source on the basis of the useful element sequence, and determining the information source type by matching rules from the information source extraction rule base.

The above message information source extraction system, characterized in that the information source extraction rule base further comprises: a useful element library, true information source recognition rules, information source type recognition rules and character type recognition rules.

The above message information source extraction system, characterized in that the system further comprises:

Message content adaptation module: shielding differences in message encoding or storage mode, and providing a unified iterative message character reading interface.

The above message information source extraction system, characterized in that the system further comprises:

Information source statistics module: collecting the extraction results for the extracted information sources, and calculating statistical information about the information sources.

The above message information source extraction system, characterized in that the message parsing module further comprises:

Message character reading module: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;

Character type judgment module: dividing the characters into different types according to the character type recognition rules;

Response events module: according to the type of each character, notifying the user to perform the extraction operation for that character type.

The above message information source extraction system, characterized in that the information source extraction module further comprises:

Index building module: building a TRIE keyword index from the useful element library;

Clause segmentation module: segmenting the characters from the response events step into different clauses;

Extraction processing module: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting the information source, judging the authenticity of the information source, and determining the information source type;

Output module: outputting the information source and the information source type.

The above message information source extraction system, characterized in that the extraction processing module further comprises:

Information source extraction module: performing information source extraction clause by clause, and extracting a candidate information source or a list of candidate information sources according to the TRIE keyword index built from the useful element library;

Useful element extraction module: according to the candidate information source or the list of candidate information sources, extracting from the clause the useful elements and the positions of the useful elements within the clause;

True information source judgment module: judging, by means of the predefined true information source recognition rules, whether the candidate information source is a true information source;

Information source type extraction module: determining the information source type by matching the predefined information source type recognition rules against the useful elements.

The above message information source extraction system, characterized in that, in the information source type extraction module, for an information source whose type is blog and microblog, the user name and the blog website information are further extracted.

Compared with the prior art, the beneficial effects of the present invention are:

1. The invention is based on a general, event-response-based information extraction framework that can be flexibly extended to implement specific extraction tasks.

2. The invention effectively integrates the information source extraction rule base, extracts information sources from messages and determines their types, which improves the efficiency of message information source extraction and reduces operational difficulty.

Brief description of the drawings

Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention;

Fig. 2 is a schematic diagram of the message parsing step of the present invention;

Fig. 3 is a schematic diagram of the information source extraction step of the present invention;

Fig. 4 is a schematic diagram of the extraction processing step of the present invention;

Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the present invention;

Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the present invention;

Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the present invention;

Fig. 8 is a structural diagram of the message information source extraction system of the present invention;

Fig. 9 is a structural diagram of the message information source extraction system of a specific embodiment of the present invention.

Reference numerals:

1 message content adaptation module
2 information source extraction module
3 message parsing module
4 information source statistics module
21 message character reading module
22 character type judgment module
23 response events module
31 index building module
32 clause segmentation module
33 extraction processing module
34 output module
331 information source extraction module
332 useful element extraction module
333 true information source judgment module
334 information source type extraction module

S1~S4, S11~S13, S21~S24, S231~S234, S100~S102, S1031~S1034: implementation steps of the embodiments of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are given below, and the invention is described in detail with reference to the drawings.

Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention. As shown in Fig. 1, the invention provides a message information source extraction method that extracts the information source in a message by matching keywords from an information source extraction rule base, and determines the information source type by matching rules from the information source extraction rule base. The method comprises:

Message content adaptation step S1: shielding differences in message encoding or storage mode, and providing a unified iterative message character reading interface;

Message parsing step S2: according to the input text, extracting the characters in the text, and segmenting the characters into different clauses;

Information source extraction step S3: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source on the basis of the useful element sequence, and determining the information source type by matching rules from the information source extraction rule base;

Information source statistics step S4: collecting the extraction results for the extracted information sources, and calculating statistical information about the information sources.
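
The text does not say which statistics step S4 computes. As one plausible reading, the sketch below simply counts extraction results per source name and per source type; the `ExtractedSource` record and the `summarize_sources` function are hypothetical names introduced for this example.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class ExtractedSource(NamedTuple):
    """Hypothetical record for one extraction result."""
    name: str
    source_type: str


def summarize_sources(results: Iterable[ExtractedSource]) -> Tuple[Counter, Counter]:
    """Count how often each source name and each source type was extracted."""
    results = list(results)
    by_name = Counter(r.name for r in results)
    by_type = Counter(r.source_type for r in results)
    return by_name, by_type


# Illustrative extraction results, not data from the patent.
results = [
    ExtractedSource("Sina Blog", "blog"),
    ExtractedSource("Maeil Business Newspaper", "news media"),
    ExtractedSource("Sina Blog", "blog"),
]
by_name, by_type = summarize_sources(results)
```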

The information source extraction rule base further comprises: a useful element library, true information source recognition rules, information source type recognition rules and character type recognition rules.

Fig. 2 is a schematic diagram of the message parsing step of the present invention. As shown in Fig. 2, the message parsing step S2 further comprises:

Message character reading step S21: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;

Character type judgment step S22: dividing the characters into different types according to the character type recognition rules;

Response events step S23: according to the type of each character, notifying the user to perform the extraction operation for that character type.

Fig. 3 is a schematic diagram of the information source extraction step of the present invention. As shown in Fig. 3, the information source extraction step S3 further comprises:

Index building step S31: building a TRIE keyword index from the useful element library;

Clause segmentation step S32: segmenting the characters from the response events step into different clauses;

Extraction processing step S33: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting the information source, judging the authenticity of the information source, and determining the information source type;

Output step S34: outputting the information source and the information source type.
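
The index building and keyword matching of steps S31 and S33 can be sketched with a small dictionary-based trie. This is only an illustration: the class name `KeywordTrie`, the element type labels and the English stand-in keywords are assumptions, and a production system might use a more efficient multi-pattern matcher.

```python
from typing import Dict, List, Tuple

_END = ""  # sentinel key; cannot collide with a one-character key


class KeywordTrie:
    """Minimal trie mapping keywords to a useful element type."""

    def __init__(self) -> None:
        self.root: Dict = {}

    def add(self, keyword: str, element_type: str) -> None:
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[_END] = element_type

    def find_all(self, clause: str) -> List[Tuple[int, str, str]]:
        """Return (position, keyword, element type) for every match in the clause."""
        hits = []
        for start in range(len(clause)):
            node = self.root
            for pos in range(start, len(clause)):
                node = node.get(clause[pos])
                if node is None:
                    break
                if _END in node:
                    hits.append((start, clause[start:pos + 1], node[_END]))
        return hits


# Hypothetical entries standing in for the useful element library.
trie = KeywordTrie()
trie.add("net", "name_indicator")
trie.add("report", "report_verb")
trie.add("media", "media_indicator")
hits = trie.find_all("Xinhua net report")
```

Recording the match position alongside the element type mirrors the requirement that useful elements be extracted together with their positions in the clause.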

Fig. 4 is a detailed schematic diagram of the steps of the message information source extraction method of the present invention. As shown in Fig. 4, the extraction processing step S33 further comprises:

Information source extraction step S331: performing information source extraction clause by clause, and extracting a candidate information source or a list of candidate information sources according to the TRIE keyword index built from the useful element library;

Useful element extraction step S332: according to the candidate information source or the list of candidate information sources, extracting from the clause the useful elements and the positions of the useful elements within the clause;

True information source judgment step S333: judging, by means of the predefined true information source recognition rules, whether the candidate information source is a true information source;

Information source type extraction step S334: determining the information source type by matching the predefined information source type recognition rules against the useful elements.

The useful element library contains useful elements, which comprise: media name indicator words, date information, media report behavior words and media indicator words.

The true information source recognition rules are heuristic rules formulated manually by observing messages, and rules can be added or modified.

Further, the true information source recognition rules of the invention comprise a heuristic rule: if a clause contains only one candidate information source and contains a media report behavior word, and in addition the candidate information source string ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a true information source.
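
The heuristic rule above can be written as a small predicate. The sketch below assumes that the useful elements of a clause have already been extracted; the `Element` record, the kind labels and the function name `is_true_source` are hypothetical, and "Sina Blog" is used only as a sample candidate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """One useful element found in a clause (hypothetical structure)."""
    kind: str       # "name_indicator", "date", "report_verb" or "media_indicator"
    text: str
    position: int   # offset of the element within the clause


def is_true_source(candidates: List[str], elements: List[Element]) -> bool:
    """Apply the single heuristic rule described in the text."""
    # The clause must contain exactly one candidate information source.
    if len(candidates) != 1:
        return False
    kinds = {e.kind for e in elements}
    # A media report behavior word must be present.
    if "report_verb" not in kinds:
        return False
    candidate = candidates[0].lower()
    ends_with_indicator = any(
        e.kind == "name_indicator" and candidate.endswith(e.text.lower())
        for e in elements
    )
    # Any one of the three supporting conditions is sufficient.
    return ends_with_indicator or "date" in kinds or "media_indicator" in kinds


elements = [Element("name_indicator", "blog", 5), Element("report_verb", "report", 11)]
accepted = is_true_source(["Sina Blog"], elements)
```

Because the rule is plain data plus a predicate, further rules can be added or modified without touching the parsing code, in line with the statement that rules are maintained manually.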

The information source types comprise: news media, forum, blog and microblog.

In the information source type extraction step S334, for an information source whose type is blog and/or microblog, the user name or the blog website information is further extracted.

The steps are explained below with reference to a specific embodiment of the invention. Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the present invention; the operating steps shown in Fig. 5 illustrate the message information source extraction process.

The object of the present invention is to provide a user-friendly information extraction technique that extracts the information sources appearing in a message, automatically analyzes the type (news, forum, blog, microblog) and name of each source, and extracts the user names of blogs and microblogs.

To achieve this object, the invention provides a rule-matching-based method and a rule base for information source extraction, comprising the following steps:

Step S100: Read the rule base, extract the keywords and their type information from it, and build the TRIE keyword index.

Step S101: Perform encoding parsing on the input text, that is, extract a character stream from the text, such as Chinese characters and punctuation.

Step S102: Perform sentence segmentation, dividing the input text into different clauses.

Step S103: Process each clause as follows:

Step S1031: Perform multi-pattern matching and date matching using the previously built TRIE index, converting the clause into a sequence of "useful elements" while recording the position of each "useful element" in the clause. Useful elements include media name indicator words, media report behavior words, media indicator words and so on.

Step S1032: On the useful element sequence, match the predefined rules one by one; if a candidate information source exists, extract it and determine whether it is a true information source.

Step S1033: By matching predefined rules, further determine the type of the extracted information source.

Step S1034: Output the results.
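
Steps S102 and S103 start from clause segmentation and candidate extraction, which can be sketched as follows. The delimiter set, the regular expression and the function names are assumptions for this example; the embodiment described later only states that punctuation such as "," is used for segmentation and that candidates are usually enclosed in quotation marks or 《》.

```python
import re
from typing import List

# Assumed clause delimiters (full-width and ASCII); the period is left out
# so that English abbreviations in the sample text are not split.
CLAUSE_DELIMITERS = "，,。；;！!？?"

# Candidates enclosed in curly quotes, 《》 or straight double quotes.
CANDIDATE_PATTERN = re.compile(r'“([^”]+)”|《([^》]+)》|"([^"]+)"')


def split_clauses(text: str) -> List[str]:
    """Split text into non-empty clauses at the assumed delimiters."""
    parts = re.split("[" + re.escape(CLAUSE_DELIMITERS) + "]", text)
    return [p.strip() for p in parts if p.strip()]


def extract_candidates(clause: str) -> List[str]:
    """Return the quoted strings in a clause as candidate information sources."""
    return [next(g for g in m.groups() if g)
            for m in CANDIDATE_PATTERN.finditer(clause)]


text = "《Maeil Business Newspaper》 reported on April 1, according to the media"
clauses = split_clauses(text)
candidates_per_clause = [extract_candidates(c) for c in clauses]
```

A limitation of this sketch is that a delimiter inside a quoted name would split the candidate; a character-driven implementation could track quote state to avoid that.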

Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the present invention. As shown in Fig. 6, message parsing consists of three steps:

Step S200: Read a message character. The Parser reads one character through the iterative message character reading interface; that is, the interface reads the message byte stream, assembles the bytes into an actual character (such as a Chinese character) according to the corresponding encoding, and returns it to the Parser.

Step S201: Judge the character type. Characters are divided into different types according to their functional role in the extraction of different elements, such as year, month, day and certain special punctuation marks.

Step S202: Notify the Listeners to respond to the event. According to the character type, each Listener (observer) is notified to execute the corresponding callback function in response to the character reading event.
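
Steps S200 to S202 describe an event-driven loop, which the following sketch illustrates with plain Python callables as listeners. The character type names, the `classify_char` rules and the `ClauseCollector` listener are assumptions for this example rather than the patent's own rule set.

```python
from typing import Callable, Iterable, List

Listener = Callable[[str, str], None]  # receives (character, character type)


def classify_char(ch: str) -> str:
    """Assumed character type rules, loosely following step S201."""
    if ch.isdigit():
        return "digit"
    if ch in "年月日":
        return "date_unit"
    if ch in "，,。；;":
        return "clause_end"
    if ch in "“”《》":
        return "quote"
    return "other"


class Parser:
    """Subject of the observer pattern: notifies listeners for every character."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def register(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def parse(self, chars: Iterable[str]) -> None:
        for ch in chars:
            char_type = classify_char(ch)
            for listener in self._listeners:
                listener(ch, char_type)


class ClauseCollector:
    """Example listener that assembles clauses from character events."""

    def __init__(self) -> None:
        self.clauses: List[str] = []
        self._buffer: List[str] = []

    def __call__(self, ch: str, char_type: str) -> None:
        if char_type == "clause_end":
            self.flush()
        else:
            self._buffer.append(ch)

    def flush(self) -> None:
        if self._buffer:
            self.clauses.append("".join(self._buffer))
            self._buffer = []


parser = Parser()
collector = ClauseCollector()
seen_types: List[str] = []
parser.register(collector)
parser.register(lambda ch, char_type: seen_types.append(char_type))
parser.parse("据媒体报道，《晚报》发布")
collector.flush()  # emit the final clause, which has no trailing delimiter
```

Registering a second listener shows how another extraction task could be attached to the same character stream without changing the Parser.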

Information source extraction actually corresponds to the implementation of one specific Listener of the general extraction framework, which completes the information source extraction function by continually responding to character reading events. Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the present invention. As shown in Fig. 7, the steps of this flow are as follows:

Step S301: Punctuation marks such as "," are used to split the text into clauses, and information source extraction is then performed clause by clause.

Step S302: Candidate information sources (usually enclosed in "" or 《》) or a list of candidate information sources are extracted.

Step S303: If a candidate information source exists, the useful elements and their positions in the clause are extracted from the clause. These useful elements and their positions help to locate the true information source and to determine its type. Here, the useful elements include the following types:

A) Media name indicator words, such as "Times", "net", "news", "blog", "mhkc" and "evening paper". A candidate source string that ends with a media name indicator word is probably a real media name, such as "Sina Blog" or 《Maeil Business Newspaper》.

B) Date information. A candidate source is often accompanied by a report date, such as "June 24-25" or "April 1".

C) Media report behavior words, such as "message", "report", "reprint", "comment", "publish" and "issue". They indicate that the clause may state a news reporting action, and thus help to judge whether the candidate source is a true source.

D) Media indicator words, such as "domestic", "according to", "media" and "website". They usually appear around a candidate source and indicate that the candidate source string is probably a media name.

Step S304: On this basis, the predefined rules can easily be matched one by one to judge whether a candidate information source (if any) is a true information source.

For example, one of the simplest heuristic rules is as follows: if a clause contains only one candidate information source and contains a media report behavior word, and one of the following conditions is also met, the candidate information source can be judged to be a true information source:

A) The candidate source string ends with a media name indicator word.

B) Date information appears in the clause containing the candidate source string.

C) Media indicator words such as "domestic", "according to", "media" or "website" appear around the candidate source string.

For example, the clause '"NGO develops AC network" the domestic daily magazine note in March 11' satisfies the above heuristic rule, so "NGO develops AC network" can be extracted as the information source.

The heuristic rules here are mainly formulated manually by observing messages. They may include many complex rules, and rules are continually added or modified. An efficient, extensible information extraction system is implemented that flexibly supports adding or modifying rules.

Step S305: The type of each extracted information source is further determined, the types being news media, forum, blog and microblog. For blogs and microblogs, the user name and the blog or microblog site information are further extracted. Here too, a series of rules is formulated, and the information source type is determined by matching the rules one by one. These rules use the useful element information provided by step S303, namely the media name indicator word information contained in the information source name (if any) and the information on other surrounding elements. For example, for "according to www.xinhuanet.com microblog user "XXXX"", the extracted information source type is microblog, the user name is "XXXX", and the microblog website is "www.xinhuanet.com's microblog".

Step S306: All information sources in the message and their type information are output.
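
The type judgment of step S305 can be illustrated with a rule table keyed on the media name indicator word at the end of the source name. The table contents, the English user-name pattern and both function names are assumptions made for this sketch; the actual rule set is not given in the text.

```python
import re
from typing import Optional

# Hypothetical rules: (media name indicator word, information source type).
# "microblog" is listed before "blog" because it also ends with "blog".
TYPE_RULES = [
    ("microblog", "microblog"),
    ("blog", "blog"),
    ("forum", "forum"),
    ("evening paper", "news media"),
    ("times", "news media"),
    ("news", "news media"),
    ("net", "news media"),
]

# Assumed surface pattern for a user name following the source name.
USER_PATTERN = re.compile(r'user\s+"([^"]+)"')


def judge_source_type(source_name: str) -> Optional[str]:
    """Return the type of the first rule whose indicator ends the name."""
    lowered = source_name.lower()
    for indicator, source_type in TYPE_RULES:
        if lowered.endswith(indicator):
            return source_type
    return None


def extract_user_name(clause: str, source_type: Optional[str]) -> Optional[str]:
    """Extract a user name, but only for blog and microblog sources."""
    if source_type not in ("blog", "microblog"):
        return None
    match = USER_PATTERN.search(clause)
    return match.group(1) if match else None


clause = 'Xinhuanet microblog user "XXXX" said'
source_type = judge_source_type("Xinhuanet microblog")
user_name = extract_user_name(clause, source_type)
```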

The present invention also provides a message information source extraction system that uses the message information source extraction method. Fig. 8 is a structural diagram of the message information source extraction system of the present invention. As shown in Fig. 8, the system comprises:

Message content adaptation module 1: shielding differences in message encoding or storage mode, and providing a unified iterative message character reading interface;

Message parsing module 2: according to the input text, performing encoding parsing, extracting the characters in the text, and segmenting the characters into different clauses;

Information source extraction module 3: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source on the basis of the useful element sequence, and determining the information source type by matching rules from the information source extraction rule base;

Information source statistics module 4: collecting the extraction results for the extracted information sources, and calculating statistical information about the information sources.

The message parsing module 2 further comprises:

Message character reading module 21: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;

Character type judgment module 22: dividing the characters into different types according to the character type recognition rules;

Response events module 23: according to the type of each character, notifying the user to perform the extraction operation for that character type.

The information source extraction module 3 further comprises:

Index building module 31: building a TRIE keyword index from the useful element library;

Clause segmentation module 32: segmenting the characters from the response events step into different clauses;

Extraction processing module 33: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting the information source, judging the authenticity of the information source, and determining the information source type;

Output module 34: outputting the information source and the information source type.

The extraction processing module 33 further includes:

Information source extraction module 331: performs information source extraction clause by clause and, using the TRIE keyword index built from the useful element library, extracts a candidate information source or a list of candidate information sources;

Useful element extraction module 332: according to the candidate information source or the list of candidate information sources, extracts from the clause the useful elements and the position of each useful element in the clause;

Real information source judging module 333: judges, by predefined real information source recognition rules, whether a candidate information source is a real information source;

Information source type extraction module 334: determines the information source type by matching the useful elements against predefined information source type recognition rules.
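The judging and type-recognition steps of modules 333 and 334 can be sketched as follows. This is a simplified illustration under stated assumptions, not the rule set of the invention: the function names, the element labels, the particular combination of conditions in `is_real_source`, and the indicator words in `TYPE_WORDS` are all hypothetical. The rule shown accepts a candidate only when the clause holds exactly one candidate, contains a media report action word, and either the candidate ends with a media indicator word or the clause contains date information.

```python
from typing import List, Tuple

Span = Tuple[int, int]      # (start, end) of a candidate source in the clause
Hit = Tuple[int, int, str]  # (start, end, element kind) of a useful element

# Hypothetical indicator words for type recognition; the first match wins.
TYPE_WORDS = [("微博", "microblog"), ("博客", "blog"), ("论坛", "forum")]


def is_real_source(candidates: List[Span], hits: List[Hit]) -> bool:
    """Apply one simplified heuristic rule to the candidates of a clause."""
    if len(candidates) != 1:
        return False
    if not any(kind == "report_action" for _, _, kind in hits):
        return False
    start, end = candidates[0]
    ends_with_indicator = any(
        kind == "media_indicator" and s >= start and e == end
        for s, e, kind in hits
    )
    has_date = any(kind == "date" for _, _, kind in hits)
    return ends_with_indicator or has_date


def source_type(name: str) -> str:
    """Map a source name to a type; anything unmatched defaults to news."""
    for word, kind in TYPE_WORDS:
        if word in name:
            return kind
    return "news"
```

Keeping the rules as small predicates over element positions makes it straightforward to add or modify individual rules without touching the matching code.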

In the information source type extraction module 334, for information sources whose type is blog or microblog, the user name and the blog website information must additionally be extracted.

The message information source extraction system is described below with reference to a specific embodiment of the invention. Fig. 9 is a schematic structural diagram of the message information source extraction system of this embodiment. As shown in Fig. 9, the message information source extraction system of the invention comprises the following four layers:

1) Message content adaptation layer: shields differences in message encoding, storage mode and the like, and provides upper-layer modules with a consistent interface for iteratively reading message characters, so that upper-layer modules need only be concerned with the extraction logic.

2) Parser layer: implements the overall event-driven information extraction flow. The observer design pattern is used here: the Parser is in effect a subject (Subject) on which a series of observers (Observer) are registered. The overall flow is as follows: message characters are read iteratively through the content adaptation layer; each character read is treated as an event, and every observer is notified to execute its callback function in response to the event.

3) Extractor layer: each Extractor corresponds to an observer (Listener) and completes a specific information extraction function by implementing specific event response actions. Information source extraction is one concrete implementation of the Extractor layer: from the input message content it extracts information sources of types such as news, forum, blog and microblog; it provides name normalization for news and forum information sources; and it provides user name and site name extraction for blog and microblog information sources.

4) Information source statistics layer: traverses the message database, reads each message, and performs information source extraction on its content. Finally, it aggregates all extraction results, computes statistics such as the number of occurrences of each extracted information source and the distribution of message categories, and writes the statistical results to the database.
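The aggregation performed by the statistics layer can be sketched in a few lines of Python. The function name `summarize` and the shape of the input records are assumptions; reading messages from the message database and writing the results back are omitted.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Tuple

# One record per extracted source: (message category, source name, source type).
Record = Tuple[str, str, str]


def summarize(records: Iterable[Record]) -> Tuple[Counter, DefaultDict[str, Counter]]:
    """Count occurrences of each source and its distribution over message categories."""
    occurrences: Counter = Counter()
    by_category: DefaultDict[str, Counter] = defaultdict(Counter)
    for category, name, _source_type in records:
        occurrences[name] += 1
        by_category[name][category] += 1
    return occurrences, by_category
```

For each source, `occurrences` gives the total number of times it was extracted and `by_category[name]` gives how those occurrences are spread across message categories.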

Of course, the invention may have various other embodiments. Those skilled in the art may make corresponding changes and modifications according to the invention without departing from its spirit and essence, but all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims

1. A message information source extraction method, characterized in that the method extracts the information source in a message by matching keywords of an information source extraction rule base, and determines the information source type by matching rules of the information source extraction rule base, which are formulated manually by observing messages and can be added to or modified, the method comprising:

Packet parsing step: according to the input text, extracting the characters in the text and segmenting the characters into separate clauses, the packet parsing step further comprising:

Character type judging step: classifying characters into different types according to character type recognition rules;

Event response step: according to the type of each character, notifying the user to perform the extraction operation for that character type;

Information source extraction step: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source on the basis of the useful element sequence, and determining the information source type by matching the rules of the information source extraction rule base, the information source extraction rule base further comprising: a useful element library, real information source recognition rules, information source type recognition rules and character type recognition rules, the information source extraction step further comprising:

Extraction processing step: according to the TRIE keyword index, performing keyword matching on the clauses, extracting the information source, judging whether the information source is genuine, and determining the information source type;

2. The message information source extraction method according to claim 1, characterized in that, before the packet parsing step, the method further comprises:

Message content adaptation step: shielding differences in message encoding or storage mode and providing a unified interface for iteratively reading message characters.

3. The message information source extraction method according to claim 2, characterized in that the method further comprises:

4. The message information source extraction method according to claim 1, characterized in that the extraction processing step further comprises:

Useful element extraction step: according to the candidate information source or the list of candidate information sources, extracting from the clause the useful elements and the position of each useful element in the clause;

Real information source judging step: judging, by predefined real information source recognition rules, whether the candidate information source is a real information source;

Information source type extraction step: determining the information source type by matching the useful elements against the predefined information source type recognition rules.

5. The message information source extraction method according to claim 4, characterized in that the useful element library contains useful elements, the useful elements comprising: media name indicator words, date and time information, media report action words and media indicator words.

6. The message information source extraction method according to claim 5, characterized in that the real information source recognition rules are heuristic rules formulated manually by observing messages, and the rules can be added to or modified.

7. The message information source extraction method according to claim 6, characterized in that the real information source recognition rules include a heuristic rule: if the clause contains only one candidate information source and contains a media report action word, and the string of the candidate information source ends with a media indicator word, or the date and time information appears in the clause containing the follow-up source string, or a media indicator word appears in the follow-up source string, then the candidate information source is judged to be a real information source.

8. The message information source extraction method according to claim 1, characterized in that the information source types include: news media, forum, blog and microblog.

9. The message information source extraction method according to claim 4, characterized in that, in the information source type extraction step, for information sources whose type is blog and/or microblog, the user name or blog website information must additionally be extracted.

10. A message information source extraction system employing the message information source extraction method according to any one of claims 1-9, characterized in that the system comprises:

Packet parsing module: decoding the input text according to its encoding, extracting the characters in the text, and segmenting the characters into separate clauses;

Information source extraction module: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source on the basis of the useful element sequence, and determining the information source type by matching the rules of the information source extraction rule base.

11. The message information source extraction system according to claim 10, characterized in that the information source extraction rule base further comprises: a useful element library, real information source recognition rules, information source type recognition rules and character type recognition rules.

12. The message information source extraction system according to claim 10, characterized in that the system further comprises:

Message content adaptation module: shielding differences in message encoding or storage mode and providing a unified interface for iteratively reading message characters.

13. The message information source extraction system according to claim 10 or 11, characterized in that the system further comprises:

14. The message information source extraction system according to claim 10, characterized in that the packet parsing module further comprises:

Event response module: according to the type of each character, notifying the user to perform the extraction operation for that character type.

15. The message information source extraction system according to claim 10, characterized in that the information source extraction module further comprises:

Extraction processing module: according to the TRIE keyword index, performing keyword matching on the clauses, extracting the information source, judging whether the information source is genuine, and determining the information source type;

16. The message information source extraction system according to claim 15, characterized in that the extraction processing module further comprises:

Useful element extraction module: according to the candidate information source or the list of candidate information sources, extracting from the clause the useful elements and the position of each useful element in the clause;

Real information source judging module: judging, by predefined real information source recognition rules, whether the candidate information source is a real information source;

Information source type extraction module: determining the information source type by matching the useful elements against the predefined information source type recognition rules.

17. The message information source extraction system according to claim 16, characterized in that, in the information source type extraction module, for information sources whose type is blog or microblog, the user name and the blog website information must additionally be extracted.