CN103778200B - A kind of message information source abstracting method and its system - Google Patents
Message information source extraction method and system
- Publication number
- CN103778200B (application CN201410010836.XA)
- Authority
- CN
- China
- Prior art keywords
- information source
- message
- extraction
- character
- rule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a message information source extraction method and system. The method extracts the information sources in a message by matching keywords from an information source extraction rule base, and determines the information source type by matching rules in that rule base. The method comprises a message parsing step and an information source extraction step. The message parsing step extracts the characters from the input text and segments them into clauses. The information source extraction step performs keyword matching on each clause according to the information source extraction rule base, extracts a sequence of useful elements from the clause, extracts the information source on the basis of that sequence, and determines the information source type by matching rules in the rule base.
Description
Technical field
The present invention relates to the field of text mining, and in particular to a message information source extraction method and system.
Background
In recent years, with the development of Internet technology, all kinds of information have spread widely over the network. The quality and credibility of this information vary greatly: there are relatively formal traditional news media, and there are also emerging media with lower credibility, such as forums, blogs and microblogs. How to extract useful information sources has therefore become a research question of wide concern.
Information extraction (IE), as the name suggests, structures the information contained in text into a table-like form. The input of an information extraction system is raw text and the output is information points in a fixed format. Information points are extracted from various documents and then integrated in a uniform form; this is the main task of information extraction.
Information extraction does not attempt to understand an entire document; it analyses only the parts of the document that contain relevant information. Which information is relevant is determined by the domain fixed when the system is designed.
Information extraction is very useful when specific facts need to be extracted from a large number of documents. The Internet is such a document collection: information on the same topic is usually scattered across different websites and presented in different forms. Storing this information together in a structured form would be highly beneficial.
Summary of the invention
The technical problem to be solved by the invention is to provide a message information source extraction method and system that overcome the low extraction efficiency and complicated operation of prior-art information extraction techniques.
To achieve the above object, the invention provides a message information source extraction method, characterised in that the method extracts the information sources in a message by matching keywords from an information source extraction rule base, and determines the information source type by matching rules in that rule base, the method comprising:
Message parsing step: according to the input text, extract the characters in the text and segment the characters into clauses;
Information source extraction step: perform keyword matching on the clauses according to the information source extraction rule base, extract a useful element sequence from each clause, extract the information source on the basis of the useful element sequence, and determine the information source type by matching rules in the information source extraction rule base.
The above message information source extraction method, characterised in that the information source extraction rule base further comprises: a useful element library, real information source identification rules, information source type identification rules and character type identification rules.
The above message information source extraction method, characterised in that, before the message parsing step, the method further comprises:
Message content adaptation step: shield differences in the encoding or storage of messages and provide a unified iterative interface for reading message characters.
The above message information source extraction method, characterised in that the method further comprises:
Information source statistics step: collect the extraction results for the extracted information sources and compute statistics on the information sources.
The above message information source extraction method, characterised in that the message parsing step further comprises:
Message character reading step: read the message byte stream and assemble the bytes into actual characters according to the encoding;
Character type determination step: classify each character into a type according to the character type identification rules;
Event response step: according to the character type, notify the user to carry out the extraction operation for that type of character.
The above message information source extraction method, characterised in that the information source extraction step further comprises:
Index building step: build a TRIE keyword index from the useful element library;
Clause segmentation step: segment the characters from the event response step into clauses;
Extraction processing step: using the TRIE keyword index, perform keyword matching on the clauses, extract information sources, judge whether each information source is genuine, and determine the information source type;
Output step: output the information sources and their information source types.
The above message information source extraction method, characterised in that the extraction processing step further comprises:
Information source extraction step: taking the clause as the unit, extract a candidate information source or a list of candidate information sources using the TRIE keyword index built from the useful element library;
Useful element extraction step: for the candidate information source or list of candidate information sources, extract from the clause the useful elements and the position of each useful element in the clause;
Real information source determination step: judge, by the predefined real information source identification rules, whether the candidate information source is a real information source;
Information source type extraction step: determine the information source type by matching the predefined information source type identification rules against the useful elements.
The above message information source extraction method, characterised in that the useful element library contains useful elements, the useful elements including: media name indicator words, date information, media reporting-action words and media indicator words.
The above message information source extraction method, characterised in that the real information source identification rules are heuristic rules, formulated manually by observing messages; rules can be added or modified.
The above message information source extraction method, characterised in that the real information source identification rules include the following heuristic rule: if a clause contains exactly one candidate information source and a media reporting-action word, and in addition the string of the candidate information source ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a real information source.
The above message information source extraction method, characterised in that the information source types include: news media, forum, blog and microblog.
The above message information source extraction method, characterised in that, in the information source type extraction step, for an information source whose type is blog and/or microblog, the user name or blog site information must additionally be extracted.
The invention also provides a message information source extraction system that uses the message information source extraction method described above, characterised in that the system comprises:
Message parsing module: according to the input text, parse the encoding, extract the characters in the text and segment the characters into clauses;
Information source extraction module: perform keyword matching on the clauses according to the information source extraction rule base, extract a useful element sequence from each clause, extract the information source on the basis of the useful element sequence, and determine the information source type by matching rules in the information source extraction rule base.
The above message information source extraction system, characterised in that the information source extraction rule base further comprises: a useful element library, real information source identification rules, information source type identification rules and character type identification rules.
The above message information source extraction system, characterised in that the system further comprises:
Message content adaptation module: shield differences in the encoding or storage of messages and provide a unified iterative interface for reading message characters.
The above message information source extraction system, characterised in that the system further comprises:
Information source statistics module: collect the extraction results for the extracted information sources and compute statistics on the information sources.
The above message information source extraction system, characterised in that the message parsing module further comprises:
Message character reading module: read the message byte stream and assemble the bytes into actual characters according to the encoding;
Character type determination module: classify each character into a type according to the character type identification rules;
Event response module: according to the character type, notify the user to carry out the extraction operation for that type of character.
The above message information source extraction system, characterised in that the information source extraction module further comprises:
Index building module: build a TRIE keyword index from the useful element library;
Clause segmentation module: segment the characters from the event response module into clauses;
Extraction processing module: using the TRIE keyword index, perform keyword matching on the clauses, extract information sources, judge whether each information source is genuine, and determine the information source type;
Output module: output the information sources and their information source types.
The above message information source extraction system, characterised in that the extraction processing module further comprises:
Information source extraction module: taking the clause as the unit, extract a candidate information source or a list of candidate information sources using the TRIE keyword index built from the useful element library;
Useful element extraction module: for the candidate information source or list of candidate information sources, extract from the clause the useful elements and the position of each useful element in the clause;
Real information source determination module: judge, by the predefined real information source identification rules, whether the candidate information source is a real information source;
Information source type extraction module: determine the information source type by matching the predefined information source type identification rules against the useful elements.
The above message information source extraction system, characterised in that, in the information source type extraction module, for an information source whose type is blog or microblog, the user name and blog site information must additionally be extracted.
Compared with the prior art, the beneficial effects of the invention are:
1. The invention is built on a general, event-response-based information extraction framework, so it can be flexibly extended to implement specific extraction tasks.
2. The invention effectively integrates the information source extraction rule base, extracts the message sources from a message and determines their types, which improves the efficiency of message information source extraction and reduces operating difficulty.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the invention;
Fig. 2 is a schematic diagram of the message parsing step of the invention;
Fig. 3 is a schematic diagram of the information source extraction step of the invention;
Fig. 4 is a schematic diagram of the extraction processing step of the invention;
Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the invention;
Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the invention;
Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the invention;
Fig. 8 is a structural diagram of the message information source extraction system of the invention;
Fig. 9 is a structural diagram of the message information source extraction system of a specific embodiment of the invention.
Reference numerals:
1 message content adaptation module
2 information source extraction module
3 message parsing module
4 information source statistics module
21 message character reading module
22 character type determination module
23 event response module
31 index building module
32 clause segmentation module
33 extraction processing module
34 output module
331 information source extraction module
332 useful element extraction module
333 real information source determination module
334 information source type extraction module
S1~S4, S11~S13, S21~S24, S231~S234, S100~S102, S1031~S1034: steps of the embodiments of the invention.
Embodiment
Specific embodiments of the invention are given below, and the invention is described in detail with reference to the drawings.
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the invention. As shown in Fig. 1, the invention provides a message information source extraction method that extracts the information sources in a message by matching keywords from an information source extraction rule base and determines the information source type by matching rules in that rule base. The method comprises:
Message content adaptation step S1: shield differences in the encoding or storage of messages and provide a unified iterative interface for reading message characters;
Message parsing step S2: according to the input text, extract the characters in the text and segment the characters into clauses;
Information source extraction step S3: perform keyword matching on the clauses according to the information source extraction rule base, extract a useful element sequence from each clause, extract the information source on the basis of the useful element sequence, and determine the information source type by matching rules in the information source extraction rule base;
Information source statistics step S4: collect the extraction results for the extracted information sources and compute statistics on the information sources.
The information source extraction rule base further comprises: a useful element library, real information source identification rules, information source type identification rules and character type identification rules.
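The message content adaptation step S1 can be illustrated with a minimal sketch. The patent does not disclose an implementation; the function name `iter_chars`, the chunked-bytes input and the use of Python's incremental decoders are assumptions made here purely for illustration of a "unified character iteration interface" that hides the encoding from the later steps.

```python
import codecs
from typing import Iterable, Iterator


def iter_chars(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded characters one at a time from an iterable of byte chunks.

    Hypothetical helper: callers see the same character stream whatever
    the encoding or chunking of the stored message is.
    """
    # An incremental decoder buffers incomplete multi-byte sequences that
    # straddle chunk boundaries; undecodable bytes are replaced, not raised.
    decoder = codecs.getincrementaldecoder(encoding)("replace")
    for chunk in chunks:
        for ch in decoder.decode(chunk):
            yield ch
    # Flush anything still buffered at the end of the input.
    for ch in decoder.decode(b"", final=True):
        yield ch
```

With such an interface, a GBK-encoded message read byte by byte and a UTF-8 message read in larger blocks yield the same characters, so the parsing step does not need to know how the message was stored.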
Fig. 2 is a schematic diagram of the message parsing step of the invention. As shown in Fig. 2, the message parsing step S2 further comprises:
Message character reading step S21: read the message byte stream and assemble the bytes into actual characters according to the encoding;
Character type determination step S22: classify each character into a type according to the character type identification rules;
Event response step S23: according to the character type, notify the user to carry out the extraction operation for that type of character.
Fig. 3 is a schematic diagram of the information source extraction step of the invention. As shown in Fig. 3, the information source extraction step S3 further comprises:
Index building step S31: build a TRIE keyword index from the useful element library;
Clause segmentation step S32: segment the characters from the event response step into clauses;
Extraction processing step S33: using the TRIE keyword index, perform keyword matching on the clauses, extract information sources, judge whether each information source is genuine, and determine the information source type;
Output step S34: output the information sources and their information source types.
Fig. 4 is a schematic diagram of the detailed steps of the message information source extraction method of the invention. As shown in Fig. 4, the extraction processing step S33 further comprises:
Information source extraction step S331: taking the clause as the unit, extract a candidate information source or a list of candidate information sources using the TRIE keyword index built from the useful element library;
Useful element extraction step S332: for the candidate information source or list of candidate information sources, extract from the clause the useful elements and the position of each useful element in the clause;
Real information source determination step S333: judge, by the predefined real information source identification rules, whether the candidate information source is a real information source;
Information source type extraction step S334: determine the information source type by matching the predefined information source type identification rules against the useful elements.
The useful element library contains useful elements, which include: media name indicator words, date information, media reporting-action words and media indicator words.
The real information source identification rules are heuristic rules, formulated manually by observing messages; rules can be added or modified.
Further, the real information source identification rules of the invention include the following heuristic rule: if a clause contains exactly one candidate information source and a media reporting-action word, and in addition the string of the candidate information source ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a real information source.
The information source types include: news media, forum, blog and microblog.
In the information source type extraction step S334, for an information source whose type is blog and/or microblog, the user name or blog site information must additionally be extracted.
The steps are illustrated below with reference to a specific embodiment of the invention. Fig. 5 is a schematic diagram of the steps of one embodiment of the message information source extraction method of the invention; as shown in Fig. 5, the operating steps of this embodiment illustrate the message information source extraction process.
The aim of the invention is to provide a user-friendly information extraction technique that extracts the information sources appearing in a message, automatically analyses the type (news, forum, blog, microblog) and name of each message source, and extracts the user names of blogs and microblogs.
To achieve this aim, the invention provides a rule-matching-based method and an information source extraction rule base, comprising the following steps:
Step S100: Read the rule base, extract the keywords and their type information from it, and build the TRIE keyword index.
Step S101: Parse the encoding of the input text, i.e. extract the character stream from the text, such as Chinese characters and punctuation.
Step S102: Perform sentence segmentation, dividing the input text into clauses.
Step S103: Process each clause as follows:
Step S1031: Using the TRIE index built in advance, perform multi-pattern matching and date matching, converting the clause into a sequence of "useful elements" while recording the position of each useful element in the clause. Useful elements include media name indicator words, media reporting-action words, media indicator words and so on.
Step S1032: Match the predefined rules one by one against the useful element sequence; if a candidate news information source exists, extract it and judge whether it is a real information source.
Step S1033: By matching predefined rules, further determine the type of the extracted information source.
Step S1034: Output the result.
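Steps S100 and S1031 can be sketched as follows. This is only an illustration under stated assumptions: the nested-dictionary trie, the `END` marker key, the longest-match policy and the English keywords are hypothetical choices, not details given in the patent (which works on Chinese text and also performs date matching, omitted here).

```python
from typing import Dict, List, Tuple

END = "$type"  # hypothetical marker key holding the element type at a terminal node


def build_trie(keywords: Dict[str, str]) -> dict:
    """Build a character trie from {keyword: element type} (sketch of step S100)."""
    root: dict = {}
    for word, kind in keywords.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = kind
    return root


def match_elements(trie: dict, clause: str) -> List[Tuple[int, int, str, str]]:
    """Scan a clause and return (start, end, keyword, type) tuples.

    At each position the longest keyword is taken, so the result is a
    left-to-right sequence of useful elements with their positions
    (sketch of step S1031).
    """
    hits = []
    i = 0
    while i < len(clause):
        node, longest = trie, None
        for j in range(i, len(clause)):
            node = node.get(clause[j])
            if node is None:
                break
            if END in node:
                longest = (i, j + 1, clause[i:j + 1], node[END])
        if longest:
            hits.append(longest)
            i = longest[1]  # continue after the matched keyword
        else:
            i += 1
    return hits
```

Because every keyword shares one index, a clause is scanned once regardless of how many keywords the rule base holds; adding a keyword only means inserting one more path into the trie.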
Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the invention. As shown in Fig. 6, parsing consists of three steps:
Step S200: Read a message character. The Parser reads one character through the iterative message character reading interface; that is, the interface reads the message byte stream, assembles the bytes into an actual character (for example a Chinese character) according to the corresponding encoding, and returns it to the Parser.
Step S201: Determine the character type. Characters are divided into types according to their functional role in extracting the different elements, for example year, month and day characters and certain special punctuation marks.
Step S202: Notify the Listeners to respond to the event. According to the character type, each Listener (observer) is notified to execute its callback function in response to the character-read event.
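The three parsing steps follow the observer pattern, which the sketch below illustrates. The `Parser` and Listener roles come from the text; the `subscribe` method, the type names returned by `char_type` and the particular character classes are assumptions introduced only for this example.

```python
from typing import Callable, Dict, Iterable, List


def char_type(ch: str) -> str:
    """Classify one character (sketch of step S201).

    Hypothetical classes: the text only names year/month/day characters
    and special punctuation as examples.
    """
    if ch in "年月日":
        return "date_unit"
    if ch.isdigit():
        return "digit"
    if ch in ",。;!?,;!?":
        return "clause_end"
    return "other"


class Parser:
    """Reads characters and dispatches them to registered listeners."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, kind: str, callback: Callable[[str], None]) -> None:
        # A listener registers a callback for one character type.
        self._listeners.setdefault(kind, []).append(callback)

    def parse(self, chars: Iterable[str]) -> None:
        for ch in chars:                               # S200: read one character
            kind = char_type(ch)                       # S201: determine its type
            for callback in self._listeners.get(kind, []):
                callback(ch)                           # S202: notify listeners
```

A specific extraction task is then just another listener: it subscribes to the character types it cares about and accumulates state in its callbacks, while the parser itself stays unchanged.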
Information source extraction is in fact the implementation of one specific Listener in the general extraction framework; it performs information source extraction by continually responding to character-read events. Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the invention. As shown in Fig. 7, the steps of this flow are as follows:
Step S301: We use punctuation marks such as "," to split the text into clauses, and then extract information sources clause by clause.
Step S302: We extract the candidate information source (usually enclosed in " " or 《》) or the list of candidate information sources.
Step S303: If a candidate information source exists, extract the useful elements from the clause together with their positions in the clause. These useful elements and their positions help to locate the real information source and to determine its type. The useful elements are of the following kinds:
a) Media name indicator words, such as "Times", "net", "news", "blog", "mhkc", "evening paper". A candidate news source string that ends with a media name indicator word is probably a real media name, for example "Sina Blog" or 《Maeil Business Newspaper》.
b) Date information. A candidate news source is often accompanied by the report date, for example "June 24-25" or "April 1".
c) Media reporting-action words, such as "message", "report", "reprint", "comment", "publish", "issue". They indicate that the clause may state a news reporting action and thus help to judge whether the candidate news source is a real news source.
d) Media indicator words, such as "domestic", "according to", "media", "website". They usually appear around a candidate news source and indicate that the candidate news source string is probably a media name.
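Steps S301 and S302 above can be sketched with two small functions. The delimiter set, the regular expressions and the function names are assumptions for illustration; the text only states that punctuation such as "," splits clauses and that candidates are usually enclosed in quotation marks or 《》.

```python
import re
from typing import List

# Hypothetical delimiter set: ASCII and full-width clause punctuation.
CLAUSE_DELIMS = re.compile(r"[,,。;;!?!?]")
# Candidates enclosed in curly quotes, straight quotes, or 《》.
CANDIDATE = re.compile(r'“([^”]+)”|"([^"]+)"|《([^》]+)》')


def split_clauses(text: str) -> List[str]:
    """Split text into non-empty clauses at punctuation (sketch of step S301)."""
    return [c.strip() for c in CLAUSE_DELIMS.split(text) if c.strip()]


def candidate_sources(clause: str) -> List[str]:
    """Return the quoted strings in a clause as candidates (sketch of step S302)."""
    return [
        next(g for g in m.groups() if g)  # the one alternative that matched
        for m in CANDIDATE.finditer(clause)
    ]
```

A clause with no quoted string yields an empty candidate list, in which case step S303 is skipped for that clause.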
Step S304: On this basis, we can easily match the predefined rules one by one and judge whether the candidate information source (if any) is a real information source.
For example, one of the simplest heuristic rules is as follows: if a clause contains exactly one candidate information source and a media reporting-action word, and one of the following conditions also holds, the candidate information source can be judged to be a real information source:
a) The candidate news source string ends with a media name indicator word.
b) Date information appears in the clause containing the candidate news source string.
c) A media indicator word such as "domestic", "according to", "media" or "website" appears around the candidate news source string.
For example, the clause «"NGO develops AC network" March 11 domestic daily magazine note» satisfies the above heuristic rule, so "NGO develops AC network" can be extracted as the information source.
The heuristic rules here are mainly formulated manually by observing messages. They may include many complex rules, and rules are continually added or modified. We have implemented an efficient, extensible information extraction system that flexibly supports adding or modifying rules.
Step S305: We further determine the type of each extracted information source: news media, forum, blog or microblog. For blogs and microblogs, we additionally extract the user name and the blog or microblog site information. Here too we formulate a series of rules and determine the information source type by matching them one by one. These rules use the useful element information provided by step S303, including the media name indicator word contained in the information source name (if any) and the other element information around it. For example, for the www.xinhuanet.com microblog user "XXXX", the extracted information source type is microblog, the user name is "XXXX", and the microblog site is "www.xinhuanet.com microblog".
Step S306: We output all information sources in the message together with their type information.
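The type determination of step S305 can be illustrated with a suffix lookup. The table below is entirely hypothetical: the patent does not list its type rules, and English stand-ins are used for the media name indicator words. The sketch only shows the idea that the indicator word ending a source name suggests its type.

```python
from typing import List, Optional, Tuple

# Hypothetical (indicator word, type) rules, checked in order.
# "microblog" must come before "blog" because it also ends with "blog".
TYPE_BY_INDICATOR: List[Tuple[str, str]] = [
    ("microblog", "microblog"),
    ("blog", "blog"),
    ("forum", "forum"),
    ("news", "news media"),
    ("daily", "news media"),
    ("times", "news media"),
]


def source_type(name: str) -> Optional[str]:
    """Return the type suggested by the indicator word ending the name, if any."""
    lowered = name.lower()
    for suffix, kind in TYPE_BY_INDICATOR:
        if lowered.endswith(suffix):
            return kind
    return None  # no rule matched; other elements would be needed
```

A fuller rule set would also consult the surrounding useful elements, as the text describes, and would extract the user name and site for blog and microblog sources.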
The invention also provides a message information source extraction system that uses the message information source extraction method. Fig. 8 is a structural diagram of the message information source extraction system of the invention. As shown in Fig. 8, the system comprises:
Message content adaptation module 1: shield differences in the encoding or storage of messages and provide a unified iterative interface for reading message characters;
Message parsing module 2: according to the input text, parse the encoding, extract the characters in the text and segment the characters into clauses;
Information source extraction module 3: perform keyword matching on the clauses according to the information source extraction rule base, extract a useful element sequence from each clause, extract the information source on the basis of the useful element sequence, and determine the information source type by matching rules in the information source extraction rule base;
Information source statistics module 4: collect the extraction results for the extracted information sources and compute statistics on the information sources.
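The statistics module can be sketched as a simple aggregation. The patent does not say which statistics are computed; counting occurrences per source name and per source type, and the function name `summarize`, are assumptions chosen as the most basic example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(results: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Aggregate (source name, source type) extraction results into counts."""
    by_source: Counter = Counter()
    by_type: Counter = Counter()
    for name, kind in results:
        by_source[name] += 1   # how often each source was cited
        by_type[kind] += 1     # how many citations fall into each type
    return {"by_source": by_source, "by_type": by_type}
```

For instance, `summarize(...)["by_source"].most_common(10)` would list the ten most frequently cited sources across a batch of messages.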
Wherein, packet parsing module 2 also includes:
Message character read module 21:Message byte stream is read, and byte is assembled into according to coded system actual word
Symbol;
Character types judge module 22:According to character types recognition rule, character is divided into different type;
Response events module 23:According to the different type of character, user is notified to carry out the extraction operation of different type character.
The information source extraction module 3 further includes:
Index building module 31: builds a TRIE keyword index from the useful element library;
Clause segmentation module 32: segments the characters from the event response step into separate clauses;
Extraction processing module 33: performs keyword matching on each clause according to the TRIE keyword index, extracts information
sources, judges whether each information source is genuine, and determines the information source type;
Output module 34: outputs the information sources and their types.
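As a rough illustration of the index building, clause segmentation, and keyword matching described above, the following Python sketch builds a character-level trie from a small keyword table and reports the longest keyword found at each position of a clause. The delimiter set, the label names, and the sample keywords are assumptions for illustration only; the actual contents of the useful element library are not reproduced here.

```python
from typing import Dict, List, Tuple

END = "$end"  # marker key holding the keyword's label at a terminal node


def build_trie(keywords: Dict[str, str]) -> dict:
    """Build a nested-dict trie mapping each keyword to its label."""
    root: dict = {}
    for word, label in keywords.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = label
    return root


def split_clauses(text: str, delimiters: str = ".!?;。！？；") -> List[str]:
    """Cut text into clauses at the (assumed) delimiter characters."""
    clauses, current = [], []
    for ch in text:
        if ch in delimiters:
            clause = "".join(current).strip()
            if clause:
                clauses.append(clause)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        clauses.append(tail)
    return clauses


def match_keywords(clause: str, trie: dict) -> List[Tuple[int, str, str]]:
    """Return (start, keyword, label) for the longest keyword at each position."""
    hits = []
    i = 0
    while i < len(clause):
        node, j, best = trie, i, None
        while j < len(clause) and clause[j] in node:
            node = node[clause[j]]
            j += 1
            if END in node:
                best = (i, clause[i:j], node[END])
        if best:
            hits.append(best)
            i += len(best[1])  # skip past the matched keyword
        else:
            i += 1
    return hits
```

Because the trie works character by character, the same code applies unchanged to Chinese text; only the keyword table and the delimiter set would differ.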
The extraction processing module 33 further includes:
Information source extraction module 331: extracts information sources clause by clause, using the TRIE keyword index built from
the useful element library to extract a candidate information source or a list of candidate information sources;
Useful element extraction module 332: according to the candidate information source or the list of candidate information sources, extracts from the clause
the useful elements and their positions in the clause;
True information source judgment module 333: judges, using predefined true information source recognition rules, whether a candidate information source
is a true information source;
Information source type extraction module 334: determines the information source type by matching the predefined information source type recognition rules
against the useful elements.
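The candidate extraction, useful element extraction, and true-source judgment stages can be sketched as follows. This is a simplified reading of one heuristic of the kind the rule base might hold: a clause yields a true source only if it contains exactly one candidate and a report behavior word, and the candidate either ends with a media indicator word or shares the clause with a date. The word lists, the date pattern, and all names are hypothetical, and a plain scan stands in for the TRIE index for brevity.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    start: int   # position of the element in the clause
    text: str
    kind: str    # "candidate", "report_verb", or "date"


# Hypothetical useful-element library (illustrative values only).
CANDIDATES = ["Xinhua News Agency", "Beijing Daily"]
REPORT_VERBS = ["reported", "said"]
MEDIA_SUFFIXES = ("Agency", "Daily", "Times")
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}\b"
)


def extract_elements(clause: str) -> List[Element]:
    """Collect useful elements and their positions, ordered by position."""
    elements = []
    for kind, words in (("candidate", CANDIDATES), ("report_verb", REPORT_VERBS)):
        for word in words:
            for m in re.finditer(re.escape(word), clause):
                elements.append(Element(m.start(), word, kind))
    for m in DATE_PATTERN.finditer(clause):
        elements.append(Element(m.start(), m.group(), "date"))
    return sorted(elements, key=lambda e: e.start)


def true_source(clause: str) -> Optional[str]:
    """Apply the simplified heuristic; return the source name or None."""
    elements = extract_elements(clause)
    candidates = [e for e in elements if e.kind == "candidate"]
    has_verb = any(e.kind == "report_verb" for e in elements)
    has_date = any(e.kind == "date" for e in elements)
    if len(candidates) != 1 or not has_verb:
        return None
    name = candidates[0].text
    if name.endswith(MEDIA_SUFFIXES) or has_date:
        return name
    return None
```

Keeping the rule as a small function over an element list makes it straightforward to add or modify rules, which matches the rule base being manually maintained.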
In the information source type extraction module 334, for information sources whose type is blog or microblog, the user name and
the blog website information must additionally be extracted.
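A minimal sketch of type determination with the extra blog and microblog fields might look like the following. The keyword-to-type table and the regular expression for the `<site> user "<name>"` form are assumptions chosen to fit the 'www.xinhuanet.com microblog user "XXXX"' example; the real type recognition rules are not reproduced here.

```python
import re
from typing import Dict, Optional

# "microblog" is checked before "blog" because "blog" is a substring of it.
TYPE_KEYWORDS = [("microblog", "microblog"), ("blog", "blog"), ("forum", "forum")]

# Hypothetical pattern: <site ending in blog/microblog> user "<name>"
USER_PATTERN = re.compile(r'^(?P<site>.+?(?:microblog|blog)) user "(?P<user>[^"]+)"$')


def classify_source(source: str) -> Dict[str, Optional[str]]:
    """Return the source type, plus user and site for blog-like sources."""
    result: Dict[str, Optional[str]] = {"type": "news", "user": None, "site": None}
    for keyword, source_type in TYPE_KEYWORDS:
        if keyword in source:
            result["type"] = source_type
            break
    if result["type"] in ("blog", "microblog"):
        m = USER_PATTERN.match(source)
        if m:
            result["user"] = m.group("user")
            result["site"] = m.group("site")
    return result
```

Sources that match none of the keywords fall back to the news type in this sketch, which is a simplification rather than a rule stated in the text.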
The message information source extraction system is described below with reference to a specific embodiment of the invention. Fig. 9 is
a structural schematic of the message information source extraction system of the specific embodiment. As shown in Fig. 9, the message information source extraction system of the present invention comprises the following four
layers:
1) Message content adaptation layer: shields differences such as message encoding and storage mode and provides upper-layer modules with a consistent
interface for iteratively reading message characters, so that upper-layer modules only need to be concerned with the extraction logic.
2) Parser layer: the overall event-driven information extraction flow. The observer design pattern is used here:
the Parser is the subject (Subject), and a series of observers (Observer) are registered with it. The overall flow is as follows:
message characters are read iteratively through the content adaptation layer; each character read is treated as an event, and every observer is notified to execute
the corresponding callback function in response to the event.
3) Extractor layer: corresponds to an observer Listener that implements specific event response actions
to carry out a specific information extraction function. Information source extraction is one concrete implementation of the Extractor layer: from the input
message content it extracts information sources of types such as news, forum, blog, and microblog; it provides name
normalization for news and forum information sources, and user name and site name extraction for blog and microblog information sources.
4) Information source statistics layer: traverses the message database to read messages and performs information source extraction on
each message's content. Finally, it aggregates all extraction results, computes statistics such as the number of occurrences of each extracted information source and the distribution
of message categories, and writes the results to the database.
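The layered design can be sketched in Python as below: a Parser acting as the subject, a Listener acting as the observer, and a statistics function that aggregates results over several messages. This is an illustrative sketch only. The class and method names (`on_char`, `on_end`, `SourceExtractor`, `count_sources`) are hypothetical, the extractor uses a plain substring check in place of the rule base, and messages are passed in as strings instead of being read from a database.

```python
from collections import Counter
from typing import Iterable, List, Protocol


class Listener(Protocol):
    """Observer interface: callbacks invoked by the Parser."""
    def on_char(self, ch: str) -> None: ...
    def on_end(self) -> None: ...


class Parser:
    """Subject: iterates characters and notifies every registered observer."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def register(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def parse(self, chars: Iterable[str]) -> None:
        for ch in chars:            # each character read is one event
            for listener in self._listeners:
                listener.on_char(ch)
        for listener in self._listeners:
            listener.on_end()


class SourceExtractor:
    """Observer: buffers a clause and records known source names found in it."""

    def __init__(self, known_sources: List[str]) -> None:
        self.known_sources = known_sources
        self.found: List[str] = []
        self._buffer: List[str] = []

    def on_char(self, ch: str) -> None:
        if ch in ".!?;":
            self._flush()
        else:
            self._buffer.append(ch)

    def on_end(self) -> None:
        self._flush()

    def _flush(self) -> None:
        clause = "".join(self._buffer)
        self._buffer.clear()
        self.found.extend(s for s in self.known_sources if s in clause)


def count_sources(messages: Iterable[str], known_sources: List[str]) -> Counter:
    """Statistics layer: total occurrences of each source across messages."""
    totals: Counter = Counter()
    for message in messages:
        parser, extractor = Parser(), SourceExtractor(known_sources)
        parser.register(extractor)
        parser.parse(message)
        totals.update(extractor.found)
    return totals
```

Because the Parser knows nothing about what its observers extract, other extractors could be registered alongside the information source extractor without changing the parsing loop.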
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention,
those skilled in the art can make various corresponding changes and modifications according to the invention, but all such changes and
modifications shall fall within the protection scope of the appended claims.
Claims (17)
1. A message information source extraction method, characterized in that the method extracts the information sources in a message by matching
keywords of an information source extraction rule base, and determines the information source type by matching rules of the information source extraction rule base, the rules being
formulated manually by observing messages and being addable or modifiable, the method comprising:
Message parsing step: according to the input text, extracting the characters in the text and segmenting the characters
into different clauses, the message parsing step further comprising:
Message character reading step: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
Character type judgment step: classifying characters into different types according to character type recognition rules;
Event response step: according to the type of each character, notifying the user to perform the extraction operation for that type of character;
Information source extraction step: performing keyword matching on the clauses according to the information source extraction rule base, extracting
a useful element sequence from each clause, extracting information sources on the useful element sequence, and determining the information source type by matching
rules of the information source extraction rule base, the information source extraction rule base further comprising: a useful element library, true information
source recognition rules, information source type recognition rules, and character type recognition rules, the information source extraction step further comprising:
Index building step: building a TRIE keyword index from the useful element library;
Clause segmentation step: segmenting the characters from the event response step into different clauses;
Extraction processing step: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting
information sources, judging whether each information source is genuine, and determining the information source type;
Output step: outputting the information sources and the information source types.
2. The message information source extraction method according to claim 1, characterized in that, before the message parsing
step, the method further comprises:
Message content adaptation step: shielding differences in message encoding or storage mode and providing a unified interface for
iteratively reading message characters.
3. The message information source extraction method according to claim 2, characterized in that the method further comprises:
Information source statistics step: aggregating the extraction results of the extracted information sources and computing statistics on the information sources.
4. The message information source extraction method according to claim 1, characterized in that the extraction processing step further comprises:
Information source extraction step: extracting information sources clause by clause, using the TRIE keyword index built from the useful element library
to extract a candidate information source or a list of candidate information sources;
Useful element extraction step: according to the candidate information source or the list of candidate information sources, extracting from the clause
the useful elements and the positions of the useful elements in the clause;
True information source judgment step: judging, using the predefined true information source recognition rules, whether the candidate
information source is a true information source;
Information source type extraction step: determining the information source type by matching the predefined information source type recognition rules
against the useful elements.
5. The message information source extraction method according to claim 4, characterized in that the useful element library contains useful
elements, the useful elements comprising: media name indicator words, date information, media report behavior words, and media indicator words.
6. The message information source extraction method according to claim 5, characterized in that the true information source recognition rules are
heuristic rules formulated manually by observing messages, and the rules can be added or modified.
7. The message information source extraction method according to claim 6, characterized in that the true information source recognition rules
contain a heuristic rule: if a clause contains only one candidate information source and contains the media report behavior
word, and the string of the candidate information source ends with the media indicator word, or the date information appears in the clause
containing the subsequent source string, or the media indicator word appears in the subsequent source string, then the candidate
information source is judged to be a true information source.
8. The message information source extraction method according to claim 1, characterized in that the information source types comprise: news
media, forum, blog, and microblog.
9. The message information source extraction method according to claim 4, characterized in that, in the information source type extraction
step, for information sources whose type is blog and/or microblog, the user name or blog website
information must additionally be extracted.
10. A message information source extraction system employing the message information source extraction method according to any one of claims 1-9,
characterized in that the system comprises:
Message parsing module: decoding the input text according to its encoding, extracting the characters in the text, and segmenting
the characters into different clauses;
Information source extraction module: performing keyword matching on the clauses according to the information source extraction rule base, extracting
a useful element sequence from each clause, extracting information sources on the useful element sequence, and determining the information source type by matching
rules of the information source extraction rule base.
11. The message information source extraction system according to claim 10, characterized in that the information source extraction rule base
further comprises: a useful element library, true information source recognition rules, information source type recognition rules, and character type recognition rules.
12. The message information source extraction system according to claim 10, characterized in that the system further comprises:
Message content adaptation module: shielding differences in message encoding or storage mode and providing a unified interface for
iteratively reading message characters.
13. The message information source extraction system according to claim 10 or 11, characterized in that the system further
comprises:
Information source statistics module: aggregating the extraction results of the extracted information sources and computing statistics on the information sources.
14. The message information source extraction system according to claim 10, characterized in that the message parsing module further
comprises:
Message character reading module: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
Character type judgment module: classifying characters into different types according to the character type recognition rules;
Event response module: according to the type of each character, notifying the user to perform the extraction operation for that type of character.
15. The message information source extraction system according to claim 10, characterized in that the information source extraction module further
comprises:
Index building module: building a TRIE keyword index from the useful element library;
Clause segmentation module: segmenting the characters from the event response step into different clauses;
Extraction processing module: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting
information sources, judging whether each information source is genuine, and determining the information source type;
Output module: outputting the information sources and the information source types.
16. The message information source extraction system according to claim 15, characterized in that the extraction processing module further
comprises:
Information source extraction module: extracting information sources clause by clause, using the TRIE keyword index built from the useful element library
to extract a candidate information source or a list of candidate information sources;
Useful element extraction module: according to the candidate information source or the list of candidate information sources, extracting from the clause
the useful elements and the positions of the useful elements in the clause;
True information source judgment module: judging, using the predefined true information source recognition rules, whether the candidate
information source is a true information source;
Information source type extraction module: determining the information source type by matching the predefined information source type recognition rules
against the useful elements.
17. The message information source extraction system according to claim 16, characterized in that, in the information source type extraction
module, for information sources whose type is blog or microblog, the user name and blog website
information must additionally be extracted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410010836.XA CN103778200B (en) | 2014-01-09 | 2014-01-09 | A kind of message information source abstracting method and its system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410010836.XA CN103778200B (en) | 2014-01-09 | 2014-01-09 | A kind of message information source abstracting method and its system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103778200A CN103778200A (en) | 2014-05-07 |
CN103778200B true CN103778200B (en) | 2017-08-08 |
Family
ID=50570435
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410010836.XA Active CN103778200B (en) | 2014-01-09 | 2014-01-09 | A kind of message information source abstracting method and its system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103778200B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104408101B (en) * | 2014-11-19 | 2018-01-09 | 南京大学 | A kind of full range Web information extracts integrated approach |
CN106815203B (en) * | 2015-12-01 | 2021-03-30 | 北京国双科技有限公司 | Method and device for analyzing amount of money in referee document |
CN105447202A (en) * | 2015-12-31 | 2016-03-30 | 宁波公众信息产业有限公司 | Internet information collecting system |
CN106021439A (en) * | 2016-05-16 | 2016-10-12 | 腾讯科技(深圳)有限公司 | Communication number processing method and device |
CN106484767B (en) * | 2016-09-08 | 2019-06-21 | 中国科学院信息工程研究所 | A kind of event extraction method across media |
CN108268438B (en) * | 2016-12-30 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Page content extraction method and device and client |
CN107423279B (en) * | 2017-04-11 | 2021-01-15 | 美林数据技术股份有限公司 | Information extraction and analysis method for financial credit short message |
CN107169061B (en) * | 2017-05-02 | 2020-12-11 | 广东工业大学 | Text multi-label classification method fusing double information sources |
CN111090744A (en) * | 2019-12-17 | 2020-05-01 | 中科鼎富(北京)科技发展有限公司 | Stock market operation risk information mining method and device |
CN112380257A (en) * | 2020-11-26 | 2021-02-19 | 厦门市美亚柏科信息股份有限公司 | Network data stream locking method, terminal equipment and storage medium |
CN112597405A (en) * | 2020-12-17 | 2021-04-02 | 中国科学院计算技术研究所数字经济产业研究院 | Event external information source extraction method based on microblog platform |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101071420A (en) * | 2007-06-22 | 2007-11-14 | 腾讯科技(深圳)有限公司 | Method and system for cutting index participle |
CN101344889A (en) * | 2008-07-31 | 2009-01-14 | 中国农业大学 | Method and system for network information extraction |
CN101727461A (en) * | 2008-10-13 | 2010-06-09 | 中国科学院计算技术研究所 | Method for extracting content of web page |
CN102999534A (en) * | 2011-09-19 | 2013-03-27 | 北京金和软件股份有限公司 | Chinese word segmentation algorithm based on reverse maximum matching |
CN103150432A (en) * | 2013-03-07 | 2013-06-12 | 宁波成电泰克电子信息技术发展有限公司 | Method for internet public opinion analysis |
Also Published As
Publication number | Publication date |
---|---|
CN103778200A (en) | 2014-05-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |