CN103853834B - Text structure analysis-based Web document abstract generation method - Google Patents

Info

Publication number
CN103853834B
CN103853834B CN201410090200.0A CN201410090200A CN103853834B CN 103853834 B CN103853834 B CN 103853834B CN 201410090200 A CN201410090200 A CN 201410090200A CN 103853834 B CN103853834 B CN 103853834B
Authority
CN
China
Prior art keywords
text
sentence
cut
semantic
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410090200.0A
Other languages
Chinese (zh)
Other versions
CN103853834A (en)
Inventor
沈怡涛
顾君忠
林晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201410090200.0A
Publication of CN103853834A
Application granted
Publication of CN103853834B
Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating Web document summaries based on text structure analysis. The method takes a URL as input, extracts the main body text of the web page by combining visual features and text features, partitions the body into several semantic paragraphs, and summarizes each semantic paragraph, so that the generated summary has a higher coverage rate. The method addresses the facts that web page structure is complex, that the main body is hard to identify, and that Chinese automatic summarization is still at an exploratory stage, and it produces a better-quality text summary from a web page.

Description

Method for generating Web document summaries based on text structure analysis
Technical field
The present invention relates to the technical fields of web page body text extraction, natural language processing, and Chinese automatic text summarization, and specifically to a method for generating Web document summaries based on text structure analysis.
Background technology
At present, the Internet has become the main source from which people obtain information. In particular, with the rapid development of user-generated content (UGC) in recent years, the information on the Internet is growing explosively. A search engine can return search results according to the user's request, but the user still has to find the web page that best suits his or her needs in the result list. The large amount of search engine optimization and reposting on the Internet makes it very difficult for users to find information quickly and accurately.
An automatic summarization system uses a computer to process Web documents quickly and extracts the core content of a Web document at a given compression ratio. From the summary, the user can obtain the subject information and judge the value of the Web document, which improves the efficiency of searching for information.
Web documents contain a large amount of noise, that is, information unrelated to the topic such as advertisements, navigation bars, user function bars, related recommendations, and copyright notices. A Web document is semi-structured information: it has a certain structure, but its semantics cannot be determined from that structure. The representation of the content in the HTML source code differs greatly from the finally rendered page. With the wide use of JS and AJAX in recent years, web page data is no longer static HTML code but is generated dynamically, and may even change in response to user actions. Extracting correctly structured, topic-related content from a Web document is therefore difficult.
Research on Chinese automatic text summarization systems has a history of about two decades, but it is still at an exploratory stage, and the results of automatic summarization are far from satisfactory. Automatic summarization methods fall into two broad classes: understanding-based summarization and extraction-based summarization. Because natural language processing technology has not yet achieved a major breakthrough, understanding-based methods cannot yet truly realize automatic summarization.
The research history of automatic summarization for Web documents is shorter still. "Compared with traditional text, the text of a web page is loosely structured, titles are named less rigorously, sentences may end without a terminating punctuation mark, and there is a large amount of content unrelated to the body text, which brings certain difficulties to summary generation."
Content of the invention
The object of the invention is to provide a method for generating Web document summaries based on text structure analysis. The method combines techniques such as visual feature analysis, natural language analysis, and text structure analysis to generate a semantics-based, good-quality summary for each web page in the search results, as a reference for the user.
The object of the present invention is achieved as follows:
A method for generating Web document summaries based on text structure analysis, comprising the following steps:
1) input the URL of the web page to be summarized;
2) extract the body text from the web page to be summarized on the basis of visual analysis, specifically:
2.1) parse and render the Web document using a browser core;
2.2) partition the web page into blocks using the visual tree (VIPS) algorithm, obtaining the position and area of each block;
2.3) perform word segmentation on each block;
2.4) analyze the text features of each block;
2.5) score each block on whether it contains body text;
2.6) concatenate, in order, the text of the blocks whose score is higher than a certain threshold;
2.7) output the body text of the Web document;
3) perform automatic summarization based on text structure analysis on the extracted body text, specifically:
3.1) obtain the web page body text from step 2);
3.2) perform word segmentation and part-of-speech tagging on the body text;
3.3) perform text preprocessing: identify the basic structure of the text, that is, the article title, sentence boundaries, and paragraph breaks;
3.4) split the text into semantic sections: identify the positions where the semantics change by text structure analysis, and use them as the marks for semantic section splitting;
3.5) for each semantic section, measure the importance of each sentence within its semantic section using an improved TFIDF method, then, according to the required summary length, extract the sentences that best represent the topic of that semantic section;
3.6) concatenate the extracted sentences in order and output the summary.
The text features in step 2.4) are the character count, font size, number of assertive sentences, number of non-assertive sentences, and number of text paragraphs.
In step 2.5), the score indicating whether each block contains body text is calculated using the following formula:
$$V(S) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$
where S is the number of assertive sentences, N is the number of non-assertive sentences, P is a value computed from the size and position of the block, (x_1, y_1) are the coordinates of the upper-left corner of the block, and (x_2, y_2) are the coordinates of its lower-right corner.
The positions where the semantics change in step 3.4) are identified as follows:
3.4-1) split document D into sentences; the gap between every two adjacent sentences is a candidate cut point;
3.4-2) score each candidate cut point using the formula:
$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$
where R(s_i, s_j) is the semantic relatedness between sentences s_i and s_j, and p_i is the cut point between sentences s_{i-1} and s_i. If Q(p_i) > Q(p_{i-1}) and Q(p_i) > Q(p_{i+1}), then p_i is a local maximum of the cut-point score, so p_i is a cut point between semantic sections in the text. a is an adjustable empirical parameter that sets the range of the semantic analysis when identifying cut points, that is, a sentences are considered on each side of the cut point.
3.4-3) if the score of a cut point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, the cut point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).
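Steps 3.4-2) and 3.4-3) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the `rel` callback, and the treatment of missing neighbours at the ends of the document are assumptions. The sketch sums over the a sentences on each side of s_i, following the textual description of the parameter a; the lower bound printed in the first sum of the formula may differ from this by one sentence.

```python
from typing import Callable, Dict, List


def cut_point_scores(n: int, rel: Callable[[int, int], float], a: int) -> Dict[int, float]:
    """Score each candidate cut point p_i (between sentence i-1 and sentence i).

    n   -- number of sentences, indexed 0..n-1
    rel -- rel(i, j) returns the relatedness R(s_i, s_j) (hypothetical callback)
    a   -- window size: number of sentences considered on each side of s_i
    """
    scores = {}
    for i in range(1, n):
        following = sum(rel(i, j) for j in range(i + 1, min(i + a, n - 1) + 1))
        preceding = sum(rel(i, j) for j in range(max(i - a, 0), i))
        scores[i] = following - preceding
    return scores


def segment_boundaries(scores: Dict[int, float], threshold: float) -> List[int]:
    """Keep cut points that exceed the threshold and are strict local maxima.

    Assumption: a missing neighbour at either end of the text counts as -infinity.
    """
    cuts = []
    for i in sorted(scores):
        left = scores.get(i - 1, float("-inf"))
        right = scores.get(i + 1, float("-inf"))
        if scores[i] > threshold and scores[i] > left and scores[i] > right:
            cuts.append(i)
    return cuts
```

A returned index i means that a new semantic section starts at sentence i.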
The analysis identification step 2 of the position that described semanteme changes) between sentence the calculating of semantic relevancy include following Step:
3.4-2-1) sentence is cut into the set of word;
3.4-2-2) below equation is used to calculate semantic relevancy between sentence
R ( s 1 , s 2 ) = &Sigma; w i &Element; s 1 m a x ( R ( w i , w j ) ) ( w j &Element; s 2 )
Wherein R (wi, wj) represent word wiWith word wjWord between semantic relevancy.
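The sentence relatedness formula translates almost directly into code. The sketch below assumes sentences are already segmented into word lists and takes the word-level relatedness as a caller-supplied function; the names `sentence_relatedness` and `word_rel` are illustrative, and returning 0.0 for an empty sentence is an assumption.

```python
from typing import Callable, Sequence


def sentence_relatedness(
    s1: Sequence[str],
    s2: Sequence[str],
    word_rel: Callable[[str, str], float],
) -> float:
    """R(s1, s2): for each word in s1, take its best match in s2, then sum."""
    if not s1 or not s2:
        return 0.0  # assumption: an empty sentence is unrelated to everything
    return sum(max(word_rel(wi, wj) for wj in s2) for wi in s1)
```

Note that the measure is not symmetric: the sum runs over the words of the first sentence only, so R(s1, s2) and R(s2, s1) can differ when the sentences have different lengths.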
In step 3.5), the importance of each sentence within its semantic section is calculated using the following formula:
$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$
where, when computing TFIDF(w), each paragraph is treated as an independent document and the paragraphs of the whole article are treated as the document collection.
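As a rough illustration of this sentence score, the sketch below treats each paragraph as a document and the article as the collection. The exact TF and IDF variants are not given in the text, so the relative-frequency TF and the log(N / df) IDF used here are assumptions, as are the function names and the choice to count each distinct word of a sentence once.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_by_paragraph(paragraphs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Return one {word: tf-idf} table per paragraph (paragraph = document)."""
    n = len(paragraphs)
    # Document frequency: in how many paragraphs does each word occur?
    df = Counter(w for p in paragraphs for w in set(p))
    tables = []
    for p in paragraphs:
        tf = Counter(p)
        tables.append({w: (c / len(p)) * math.log(n / df[w]) for w, c in tf.items()})
    return tables


def sentence_importance(sentence: Sequence[str], table: Dict[str, float]) -> float:
    """V(S1): sum of the tf-idf weights of the distinct words in the sentence."""
    return sum(table.get(w, 0.0) for w in set(sentence))
```

With this IDF, a word that occurs in every paragraph contributes nothing, so sentences made up of article-wide vocabulary score lower than sentences carrying paragraph-specific terms.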
The present invention filters out text, links, and other content in a web page that is unrelated to the topic and identifies the article body contained in the web page with high accuracy and robustness. The automatic summarization flow uses summarization based on text structure analysis, so the generated summary has high coverage and reads more smoothly.
For a Web document and a compression ratio specified by the user, the invention needs only the URL of the web page to be summarized as input and, within a few seconds, produces a reasonably accurate and fluent summary that covers the meaning of the original text, helping users find information on the Internet quickly and accurately.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is the web page preprocessing flow chart of the present invention;
Fig. 3 is the automatic summarization flow chart of the present invention.
Specific embodiment
The invention discloses a search-engine-oriented method for generating Web document summaries, which automatically analyzes a web page and generates a text summary reflecting the topic of the page.
The present invention comprises a web page body text extraction stage that combines visual features and text features, and an automatic text summarization stage that divides the text into sub-topics by text structure analysis.
The present invention takes a URL as input and, through the two stages of body text extraction and automatic summarization, finally generates a text summary.
The specific algorithms of the two stages are described further below, using the summarization of a news web page as an example:
Fig. 1 shows the overall flow from the URL to be summarized to the generated summary, which comprises the web page preprocessing flow and the automatic summarization flow.
Specifically, in this embodiment, the URL input step of the web page preprocessing flow (see Fig. 2) obtains the URL of the news web page to be summarized. By analyzing visual features, the web page preprocessing flow finds the body text of the page more accurately and is more robust than other methods. It also takes into account other features, such as text features, text Controlling UEP, HTML tag features, and semantic features, to further improve the accuracy of body text extraction.
The web page rendering step reads the web page at the input URL. In this embodiment, the IE11 browser core processes the HTML tags and renders the page. On the basis of the rendered page, the visual tree analysis step applies the VIPS algorithm to the page and obtains the position and area of each block. In this embodiment, this step divides the news web page to be summarized into 6 blocks: one top block, one bottom block, one navigation block, one advertisement block, and two blocks containing body text. The word segmentation step performs word segmentation on each block. The text feature analysis step then analyzes the text features of the segmentation results. Finally, the comprehensive analysis step combines the block features obtained from the visual tree analysis with the text features and outputs the body text.
In this embodiment, P(x_1, y_1, x_2, y_2) is calculated using the following formula:
$$P(x_1, y_1, x_2, y_2) = (x_2 - x_1)(y_2 - y_1) - x_1 y_1$$
where (x_1, y_1) are the coordinates of the upper-left corner of the block and (x_2, y_2) are the coordinates of its lower-right corner. The V(S) value of each block is then calculated:
$$V(S) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$
The V(S) values of the above 6 blocks, from largest to smallest, are 3.7 × 10^6, 2.3 × 10^6, 7.5 × 10^5, 5.4 × 10^6, 3.7 × 10^5, 1.6 × 10^5, 1.2 × 10^4.
In this embodiment, the threshold is 10^6, so the blocks with V(S) greater than 10^6 are selected, that is, the two blocks with the largest V(S) values. In this embodiment, these two blocks are exactly the two blocks containing body text, so the body text is extracted correctly.
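The block scoring and thresholding described above can be sketched as follows. The `Block` record and the function names are hypothetical, and the sketch assumes the sentence counts S and N have already been produced by the text feature analysis step; only the two formulas and the 10^6 threshold come from the embodiment.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Block:
    """One visual block of the rendered page (hypothetical record)."""
    x1: float  # upper-left corner
    y1: float
    x2: float  # lower-right corner
    y2: float
    assertive: int      # S: number of assertive sentences
    non_assertive: int  # N: number of non-assertive sentences
    text: str


def position_weight(b: Block) -> float:
    """P = block area minus x1 * y1, so large blocks near the origin weigh more."""
    return (b.x2 - b.x1) * (b.y2 - b.y1) - b.x1 * b.y1


def block_score(b: Block) -> float:
    """V(S) = S^2 * P / (N + 1)."""
    return b.assertive ** 2 * position_weight(b) / (b.non_assertive + 1)


def extract_body(blocks: Iterable[Block], threshold: float = 1e6) -> str:
    """Concatenate, in page order, the text of blocks scoring above the threshold."""
    return "\n".join(b.text for b in blocks if block_score(b) > threshold)
```

Squaring S lets blocks with many complete sentences dominate, while the N + 1 denominator penalizes link lists and menus without dividing by zero.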
After the body text is extracted, the automatic summarization flow (see Fig. 3) is carried out. It comprises the steps of text preprocessing, word relatedness calculation, sentence relatedness calculation, semantic section segmentation, and summary generation.
The text preprocessing step identifies the basic structure of the text, that is, the article title, sentence boundaries, and paragraph breaks. In this embodiment, the body text contains 8 paragraphs and 23 sentences in total.
The word relatedness calculation step is based on the computational semantic knowledge provided by HowNet and obtains the relatedness of two words by calculating the similarity of their sememes, using the following formula:
$$R(w_1, w_2) = \max_{C_i \in w_1,\; C_j \in w_2} \mathrm{Rele}(C_i, C_j)$$
where R(w_1, w_2) is the semantic relatedness between the two words and Rele(C_i, C_j) is the relatedness of two sememes; the maximum over all sememe pairs is taken as the semantic relatedness of the two words.
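A minimal sketch of this word-level measure is shown below. It does not use any real HowNet API: the `sememes` dictionary and the `sememe_rel` function are hypothetical stand-ins for the sememe lookup and the Rele measure, and scoring unknown words as 0.0 is an assumption.

```python
from itertools import product
from typing import Callable, Collection, Mapping


def word_relatedness(
    w1: str,
    w2: str,
    sememes: Mapping[str, Collection[str]],
    sememe_rel: Callable[[str, str], float],
) -> float:
    """R(w1, w2): best relatedness over all sememe pairs of the two words."""
    c1 = sememes.get(w1, ())
    c2 = sememes.get(w2, ())
    # default=0.0 covers words with no known sememes (an assumption)
    return max((sememe_rel(a, b) for a, b in product(c1, c2)), default=0.0)
```

Taking the maximum rather than the average means that two words count as related as soon as any one pair of their senses is related.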
The sentence relatedness step obtains the relatedness of two sentences by analyzing the relatedness between the words of the two sentences:
$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$
where R(s_1, s_2) is the relatedness between the two sentences. For each word in sentence 1, the word in sentence 2 with the greatest relatedness to it is found and the relatedness between the two words is calculated. These maxima are then summed to give the relatedness between the two sentences.
The semantic section segmentation step performs text structure analysis with reference to the document "Research on Text Structure Analysis Methods Based on Content Relatedness Computation". A cut point between semantic sections is characterized by the fact that the first sentence after the cut point has very low relatedness to the several sentences before it and higher relatedness to the several sentences after it. In this embodiment, the following formula is used to score the 22 cut points between the 23 sentences and to find the local maxima of the function Q(p_i):
$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$
In this embodiment, Q(p_i) has 2 local maxima, and the news article is divided into 3 semantic sections at these two points. Each semantic section contains one sub-topic of the news. In this embodiment, the first semantic section is an overview of the news event, and the last two are comments on the event from two sides.
The summary generation step extracts the summary from the plain-text body at a given ratio, according to the user's requirement.
In this embodiment, the summary generation step uses the sentence relatedness calculation step to compute, for each sub-topic, the sum of the relatedness between its sentences and the word sequence of the article title, and thereby determines the value of each sub-topic. The number of sentences extracted from a sub-topic is proportional to the relatedness between that sub-topic and the article title.
In this embodiment, the ratio specified by the user is 0.2, that is, 5 of the 23 sentences are extracted to form the summary. By calculating the value of the 3 sub-topics, it is determined that 2, 1, and 1 sentences are extracted from the 3 semantic sections respectively. Finally, the summary generation step concatenates the 5 selected summary sentences in order to form and output the summary.
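The text states only that the number of sentences taken from each sub-topic is proportional to its relatedness to the title; it does not say how fractional shares are rounded. The sketch below is one possible reading that uses largest-remainder rounding, which is an assumption, as are the function name and its parameters. It is not claimed to reproduce the counts of the embodiment.

```python
from typing import List, Sequence


def allocate_quota(values: Sequence[float], total: int) -> List[int]:
    """Split `total` summary sentences across sub-topics in proportion to `values`.

    values -- relatedness of each sub-topic to the article title
    total  -- number of sentences the summary should contain
    """
    weight_sum = sum(values)
    if weight_sum <= 0 or total <= 0:
        return [0] * len(values)
    exact = [v * total / weight_sum for v in values]
    quota = [int(e) for e in exact]  # floor of each proportional share
    leftover = total - sum(quota)
    # Hand the remaining sentences to the largest fractional parts first.
    order = sorted(range(len(values)), key=lambda k: exact[k] - quota[k], reverse=True)
    for k in order[:leftover]:
        quota[k] += 1
    return quota
```

In a fuller implementation each quota would also be capped at the number of sentences the semantic section actually contains.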

Claims (5)

1. A method for generating Web document summaries based on text structure analysis, characterized in that the method comprises the following steps:
1) input the URL of the web page to be summarized;
2) extract the body text from the web page to be summarized on the basis of visual analysis, specifically:
2.1) parse and render the Web document using a browser core;
2.2) partition the web page into blocks using the visual tree algorithm, obtaining the position and area of each block;
2.3) perform word segmentation on each block;
2.4) analyze the text features of each block;
2.5) score each block on whether it contains body text, the score being calculated using the following formula:
$$V(S) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$
where S is the number of assertive sentences, N is the number of non-assertive sentences, P is a value computed from the size and position of the block, (x_1, y_1) are the coordinates of the upper-left corner of the block, and (x_2, y_2) are the coordinates of its lower-right corner;
2.6) concatenate, in order, the text of the blocks whose score is higher than a certain threshold;
2.7) output the body text of the Web document;
3) perform automatic summarization based on text structure analysis on the extracted body text, specifically:
3.1) obtain the web page body text from step 2);
3.2) perform word segmentation and part-of-speech tagging on the body text;
3.3) perform text preprocessing: identify the basic structure of the text, that is, the article title, sentence boundaries, and paragraph breaks;
3.4) split the text into semantic sections: identify the positions where the semantics change by text structure analysis, and use them as the marks for semantic section splitting;
3.5) for each semantic section, measure the importance of each sentence within its semantic section using an improved TFIDF method, then, according to the required summary length, extract the sentences that best represent the topic of that semantic section;
3.6) concatenate the extracted sentences in order and output the summary.
2. The method according to claim 1, characterized in that the text features in step 2.4) are the character count, font size, number of assertive sentences, number of non-assertive sentences, and number of text paragraphs.
3. The method according to claim 1, characterized in that the positions where the semantics change in step 3.4) are identified as follows:
3.4-1) split document D into sentences; the gap between every two adjacent sentences is a candidate cut point;
3.4-2) score each candidate cut point using the formula:
$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$
where R(s_i, s_j) is the semantic relatedness between sentences s_i and s_j, and p_i is the cut point between sentences s_{i-1} and s_i; if Q(p_i) > Q(p_{i-1}) and Q(p_i) > Q(p_{i+1}), then p_i is a local maximum of the cut-point score, so p_i is a cut point between semantic sections in the text; a is an adjustable empirical parameter that sets the range of the semantic analysis when identifying cut points, that is, a sentences are considered on each side of the cut point;
3.4-3) if the score of a cut point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, the cut point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).
4. The method according to claim 3, characterized in that the semantic relatedness between sentences in step 3.4-2) is calculated by the following steps:
3.4-2-1) split each sentence into a set of words;
3.4-2-2) calculate the semantic relatedness between sentences using the following formula:
$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$
where R(w_i, w_j) is the semantic relatedness between words w_i and w_j.
5. The method according to claim 1, characterized in that the importance of each sentence within its semantic section in step 3.5) is calculated using the following formula:
$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$
where, when computing TFIDF(w), each paragraph is treated as an independent document and the paragraphs of the whole article are treated as the document collection.
CN201410090200.0A 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method Expired - Fee Related CN103853834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410090200.0A CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410090200.0A CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Publications (2)

Publication Number Publication Date
CN103853834A CN103853834A (en) 2014-06-11
CN103853834B true CN103853834B (en) 2017-02-08

Family

ID=50861489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410090200.0A Expired - Fee Related CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Country Status (1)

Country Link
CN (1) CN103853834B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677764B (en) * 2015-12-30 2020-05-08 百度在线网络技术(北京)有限公司 Information extraction method and device
CN106484768B (en) * 2016-09-09 2019-12-31 天津海量信息技术股份有限公司 Local feature extraction method and system for text content saliency region
CN106844340B (en) * 2017-01-10 2020-04-07 北京百度网讯科技有限公司 News abstract generating and displaying method, device and system based on artificial intelligence
CN108959312B (en) 2017-05-23 2021-01-29 华为技术有限公司 Method, device and terminal for generating multi-document abstract
CN107346335B (en) * 2017-06-28 2020-04-14 浙江大学 Webpage theme block identification method based on combination characteristics
CN107622046A (en) * 2017-09-01 2018-01-23 广州慧睿思通信息科技有限公司 A kind of algorithm according to keyword abstraction text snippet
CN107766325B (en) * 2017-09-27 2021-05-28 百度在线网络技术(北京)有限公司 Text splicing method and device
CN108427761B (en) * 2018-03-21 2022-01-14 腾讯科技(深圳)有限公司 News event processing method, terminal, server and storage medium
CN110889280B (en) * 2018-09-06 2023-09-26 上海智臻智能网络科技股份有限公司 Knowledge base construction method and device based on document splitting
CN110968752A (en) * 2018-09-28 2020-04-07 珠海格力电器股份有限公司 Data acquisition method and device, storage medium and electronic equipment
CN113515627B (en) * 2021-05-19 2023-07-25 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
CN114330315A (en) * 2021-12-28 2022-04-12 浙江大华技术股份有限公司 Method and device for processing secure text, storage medium and electronic device
CN114417808B (en) * 2022-02-25 2023-04-07 北京百度网讯科技有限公司 Article generation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method for extracting and processing network information and its system
CN102446191A (en) * 2010-10-13 2012-05-09 北京创新方舟科技有限公司 Method for generating webpage content abstracts and equipment and system adopting same

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930376B2 (en) * 2008-02-15 2015-01-06 Yahoo! Inc. Search result abstract quality using community metadata


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"基于内容相关度计算的文本结构分析方法研究";钟茂生;《中国博士学位论文全文数据库信息科技辑》;20101015(第10期);I138-81 *
"基于分块的网页正文信息提取算法研究";黄文蓓 等;《计算机应用》;20070601;第6卷(第S1期);24-26 *
"基于潜在语义分析的多网页自动文摘研究";何媛媛;《中国优秀硕士学位论文全文数据库信息科技辑》;20080115(第01期);I138-1310 *

Also Published As

Publication number Publication date
CN103853834A (en) 2014-06-11

Similar Documents

Publication Publication Date Title
CN103853834B (en) Text structure analysis-based Web document abstract generation method
CN104933027B (en) A kind of open Chinese entity relation extraction method of utilization dependency analysis
CN105022725B (en) A kind of text emotion trend analysis method applied to finance Web fields
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
TWI695277B (en) Automatic website data collection method
CN105022803B (en) A kind of method and system for extracting Web page text content
CN107590219A (en) Webpage personage subject correlation message extracting method
CN107704503A (en) User's keyword extracting device, method and computer-readable recording medium
Piperski et al. Big and diverse is beautiful: A large corpus of Russian to study linguistic variation
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN105975453A (en) Method and device for comment label extraction
CN102609427A (en) Public opinion vertical search analysis system and method
CN103294781A (en) Method and equipment used for processing page data
CN101887443A (en) Method and device for classifying texts
JP2006351002A5 (en)
CN105843796A (en) Microblog emotional tendency analysis method and device
CN103838796A (en) Webpage structured information extraction method
CN104699797A (en) Webpage data structured analytic method and device
CN104346382B (en) Use the text analysis system and method for language inquiry
CN110008473A (en) A kind of medical text name Entity recognition mask method based on alternative manner
Siklósi Using embedding models for lexical categorization in morphologically rich languages
CN107436931B (en) Webpage text extraction method and device
Nethra et al. WEB CONTENT EXTRACTION USING HYBRID APPROACH.
CN107291686B (en) Method and system for identifying emotion identification
CN102622405B (en) Method for computing text distance between short texts based on language content unit number evaluation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170208

Termination date: 20200312
