WO2012083870A1 - Method and system for incremental collection of forum replies - Google Patents
Method and system for incremental collection of forum replies
- Publication number
- WO2012083870A1 (PCT/CN2011/084457)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- page
- post
- url
- reply
- new
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
Definitions
- The invention belongs to the technical field of network information collection, and in particular relates to a method and system for incremental collection of forum replies.
Background
- Existing forum collection systems collect only the home page of a post, not the reply information of the post.
Summary of the invention
- The technical problem to be solved by the present invention is to provide a method and system for incremental collection of forum replies that can quickly, accurately, and completely collect all main-post and reply information of a post. It overcomes two defects: existing search engines fall short when collecting reply information on the paginated pages of a post, and existing forum collection systems collect only the home page of a post without collecting its replies.
- The technical solutions adopted by the present invention are as follows:
- A forum reply incremental collection method includes the following steps: judging, according to the post home page URL and the reply count of each post, whether the forum list pages to be collected contain new posts or posts with new replies; and extracting the main post and reply information from each new post, and the new reply information from each post with new replies.
- A forum reply incremental collection system comprises: a judging device, which judges, according to the post home page URL and the reply count of each post, whether the forum list pages to be collected contain new posts or posts with new replies; and an extracting device, which extracts the main post and reply information from each new post and the new reply information from each post with new replies.
- With the method and system of the invention, new posts and posts with new replies in a list page can be obtained promptly. URL identification and reply-count information are used for fast deduplication, which avoids repeated collection. Different page-link extraction methods are applied to different page-turning modes, so paginated pages are collected quickly. As a result, all main-post and reply information of a post can be collected quickly, accurately, and completely.
- The miss rate for replies is below 5%, and collection latency can reach the order of minutes.
- FIG. 1 is a structural block diagram of the forum reply incremental collection system in the specific embodiment.
- FIG. 2 is a flowchart of the forum reply incremental collection method in the specific embodiment.
- FIG. 3 is a flowchart of the method for judging whether new posts and posts with new replies exist in a list page in the specific embodiment.
- FIG. 4 is a flowchart of the method for extracting the main post and reply information from new posts, and the new reply information from posts with new replies, in the specific embodiment.
- The forum reply incremental collection system in the present embodiment includes a judging device 11 and an extracting device 12 connected to the judging device 11.
- The judging device 11 includes a first queue unit 111, a first obtaining unit 112, a list page extracting unit 113, and a judging unit 114.
- The extracting device 12 includes a second queue unit 121, a scanning unit 122, a second obtaining unit 123, a content page extracting unit 124, and a deduplication unit 125.
- The judging device 11 is configured to periodically judge, according to the post home page URL and the reply count of each post, whether the forum list pages to be collected contain new posts or posts with new replies.
- The first queue unit 111 is configured to add the URLs of all forum list pages to be collected to the list page collection queue.
- The first obtaining unit 112 is configured to take each list page URL out of the list page collection queue.
- The list page extracting unit 113 is configured, for each list page URL taken out, to obtain the web page content corresponding to that URL and to extract the home page URL and the current reply count of each post from the content.
- The judging unit 114 is configured to determine, according to the post home page URL, whether each post exists in the collected post information table. If it exists, the unit further determines whether the current reply count of the post is greater than the current reply count recorded in the table; if it is greater, the post has new replies, and the last reply count and the current reply count of the post are updated in the table. If the post does not exist in the table, it is a new post, and its home page URL and current reply count are added to the table.
- The extracting device 12 is configured to extract the main post and reply information from each new post and, for each post with new replies, to calculate the new-reply starting point and the number of new replies and to extract the new reply information accordingly.
- The second queue unit 121 is configured to add the home page URL of each new post and the URL of each post with new replies to the content page collection queue.
- The scanning unit 122 is configured to periodically scan the content page collection queue.
- The second obtaining unit 123 is configured to take each URL out of the content page collection queue.
- The content page extracting unit 124 is configured to obtain the web page content corresponding to the URL and to extract the main post and/or replies and/or page-turning URL from the content.
- The deduplication unit 125 is configured to deduplicate the page-turning URL extracted from the web page content when the forum's page-turning mode is the next-page mode.
- The second queue unit 121 is further configured to add the deduplicated page-turning URL to the content page collection queue.
- The flow of the forum reply incremental collection method in this embodiment, based on the system shown in FIG. 1, includes the following steps:
- The judging device 11 periodically judges whether new posts and/or posts with new replies exist in all forum list pages to be collected. As shown in FIG. 3, the judging method used in the present embodiment includes the following steps:
- The first queue unit 111 adds the URLs of all forum list pages to be collected to the list page collection queue.
- A list page is a page that lists the titles, URLs (uniform resource locators), click counts, reply counts, and similar information of all posts in a forum; it does not include the specific content of the posts.
- For example, the list page of the financial hodgepodge channel of the Sohu forum has the following URL:
- A collection time interval is set for each forum list page to be collected, for example every 5 minutes. The collection time interval of each list page is monitored; when a list page reaches its set time interval, its URL is added to the list page collection queue.
- The refresh interval is dynamically adjusted according to the update frequency of the forum: the faster the forum updates, the shorter the refresh interval; the slower it updates, the longer the refresh interval. For example, if the interval is initially set to 5 minutes and the forum's update frequency increases during subsequent collection, the refresh interval is shortened to 3 minutes, and then to 1 minute or less.
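The dynamic adjustment described above can be sketched as follows. This is only a minimal illustration, not the patent's implementation: the interval ladder `LADDER_S`, the function name `next_interval`, the use of "updates seen in the last scan" as the measure of update frequency, and both thresholds are assumptions introduced here.

```python
# Hypothetical ladder of refresh intervals in seconds (1, 3, 5, 10, 30 minutes).
LADDER_S = (60, 180, 300, 600, 1800)


def next_interval(current_s: int, updates_seen: int,
                  busy_threshold: int = 10, quiet_threshold: int = 0) -> int:
    """Pick the next refresh interval for one list page.

    updates_seen is the number of new posts plus posts with new replies
    found in the scan that just finished (an assumed measure of update
    frequency). current_s must be one of the values in LADDER_S.
    """
    i = LADDER_S.index(current_s)  # raises ValueError for unknown intervals
    if updates_seen >= busy_threshold and i > 0:
        return LADDER_S[i - 1]     # forum is busy: refresh sooner
    if updates_seen <= quiet_threshold and i < len(LADDER_S) - 1:
        return LADDER_S[i + 1]     # forum is quiet: refresh later
    return current_s               # otherwise keep the current interval
```

With these assumed values, a busy list page moves from 5 minutes to 3 minutes and then to 1 minute, mirroring the example in the text; a fixed ladder is used only to keep the arithmetic exact.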
- The first obtaining unit 112 takes each list page URL out of the list page collection queue.
- The method for taking list page URLs out of the queue is as follows. The list page collection queue is scanned at regular intervals (the scan interval can be set by the user according to the specific application). If the queue is not empty, list page URLs are taken out in first-in, first-out order, provided that the friendly access condition of the website to which each URL belongs is satisfied; a URL is automatically deleted from the queue once it has been taken out.
- If a list page URL does not satisfy the friendly access condition of the website to which it belongs, it is skipped in this scan and the next list page URL is examined; the skipped URL stays in the queue for subsequent scans.
- The friendly access conditions of a website include a limit on the current number of accesses and a limit on the time interval between accesses.
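One scan of the queue under these friendly access conditions might look like the sketch below. The names `SiteState` and `scan_once`, the per-host bookkeeping, and the reading of "current number of accesses" as "requests currently in flight" are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class SiteState:
    max_active: int                        # assumed: limit on requests in flight
    min_gap_s: float                       # minimum seconds between two accesses
    active: int = 0                        # requests currently in flight
    last_access: float = float("-inf")     # time of the most recent access

    def allows(self, now: float) -> bool:
        return (self.active < self.max_active
                and now - self.last_access >= self.min_gap_s)


def scan_once(queue: deque, sites: dict, now: float) -> list:
    """Take URLs out in FIFO order; URLs whose site is not ready stay queued."""
    ready = []
    for _ in range(len(queue)):            # visit each queued URL exactly once
        url = queue.popleft()
        site = sites[urlsplit(url).netloc]
        if site.allows(now):
            site.active += 1
            site.last_access = now
            ready.append(url)              # taken out, so removed from the queue
        else:
            queue.append(url)              # skipped: kept for a later scan
    return ready
```

The caller is expected to decrement `active` when a fetch finishes. Because every URL is visited once per scan, skipped URLs keep their relative order.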
- For each list page URL taken out, the list page extracting unit 113 obtains the web page content corresponding to the URL and extracts the home page URL and the current reply count of each post from the content.
- Specifically, an HTTP request for the web page content corresponding to the URL is sent to the website to which the URL belongs, and the returned web page content is received. Extracting the home page URL and the current reply count of each post from the web page content is prior art and is not described here.
- The judging unit 114 judges, based on the post home page URL, whether the post exists in the collected post information table. If it exists, the post has already been collected, and the unit goes on to judge whether the current reply count of the post is greater than the current reply count recorded in the table. If it is greater, the post has new replies.
- In that case the last reply count and the current reply count of the post are updated in the collected post information table: the recorded current reply count replaces the value of the last reply count, and the post's current reply count then replaces the recorded current reply count.
- If the post does not exist in the collected post information table, it is a new post. Its home page URL and current reply count are added to the table, with the last reply count set to 0 and the current reply count set to the current number of replies.
- The collected post information table stores the home page URL of each collected post together with the last reply count and the current reply count of that post.
- Its structure is as follows:
- Identification information of the post home page URL is stored in the collected post information table.
- Whether a post home page URL exists in the table is determined by comparing identification information, which improves the efficiency of URL comparison.
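The judging logic and the collected post information table can be sketched as below. The text does not say how the identification information is computed, so the MD5 digest used here is an assumption, as are the names `PostRecord`, `url_id`, and `judge` and the use of an in-memory dictionary as the table.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PostRecord:
    last_replies: int      # reply count at the previous collection
    current_replies: int   # reply count at the latest collection


def url_id(url: str) -> str:
    """Assumed identification information: an MD5 digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def judge(table: dict, home_url: str, replies_now: int) -> str:
    """Classify a post seen on a list page and update the table."""
    key = url_id(home_url)
    record = table.get(key)
    if record is None:
        # New post: last reply count 0, current reply count as seen now.
        table[key] = PostRecord(last_replies=0, current_replies=replies_now)
        return "new_post"
    if replies_now > record.current_replies:
        # New replies: shift current to last, then store the new count.
        record.last_replies = record.current_replies
        record.current_replies = replies_now
        return "new_replies"
    return "unchanged"
```

Comparing fixed-length digests rather than full URLs is one way to get the faster comparison the text mentions.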
- The extracting device 12 extracts the main post and reply information from new posts, and the new reply information from posts with new replies.
- The extraction method used in the embodiment includes the following steps:
- The second queue unit 121 adds the home page URL of each new post and the URL of each post with new replies to the content page collection queue.
- If the post home page URL already exists in the content page collection queue, it is taken out, the current reply count recorded for the post is changed to the latest current reply count, and the URL is inserted back into the content page collection queue. If the post home page URL does not exist in the content page collection queue, it is added to the queue directly.
- If the page-turning mode of the forum to which the post belongs is the calculated page-turning mode,
- the home page URL of the post with new replies is added directly to the content page collection queue.
- If the page-turning mode of the forum to which the post belongs is the next-page mode, the page-turning URL information table of the post is searched, and the last page-turning URL in the table is added to the content page collection queue.
- The calculated page-turning mode is a page-turning mode in which the number of replies per page is fixed, as for posts in the international forum of the People's Network Power Community.
- The next-page mode is a page-turning mode in which the number of replies per page is not fixed, as for posts on Tianya.
- For example, http://www.tianya.cn/publicforum/content/free/1/1880805.shtml uses the next-page mode.
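The choice of which URL to enqueue for a post with new replies can be sketched as follows. `PagingMode`, `url_to_enqueue`, and the dictionary standing in for the page-turning URL information table are hypothetical names; falling back to the home page URL when the table has no entry is an assumption, since the text does not cover that case.

```python
from enum import Enum


class PagingMode(Enum):
    CALCULATED = "calculated"   # fixed number of replies per page
    NEXT_PAGE = "next_page"     # variable number of replies per page


def url_to_enqueue(home_url: str, mode: PagingMode,
                   last_page_url_by_post: dict) -> str:
    """Return the URL to add to the content page collection queue."""
    if mode is PagingMode.CALCULATED:
        # Page numbers can be computed later, so start from the home page.
        return home_url
    # Next-page mode: resume from the last page-turning URL on record.
    # Assumed fallback: use the home page when nothing is recorded yet.
    return last_page_url_by_post.get(home_url, home_url)
```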
- The scanning unit 122 periodically scans the content page collection queue.
- The scan interval can be set by the user according to the specific application.
- The second obtaining unit 123 takes each URL out of the content page collection queue. A URL is automatically removed from the queue once it has been taken out.
- The second obtaining unit 123 takes URLs out of the content page collection queue in the same way that the first obtaining unit 112 takes URLs out of the list page collection queue, so the details are not repeated here.
- The content page extracting unit 124 obtains the web page content corresponding to the URL taken out, extracts the main post and/or replies and/or page-turning URL from the content, and adds the page-turning URL to the content page collection queue.
- The specific method for extracting the main post and/or replies from the web page content is as follows:
- The main post and reply information are extracted from the web page content corresponding to the URL. First, it is determined whether the main post and the replies have the same style. If they do, items are extracted one by one with the same extraction method; the first item extracted is the main post and the remaining items are replies. If they do not, the main post is first extracted according to a predetermined rule, and then each reply is extracted. Whether the main post and the replies share a style is set manually; the predetermined rule is a manually set keyword or regular expression.
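A minimal sketch of this two-branch extraction with regular expressions is shown below. The function name `extract_posts` and the idea of passing `main_re=None` to signal a consistent style are assumptions; the rule for each real forum would be configured manually, as the text says.

```python
import re
from typing import List, Optional, Pattern, Tuple


def extract_posts(html: str, item_re: Pattern,
                  main_re: Optional[Pattern] = None) -> Tuple[Optional[str], List[str]]:
    """Return (main_post, replies) from one content page.

    Each pattern must capture the item text in group 1. When main_re is
    None the main post and replies are assumed to share one style, so the
    first match of item_re is taken as the main post.
    """
    items = [m.group(1).strip() for m in item_re.finditer(html)]
    if main_re is None:
        return (items[0] if items else None), items[1:]
    main = main_re.search(html)
    return (main.group(1).strip() if main else None), items
```

For instance, with a hypothetical page whose items all use `<div class="floor">…</div>`, the pattern `re.compile(r'<div class="floor">(.*?)</div>', re.S)` would yield the first block as the main post and the rest as replies. Regular expressions are fragile against markup changes, which is presumably why the rules are configured per forum.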
- For a post with new replies, the new-reply starting point S_ParseCmmt and the number of new replies C_ParseCmmt are determined according to the following formulas, and C_ParseCmmt new replies are extracted starting from S_ParseCmmt.
- S_ParseCmmt: [piecewise formula with two cases, one where N_PerPage contains the main post and one where it does not; the expressions are illegible in this text]
- C_ParseCmmt = R_CurNum - R_LastNum
- Here R_LastNum denotes the number of replies to the post at the last collection, R_CurNum denotes the current number of replies to the post, and N_PerPage denotes the number of replies per page of the post.
- An overlapped page is a page on which some replies have already been collected and the remaining replies are new. A page is judged to be overlapped when the page number corresponding to its URL is the same as the page number from which extraction should currently start.
- The page number from which extraction should currently start is calculated as follows: [piecewise formula with the same two cases, N_PerPage containing the main post or not; the expressions are illegible in this text]
- On a page that is not overlapped, all of the page content is new replies.
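Since the formulas are not legible in this text, the sketch below is a self-consistent derivation rather than the patent's own expressions. It assumes that every page holds exactly `n_per_page` items, that when the main post is counted it occupies the first slot of the first page, and that positions on a page are numbered from 1. The function name and parameter names are hypothetical.

```python
def new_reply_window(r_last: int, r_cur: int, n_per_page: int,
                     main_post_counted: bool, first_page_index: int = 0):
    """Return (page_no, position_on_page, count) for the new replies.

    r_last / r_cur are the reply counts at the last and current collection.
    """
    if r_cur <= r_last:
        raise ValueError("no new replies")
    # 0-based index of the first new reply among all items in the thread;
    # the main post takes index 0 when it is counted in the page size.
    idx = r_last + 1 if main_post_counted else r_last
    page_no = idx // n_per_page + first_page_index   # page holding that reply
    position = idx % n_per_page + 1                  # 1-based slot on the page
    return page_no, position, r_cur - r_last
```

For example, with 25 replies collected previously, 40 replies now, and 20 items per page including the main post, the first new reply (reply 26) is the 7th item on the second page, and 15 replies are new.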
- The specific method for extracting the page-turning URL from the web page content is as follows: (1) If the forum's page-turning mode is the calculated page-turning mode and the URL is the post home page URL, the following formulas are used to calculate the starting and ending page numbers of page turning, that is, the starting and ending page numbers of the new replies. If the URL is not the post home page URL, no page-turning URL is extracted.
- [formulas for the starting and ending page numbers, each with one case where N_PerPage contains the main post and one where it does not; the expressions are illegible in this text]
- The page-turning URLs are then spliced according to the configured page-turning rule, the starting page number of page turning, and the page-turning base.
- A page-turning URL is divided into three parts. The first and third parts are invariant and are denoted strBeforePage and strAfterPage;
- the second part is the variable part, denoted nPageUp.
- nPageUp = nPageNo × nPageUpBaseNum;
- strPostPageUrl = strBeforePage + nPageUp + strAfterPage;
- where nPageNo is the page number of the new replies. nFirstPostPageIndex is the page number of the first page of the post; in actual forums it is either 0 or 1, that is, page numbering starts from 0 and the first page is page 0, or numbering starts from 1 and the first page is page 1. nPageUp is the page value that appears in the URL to be spliced, that is, the value of the second part. nPageUpBaseNum is the page-turning base. strPostPageUrl is the spliced URL.
- For example, if the post already has 210 replies when it is collected for the first time, splicing yields 10 page-turning URLs, respectively:
- The starting page number is 0 and the page-turning base is 30. The page-turning parts are extracted according to the page link rule;
- the first part of the URL is:
- According to the above information, suppose the post already has 210 replies when it is collected for the first time; splicing then yields 6 page-turning URLs in total:
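The splicing rule can be sketched directly from the two formulas for nPageUp and strPostPageUrl. The function name `splice_page_urls` and the example URL parts are hypothetical; the start and end page numbers are taken as inputs because the formulas that produce them are not legible here.

```python
def splice_page_urls(str_before_page: str, str_after_page: str,
                     start_page_no: int, end_page_no: int,
                     page_up_base_num: int) -> list:
    """Build one page-turning URL per page number in [start, end]."""
    urls = []
    for n_page_no in range(start_page_no, end_page_no + 1):
        n_page_up = n_page_no * page_up_base_num          # variable middle part
        urls.append(f"{str_before_page}{n_page_up}{str_after_page}")
    return urls
```

Under one reading of the second example (page numbering from 0, base 30, home page already collected as page 0, pages 1 to 6 remaining), the call `splice_page_urls("http://forum.example/thread/42?start=", "&order=asc", 1, 6, 30)` produces six URLs whose middle parts run from 30 to 180.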
- The deduplication unit 125 deduplicates the page-turning URL before it is added to the content page collection queue. The specific processing is as follows:
- The page-turning URL information table is searched to find whether the post to which the page-turning URL belongs already exists. If it does not exist, the page-turning information of that post is inserted into the table, and the page-turning URL is added to the content page collection queue. If it exists, it is further determined whether the current page number is greater than the page number recorded for the post in the table. If it is greater, the page number of the post in the table is updated to the current page number, and the page-turning URL is added to the content page collection queue. If it is not greater, the table does not need to be updated, and the page-turning URL is discarded.
- The page-turning URL information table stores the post home page URL (or its identification information), the page number of the page currently turned to, the position of the last reply on the currently collected page, and the URL of the page currently turned to. Its header structure is shown in the following table:
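The table itself is not reproduced in this text. A minimal sketch of the deduplication step, using the four stored fields just listed, is given below; `PageTurnRecord`, `dedup_and_enqueue`, the dictionary used as the table, and the initial value 0 for the last reply position are assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PageTurnRecord:
    page_no: int          # page number currently turned to
    last_reply_pos: int   # position of the last reply collected on that page
    page_url: str         # URL of the page currently turned to


def dedup_and_enqueue(table: dict, queue: deque, post_id: str,
                      page_no: int, page_url: str) -> bool:
    """Enqueue page_url unless this page of the post was already seen."""
    record = table.get(post_id)
    if record is None:
        # First page-turning URL for this post (initial position 0 is assumed).
        table[post_id] = PageTurnRecord(page_no, 0, page_url)
    elif page_no > record.page_no:
        record.page_no = page_no       # advance to the newer page
        record.page_url = page_url     # last_reply_pos is left to the extractor
    else:
        return False                   # not newer: discard the URL
    queue.append(page_url)
    return True
```

Here `post_id` stands for the post home page URL or its identification information, whichever the table is keyed on.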
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Engineering & Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- Operations Research (AREA)
- General Business, Economics & Management (AREA)
- Tourism & Hospitality (AREA)
- Quality & Reliability (AREA)
- Marketing (AREA)
- Economics (AREA)
- Information Transfer Between Computers (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP11851417.3A EP2657854A4 (en) | 2010-12-22 | 2011-12-22 | METHOD AND SYSTEM FOR THE PROGRESSIVE COLLECTION OF FORUM RESPONSES |
US13/997,257 US9552435B2 (en) | 2010-12-22 | 2011-12-22 | Method and system for incremental collection of forum replies |
JP2013545030A JP5702474B2 (ja) | 2010-12-22 | 2011-12-22 | Method and system for collecting the increment of electronic bulletin board replies |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010618393.4 | 2010-12-22 | ||
CN201010618393.4A CN102567407B (zh) | 2010-12-22 | 2010-12-22 | Method and system for incremental collection of forum replies |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2012083870A1 WO2012083870A1 (zh) | 2012-06-28 |
WO2012083870A9 WO2012083870A9 (zh) | 2013-08-29 |
Family
ID=46313183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2011/084457 WO2012083870A1 (zh) | Method and system for incremental collection of forum replies | 2010-12-22 | 2011-12-22 |
Country Status (5)
Country | Link |
---|---|
US (1) | US9552435B2 (zh) |
EP (1) | EP2657854A4 (zh) |
JP (1) | JP5702474B2 (zh) |
CN (1) | CN102567407B (zh) |
WO (1) | WO2012083870A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9552435B2 (en) | 2010-12-22 | 2017-01-24 | Peking University Founder Group Co., Ltd. | Method and system for incremental collection of forum replies |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593344B (zh) * | 2012-08-13 | 2016-09-21 | 北大方正集团有限公司 | Information collection method and device |
CN103631906A (zh) * | 2013-11-25 | 2014-03-12 | 北京奇虎科技有限公司 | Method and device for identifying the page-number identifier in a web page URL |
CN104731824B (zh) * | 2013-12-24 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Method and device for displaying pictures |
US10061725B2 (en) | 2014-04-03 | 2018-08-28 | Strato Scale Ltd. | Scanning memory for de-duplication using RDMA |
CN104391917A (zh) * | 2014-11-19 | 2015-03-04 | 四川长虹电器股份有限公司 | Method for incrementally crawling web page content |
US9912748B2 (en) | 2015-01-12 | 2018-03-06 | Strato Scale Ltd. | Synchronization of snapshots in a distributed storage system |
WO2016135570A1 (en) * | 2015-02-26 | 2016-09-01 | Strato Scale Ltd. | Using access-frequency hierarchy for selection of eviction destination |
US10051154B2 (en) * | 2016-01-13 | 2018-08-14 | Canon Kabushiki Kaisha | Information processing apparatus, control method in information processing apparatus, and image processing apparatus |
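CN106372134B (zh) * | 2016-08-26 | 2019-08-23 | 四川九洲电器集团有限责任公司 | Real-time data processing method and system for the Internet of Vehicles |
CN108664303B (zh) * | 2018-04-28 | 2023-06-30 | 北京小米移动软件有限公司 | Method and device for displaying web page content |
CN109741200A (zh) * | 2018-12-29 | 2019-05-10 | 深圳英飞拓智能技术有限公司 | Forum hot-post archiving management method, device, computer equipment, and storage medium |
CN112650910B (zh) * | 2020-12-30 | 2024-03-12 | 北京百度网讯科技有限公司 | Method, apparatus, device, and storage medium for determining website update information |
CN114417200B (zh) * | 2022-01-04 | 2023-04-14 | 马上消费金融股份有限公司 | Network data collection method, apparatus, and electronic device |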
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
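CN101193038A (zh) * | 2007-06-08 | 2008-06-04 | 腾讯科技(深圳)有限公司 | Method and system for replying to topic posts, viewing reply posts, and interacting with topic posts |
CN101335639A (zh) * | 2007-06-25 | 2008-12-31 | 文贵华 | New online survey method based on web forums |
CN101727486A (zh) * | 2009-12-04 | 2010-06-09 | 中国人民解放军信息工程大学 | Web forum information extraction system |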
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08330991A (ja) * | 1995-05-30 | 1996-12-13 | Matsushita Electric Ind Co Ltd | Data broadcast receiving device |
US20030084035A1 (en) * | 2001-07-23 | 2003-05-01 | Emerick Charles L. | Integrated search and information discovery system |
JP2004246785A (ja) * | 2003-02-17 | 2004-09-02 | Nippon Telegr & Teleph Corp <Ntt> | Information collection device, information collection method, program, and recording medium |
US20040225644A1 (en) * | 2003-05-09 | 2004-11-11 | International Business Machines Corporation | Method and apparatus for search engine World Wide Web crawling |
US7725452B1 (en) * | 2003-07-03 | 2010-05-25 | Google Inc. | Scheduler for search engine crawler |
US7310632B2 (en) * | 2004-02-12 | 2007-12-18 | Microsoft Corporation | Decision-theoretic web-crawling and predicting web-page change |
US20070106663A1 (en) * | 2005-02-01 | 2007-05-10 | Outland Research, Llc | Methods and apparatus for using user personality type to improve the organization of documents retrieved in response to a search query |
US7617193B2 (en) * | 2005-03-28 | 2009-11-10 | Elan Bitan | Interactive user-controlled relevance ranking retrieved information in an information search system |
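CN101231640B (zh) * | 2007-01-22 | 2010-09-22 | 北大方正集团有限公司 | Method and system for automatically calculating topic evolution trends on the Internet |
JP2009230663A (ja) * | 2008-03-25 | 2009-10-08 | Kddi Corp | Web page anomaly detection device, program, and recording medium |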
US8010544B2 (en) * | 2008-06-06 | 2011-08-30 | Yahoo! Inc. | Inverted indices in information extraction to improve records extracted per annotation |
US20100205168A1 (en) * | 2009-02-10 | 2010-08-12 | Microsoft Corporation | Thread-Based Incremental Web Forum Crawling |
US8620849B2 (en) * | 2010-03-10 | 2013-12-31 | Lockheed Martin Corporation | Systems and methods for facilitating open source intelligence gathering |
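CN101819585A (zh) * | 2010-03-29 | 2010-09-01 | 哈尔滨工程大学 | Device and method for constructing a forum event propagation graph |
CN102567407B (zh) | 2010-12-22 | 2014-07-16 | 北大方正集团有限公司 | Method and system for incremental collection of forum replies |
CN102270239A (zh) * | 2011-08-15 | 2011-12-07 | 哈尔滨工业大学 | Evolution analysis method for association networks in forums |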
-
2010
- 2010-12-22 CN CN201010618393.4A patent/CN102567407B/zh not_active Expired - Fee Related
-
2011
- 2011-12-22 JP JP2013545030A patent/JP5702474B2/ja not_active Expired - Fee Related
- 2011-12-22 WO PCT/CN2011/084457 patent/WO2012083870A1/zh active Application Filing
- 2011-12-22 US US13/997,257 patent/US9552435B2/en active Active
- 2011-12-22 EP EP11851417.3A patent/EP2657854A4/en not_active Ceased
Non-Patent Citations (1)
Title |
---|
See also references of EP2657854A4 |
Also Published As
Publication number | Publication date |
---|---|
JP2014506355A (ja) | 2014-03-13 |
CN102567407B (zh) | 2014-07-16 |
US20150127644A1 (en) | 2015-05-07 |
US9552435B2 (en) | 2017-01-24 |
EP2657854A1 (en) | 2013-10-30 |
CN102567407A (zh) | 2012-07-11 |
JP5702474B2 (ja) | 2015-04-15 |
WO2012083870A9 (zh) | 2013-08-29 |
EP2657854A4 (en) | 2014-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 11851417; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2013545030; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| REEP | Request for entry into the European phase | Ref document number: 2011851417; Country of ref document: EP |
| WWE | WIPO information: entry into national phase | Ref document number: 2011851417; Country of ref document: EP |
| WWE | WIPO information: entry into national phase | Ref document number: 13997257; Country of ref document: US |