CN109033195A - The acquisition methods of webpage information obtain equipment and computer-readable medium - Google Patents

Webpage information acquisition method, acquisition equipment and computer readable medium

Info

Publication number
CN109033195A
Authority
CN
China
Prior art keywords
url
crawled
website
webpage
web crawler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810688855.6A
Other languages
Chinese (zh)
Inventor
孟祥祥
陈冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sheng Electronic Payment Services Ltd
Original Assignee
Shanghai Sheng Electronic Payment Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sheng Electronic Payment Services Ltd filed Critical Shanghai Sheng Electronic Payment Services Ltd
Priority to CN201810688855.6A priority Critical patent/CN109033195A/en
Publication of CN109033195A publication Critical patent/CN109033195A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The purpose of the application is to provide a method, a device and a computer-readable medium for acquiring webpage information. Before webpages are collected, a web crawler queue containing the uniform resource locators (URLs) to be crawled is placed in an in-memory database. This avoids the problem that URLs held in ordinary memory are lost when the web crawler system has to be restarted: after a restart, the URLs to be crawled can be read quickly from the web crawler queue in the in-memory database, so the web crawler system continues to run normally. Webpage content information is extracted from each acquired webpage with a content analysis tool, which cleans the webpage content, and the extracted webpage content information is then stored, which improves the efficiency and reliability of acquiring webpage content information.

Description

Webpage information acquisition method, acquisition equipment and computer readable medium
Technical Field
The present application relates to the field of computers, and in particular, to a method, an apparatus, and a computer readable medium for acquiring webpage information.
Background
Currently, when crawling web page information, a web crawler system usually stores a Uniform Resource Locator (URL) to be crawled in a memory. When the web crawler system needs to be restarted, the URL to be crawled stored in the memory will disappear. When the web crawler system wants to continue to crawl the web page information after being restarted, the URL to be crawled needs to be found again and loaded into the memory, so that the web page information acquisition efficiency is low.
Disclosure of Invention
An object of the present application is to provide a method, an apparatus and a computer readable medium for acquiring webpage information.
According to an aspect of the present application, there is provided a method for acquiring webpage information, the method including: putting a web crawler queue containing a URL to be crawled into an in-memory database; taking out the URL to be crawled from the web crawler queue in the in-memory database; sending an acquisition request to a website corresponding to the URL, wherein the acquisition request is used for requesting a webpage corresponding to the URL to be crawled; if the webpage is obtained from the website, extracting webpage content information from the webpage by adopting a content analysis tool; and storing the webpage content information.
Further, in the above method, before the placing the web crawler queue including the URL to be crawled into the in-memory database, the method further includes: sequencing the URLs to be crawled according to a preset priority rule; and placing the sequenced URLs to be crawled into the web crawler queue.
Further, in the above method, after sending the acquisition request to the website corresponding to the URL, the method further includes: and if the webpage is not acquired from the website, the URL to be crawled is put back into the web crawler queue in the memory database.
Further, in the above method, the step of placing the URL to be crawled back into the web crawler queue in the in-memory database includes: if the priority of the URL to be crawled is larger than or equal to a preset threshold value, the URL to be crawled is placed back to the head of the web crawler queue; or if the priority of the URL to be crawled is smaller than a preset threshold value, the URL to be crawled is placed back to the tail position of the web crawler queue.
Further, in the above method, after the URL to be crawled is fetched from the web crawler queue in the in-memory database, the method further includes: starting a thread pool, and putting the URL to be crawled into the thread pool; the sending of the acquisition request to the website corresponding to the URL includes: and sending the acquisition request to the website through the thread pool.
Further, in the above method, the sending an acquisition request to the website corresponding to the URL includes: extracting an IP address from a preset proxy Internet Protocol (IP) queue; and sending the acquisition request to the website through the extracted IP address.
Further, in the above method, before or after sending the acquisition request to the website corresponding to the URL, the method further includes: acquiring a verification code graph from the website; identifying the verification code from the verification code graph in a text identification mode, and sending the verification code to the website.
Further, in the above method, the obtaining request includes a cookie when the website is logged in, and before the obtaining request is sent to the website corresponding to the URL, the method further includes: and acquiring the cookie from a browser used for logging in the website.
Further, in the foregoing method, the storing the web page content information includes: and packaging the webpage content information into a JSON format and then storing.
According to another aspect of the present application, there is also provided a device for acquiring web page information, the device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform any of the methods described above.
According to another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the method of any of the above.
Compared with the prior art, the method and the device have the advantages that the web crawler queue containing the Uniform Resource Locators (URLs) to be crawled is placed in the memory database before the web pages are collected, so that the problem that the URLs stored in the memory disappear when the web crawler system needs to be restarted is solved, the URLs to be crawled can be rapidly read from the web crawler queue of the memory database after the web crawler system is restarted, and normal execution of the web crawler system is guaranteed; the webpage content information is extracted from the acquired webpage by adopting a content analysis tool, the webpage content is cleaned, and finally the webpage content information is stored, so that the webpage content information is put in storage, and the acquisition efficiency and the reliability of the webpage content information are improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a flowchart illustrating a method for acquiring web page information according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for obtaining a web page via a proxy IP address according to an embodiment of the present application;
FIG. 3 illustrates a flow diagram for retrieving a web page via cookie in accordance with an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for acquiring web page information according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for acquiring webpage information according to an embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
As shown in fig. 1, the present application provides a method for acquiring web page information, which may be applied to a network device, for example, executed by a web crawler through a network device. Among them, web crawlers are also called web spiders or web robots, or, among FOAF (Friend-of-a-Friend, which is an XML/RDF vocabulary) communities, web crawlers are more often called web chasers. A web crawler may refer to a program or script that automatically crawls the world wide web according to certain rules. The web crawler can grab the target according to the task URL in the web crawler queue, access the corresponding web page and the related link, and acquire the required information. As shown in fig. 1, the method includes:
step S101, a web crawler queue containing the URL to be crawled is placed into an in-memory database.
Here, a URL is a compact representation of the location of a resource available on the internet and of the method used to access it; it is the standard address of a resource on the internet. Each file on the internet has a unique URL, which indicates where the file is located and how the browser should handle it. Generally, a basic URL consists of a scheme (or protocol), a server name or Internet Protocol (IP) address (corresponding to a website), a path, and a file name.
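As a small illustration of the URL components listed above, the following sketch splits a URL with Python's standard library. The function name `describe_url` and the example address are assumptions made for illustration only; they do not come from the application.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the components named in the text."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,                         # protocol, e.g. http
        "host": parts.hostname,                         # server name or IP address
        "path": parts.path,                             # path on the server
        "file_name": PurePosixPath(parts.path).name,    # last path segment
    }


# Hypothetical example address.
info = describe_url("http://www.example.com/news/2018/index.html")
```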
The web crawler queue may also be referred to as a web crawler task queue.
The memory database refers to a database in which data is directly stored in a memory for operation. Compared with a magnetic disk, the data read-write speed of the memory is higher by several orders of magnitude, and the application performance can be greatly improved by storing data in the memory compared with accessing from the magnetic disk.
Step S102, the URL to be crawled is taken out from the web crawler queue in the memory database.
Here, one or more URLs to be crawled may be fetched from the web crawler queue in the in-memory database at a time according to the actual data processing capacity of the device.
Step S103, sending an acquisition request to a website corresponding to the URL, wherein the acquisition request is used for requesting a webpage corresponding to the URL to be crawled.
Here, one URL contains address information of a corresponding website and web page. The web crawler may send a request for obtaining a corresponding web page in the website to the corresponding website according to address information in one or more URLs to be crawled.
And step S104, if the webpage is acquired from the website, extracting webpage content information from the acquired webpage by adopting a content analysis tool.
In this case, when the web page requested by the acquisition request is successfully acquired, the content analysis tool may be used to extract all or part of the web page content information from the acquired web page according to a preset rule.
Step S105, storing the webpage content information.
After all or part of the web page content information is extracted from the acquired web page, the web page content information can be stored, so that it can subsequently be counted or analyzed further.
In the embodiment, before the web page is collected, the web crawler queue containing the Uniform Resource Locator (URL) to be crawled is put into the memory database, so that the problem that the URL stored in the memory disappears when the web crawler system needs to be restarted is solved, the URL to be crawled can be quickly read from the web crawler queue of the memory database after the web crawler system is restarted, and the normal execution of the web crawler system is ensured; the webpage content information is extracted from the acquired webpage by adopting a content analysis tool, the webpage content is cleaned, and finally the webpage content information is stored, so that the webpage content information is put in storage, and the acquisition efficiency and the reliability of the webpage content information are improved.
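Steps S101 to S105 can be pictured with the minimal sketch below. It is written under several assumptions: a `collections.deque` stands in for the queue held in the in-memory database, and `fetch`, `extract` and the `store` list are hypothetical placeholders for the acquisition request, the content analysis tool and the storage back end.

```python
from collections import deque
from typing import Callable, Optional


def crawl(queue: deque,
          fetch: Callable[[str], Optional[str]],
          extract: Callable[[str], dict],
          store: list) -> None:
    """Drain the queue once, storing content for every page that was fetched."""
    while queue:
        url = queue.popleft()      # S102: take a URL to be crawled from the queue
        page = fetch(url)          # S103: request the webpage for that URL
        if page is None:
            continue               # page not obtained; this sketch simply skips it
        content = extract(page)    # S104: extract webpage content information
        store.append({"url": url, "content": content})  # S105: store it
```

A caller would first fill the queue (S101) and then pass in its own fetching and parsing functions.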
The application can be implemented based on a webmagic framework or other forms of framework.
In an embodiment of the method for acquiring web page information, before the step S101 of placing the web crawler queue including the URL to be crawled into the in-memory database, the method further includes: sequencing URLs to be crawled according to a preset priority rule; and placing the sequenced URLs to be crawled into the web crawler queue.
Here, the preset priority rule may be a website category of the URL or an importance degree of the URL, and the URLs to be crawled may be prioritized according to the preset priority rule, for example, the URL with a higher priority may be ranked in the front, and correspondingly, the URL with a lower priority may be ranked in the back, and then, the ranked URLs are placed in the web crawler queue. Therefore, the URLs with higher priority in the front can be taken out preferentially to crawl the webpages.
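The ordering step can be sketched as follows. The `Task` type, the numeric `priority` field and the convention that a larger number means a more important URL are assumptions for illustration; the application only requires some preset priority rule.

```python
from collections import deque
from typing import Iterable, NamedTuple


class Task(NamedTuple):
    url: str
    priority: int  # assumption: a larger value means a more important URL


def build_queue(tasks: Iterable[Task]) -> deque:
    """Return a queue with higher-priority URLs nearer the head."""
    # sorted() is stable, so URLs of equal priority keep their original order.
    ordered = sorted(tasks, key=lambda task: task.priority, reverse=True)
    return deque(ordered)
```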
In an embodiment of the method for acquiring web page information of the present application, after the step S103 sends the acquisition request to the website corresponding to the URL, the method further includes: if the webpage requested by the acquisition request is not acquired from the corresponding website, the URL to be crawled is put back into the web crawler queue in the in-memory database.
Here, because the web crawler system may be unstable, crawling may fail for some URLs. If the requested web page is not obtained from the corresponding website, the URL whose crawling failed may be stored back in the web crawler queue in the in-memory database. After the web crawler system has stabilized, the URL is taken out of the web crawler queue in the in-memory database again, and the web page is crawled again according to that URL.
In an embodiment of the method for acquiring web page information, the step of placing the URL to be crawled back into the web crawler queue in the memory database includes: if the priority of the URL to be crawled to be put back is larger than or equal to a preset threshold value, the URL to be crawled is put back to the head position in the web crawler queue; or if the priority of the URL to be crawled to be put back is smaller than a preset threshold value, the URL to be crawled is put back to the tail position in the web crawler queue.
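The head-or-tail rule can be sketched as below, again with a `deque` standing in for the queue in the in-memory database. The threshold value of 5 and the function name `requeue` are hypothetical.

```python
from collections import deque

PRIORITY_THRESHOLD = 5  # hypothetical preset threshold


def requeue(queue: deque, url: str, priority: int,
            threshold: int = PRIORITY_THRESHOLD) -> None:
    """Put a URL whose crawl failed back into the queue."""
    if priority >= threshold:
        queue.appendleft(url)  # head: it will be the next URL taken out
    else:
        queue.append(url)      # tail: retried after everything already queued
```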
Here, in connection with the previous embodiment, when the requested webpage is not acquired from the corresponding website and the URL to be crawled needs to be placed back into the web crawler queue in the in-memory database, it may first be determined whether the priority of that URL is greater than or equal to a preset threshold. If so, the URL is placed back at the head of the web crawler queue, so that the next URL taken from the head of the queue is still this URL; a URL with a higher priority can therefore still be crawled promptly after a single crawling failure. Otherwise, if the priority of the URL is less than the preset threshold, the URL is placed back at the tail of the web crawler queue.
In an embodiment of the method for acquiring web page information of the present application, in step S102, after the URL to be crawled is taken out of the web crawler queue in the in-memory database, the method further includes: starting a thread pool, and putting the URL to be crawled into the thread pool; and step S103, sending an acquisition request to the website corresponding to the URL, includes: sending the acquisition request to the website through the thread pool.
Here, a thread pool is a form of multi-threaded processing in which tasks are added to a queue and then automatically started after a thread is created. In this embodiment, the web crawler sends the acquisition request to the corresponding website through the thread pool, so that parallel acquisition of multiple webpages corresponding to URLs to be crawled can be realized, and the crawling efficiency is improved.
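A minimal sketch of fetching several URLs in parallel through a thread pool, using Python's `concurrent.futures`. The `fetch` callable and the default of four worker threads are assumptions; any function that takes a URL and returns the page would fit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Sequence


def fetch_all(urls: Sequence[str],
              fetch: Callable[[str], Optional[str]],
              max_workers: int = 4) -> Dict[str, Optional[str]]:
    """Run fetch(url) for every URL on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Because fetching is dominated by waiting on the network, threads can overlap that waiting even though each individual request is no faster.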
In an embodiment of the method for acquiring web page information, in step S103, sending the acquisition request to the website corresponding to the URL includes: extracting an IP address from a preset proxy Internet Protocol (IP) queue; and sending the acquisition request to the website through the extracted IP address.
In this case, the IP address may be extracted from the preset proxy IP queue by random extraction or by sequential cyclic extraction, so that successive acquisition requests are, as far as possible, sent through different IP addresses.
Some websites limit the number of acquisition requests that may be sent from the same IP address, that is, they are IP-limited websites. When such a website is encountered, a different IP address can be randomly extracted from the preset proxy Internet Protocol (IP) queue for each request, and the web crawler sends each acquisition request to the corresponding website through the randomly extracted IP address. This avoids the restriction that the website places on a single IP address and improves the success rate of webpage acquisition.
Specifically, as shown in fig. 2, step S201 determines whether the website needs to be accessed through a proxy IP address. If so, step S202 randomly extracts an IP address from the preset configuration file queue, and in step S203 the web crawler sends the acquisition request to the corresponding website through the randomly extracted IP address; otherwise, in step S203 the web crawler sends the acquisition request to the corresponding website directly through its current IP address.
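The two extraction modes mentioned above (random and sequential cyclic) and the decision of fig. 2 can be sketched as follows. The class name `ProxyPool`, the function `choose_address` and the addresses, which come from a range reserved for documentation, are all assumptions for illustration.

```python
import itertools
import random
from typing import Optional, Sequence


class ProxyPool:
    """A preset queue of proxy IP addresses."""

    def __init__(self, addresses: Sequence[str], seed: Optional[int] = None):
        if not addresses:
            raise ValueError("at least one proxy address is required")
        self._addresses = list(addresses)
        self._cycle = itertools.cycle(self._addresses)
        self._rng = random.Random(seed)

    def next_sequential(self) -> str:
        """Sequential cyclic extraction: walk the queue and wrap around."""
        return next(self._cycle)

    def next_random(self) -> str:
        """Random extraction: any address in the queue may be returned."""
        return self._rng.choice(self._addresses)


def choose_address(pool: ProxyPool, needs_proxy: bool) -> Optional[str]:
    """Return a proxy address if the site needs one, else None (send directly)."""
    return pool.next_random() if needs_proxy else None
```

With two or more addresses, sequential cyclic extraction never returns the same address twice in a row, whereas random extraction occasionally can.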
In an embodiment of the method for acquiring web page information of the present application, before or after the step S103, sending the acquisition request to the website corresponding to the URL, the method further includes: acquiring a verification code graph from the corresponding website; identifying the verification code from the verification code graph in a text identification mode, and sending the verification code to the corresponding website.
Some websites return the webpage only after a graphic verification code has been entered correctly. For such websites, before or after the web crawler sends the acquisition request to the corresponding website, the verification code can be identified from the verification code graph in a text identification mode and sent to the corresponding website. The verification code graph is thus handled automatically, which improves the success rate and the efficiency of webpage acquisition.
In an embodiment of the method for acquiring the webpage information, the text recognition mode is a Tesseract recognition mode.
Here, the text recognition method may be various OCR recognition methods, for example, Tesseract recognition method.
Tesseract is an open source OCR (optical character recognition) engine that was developed at HP Laboratories and is maintained by Google. It can be trained continuously, which steadily improves its ability to convert images into text; a team with more demanding requirements can use it as a basis for developing an OCR engine that meets its own needs.
According to the embodiment, the verification code graph is identified in a Tesseract mode, so that the identification accuracy of the verification code graph can be improved, and the success rate and efficiency of webpage acquisition are improved.
In an embodiment of the method for acquiring web page information, the acquiring request includes the cookie when the website logs in, and before the acquiring request is sent to the website corresponding to the URL, the method further includes: and acquiring the cookie from a browser used for logging in the website.
Here, cookie refers to data (usually encrypted) stored on the user's local terminal by some websites for identifying the user's identity and performing session tracking.
Some websites return a webpage only after user identity information has been entered and a manual login has been completed. For such a website, after the user has successfully logged in through a browser for the first time, the cookie that the website issued for that manual login can be acquired from the browser. Each time the website is subsequently logged in to or a webpage is visited, an acquisition request containing the cookie can be sent to the website, which avoids a cumbersome manual login process and improves the success rate and the efficiency of webpage acquisition.
Specifically, as shown in fig. 3, step S301 may first determine whether the website requires a login. If so, step S202 obtains from the browser the cookie that corresponds to the first manual login to the website, and the web crawler then sends an acquisition request containing the cookie to the website in step S203 and crawls the webpage; otherwise, the web crawler directly sends an acquisition request that does not contain the cookie to the website in step S203 and crawls the webpage.
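The two branches described above, a request with the cookie and a request without it, can be sketched with Python's `urllib.request`. The sketch only builds the request object and does not send it. The cookie names `sessionid` and `theme`, their values and the example address are hypothetical, and how the cookie is exported from the browser is outside the scope of the sketch.

```python
from typing import Mapping, Optional
from urllib.request import Request


def build_request(url: str,
                  cookies: Optional[Mapping[str, str]] = None) -> Request:
    """Build an acquisition request, adding a Cookie header when cookies are given."""
    headers = {}
    if cookies:
        # A Cookie header carries name=value pairs separated by "; ".
        headers["Cookie"] = "; ".join(
            f"{name}={value}" for name, value in cookies.items()
        )
    return Request(url, headers=headers)


# Hypothetical cookie values copied from a logged-in browser session.
request = build_request(
    "http://www.example.com/account",
    {"sessionid": "abc123", "theme": "dark"},
)
```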
In an embodiment of the method for acquiring web page information of the present application, in step S105, storing the web page content information includes: and packaging the webpage content information into a JSON format and then storing.
Here, JSON (JavaScript Object Notation) is a lightweight data exchange format. It is based on a subset of ECMAScript (the JavaScript specification standardized by Ecma, the European Computer Manufacturers Association) and stores and represents data in a text format that is completely independent of any programming language. Its compact and clear hierarchical structure makes JSON an ideal data exchange language: it is easy for people to read and write, easy for machines to parse and generate, and effective in improving network transmission efficiency.
In this embodiment, the webpage content information is packaged into a JSON format and then stored, so that the speed and success rate of querying subsequent webpage content information can be increased.
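Packaging extracted content as JSON might look like the sketch below. The field names `url`, `title` and `body` are assumptions; the application does not specify which fields the webpage content information contains.

```python
import json


def pack(url: str, title: str, body: str) -> str:
    """Serialize extracted webpage content information as a JSON string."""
    record = {"url": url, "title": title, "body": body}
    # ensure_ascii=False keeps non-ASCII text (for example Chinese) readable.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```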
In an embodiment of the method for acquiring web page information of the present application, in step S105, storing the web page content information includes: and storing the webpage content information into an ElasticSearch cluster.
Here, ElasticSearch is a Lucene-based search server. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface. ElasticSearch is developed in Java and released as open source under the terms of the Apache license. It is designed for cloud computing, supports real-time search, is stable, reliable and fast, is convenient to install and use, and can index data as JSON over HTTP.
According to the embodiment, the webpage content information is stored in the ElasticSearch cluster, so that the speed and the success rate of inquiring the subsequent webpage content information can be improved.
In an embodiment of the method for acquiring webpage information, the memory database is a redis memory database.
Here, redis is a key-value storage system that supports a relatively large number of value types, including string, list, set, zset (sorted set) and hash. These data types support push/pop, add/remove, intersection, union and difference, and other richer operations, all of which are atomic. On this basis, redis supports various sorting modes, and data is cached in memory to ensure efficiency.
In this embodiment, a web crawler queue including a Uniform Resource Locator (URL) to be crawled may be placed in a redis memory database; in addition, if the web crawler does not acquire the requested web page from the corresponding website, the URL to be crawled is put back into the web crawler queue in the redis memory database, so that the web crawler can conveniently crawl again next time.
In this embodiment, a web crawler queue including a Uniform Resource Locator (URL) to be crawled is stored in the redis memory database, so that the efficiency of the web crawler system for acquiring the URL to be crawled can be further improved.
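One way to picture a crawler queue kept in a redis list is the sketch below. `CrawlQueue` relies only on the list commands `lpush`, `rpush` and `lpop`, which the redis-py client exposes as methods of the same names. The key name `crawler:queue` is hypothetical, and `InMemoryClient` is a stand-in written for this sketch so that it runs without a redis server; unlike redis, it is not shared between processes.

```python
from typing import Dict, List, Optional, Protocol


class ListClient(Protocol):
    """The subset of list commands this sketch needs from a redis client."""
    def lpush(self, name: str, *values: str) -> int: ...
    def rpush(self, name: str, *values: str) -> int: ...
    def lpop(self, name: str) -> Optional[str]: ...


class CrawlQueue:
    def __init__(self, client: ListClient, key: str = "crawler:queue"):
        self._client = client
        self._key = key  # hypothetical key name

    def push_tail(self, url: str) -> None:
        self._client.rpush(self._key, url)   # append at the tail of the list

    def push_head(self, url: str) -> None:
        self._client.lpush(self._key, url)   # insert at the head of the list

    def pop(self) -> Optional[str]:
        return self._client.lpop(self._key)  # take from the head; None if empty


class InMemoryClient:
    """Stand-in for a redis client, for demonstration only."""
    def __init__(self) -> None:
        self._lists: Dict[str, List[str]] = {}

    def lpush(self, name: str, *values: str) -> int:
        items = self._lists.setdefault(name, [])
        for value in values:
            items.insert(0, value)
        return len(items)

    def rpush(self, name: str, *values: str) -> int:
        items = self._lists.setdefault(name, [])
        items.extend(values)
        return len(items)

    def lpop(self, name: str) -> Optional[str]:
        items = self._lists.get(name)
        return items.pop(0) if items else None
```

With a real redis-py client, values come back as bytes unless the client is created with `decode_responses=True`.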
In an embodiment of the method for acquiring webpage information, the content analysis tool is a jsoup analysis tool.
Here, jsoup is a Java HTML parser that can directly parse a URL address or HTML text content. It provides a very convenient set of APIs for fetching and manipulating data through DOM, CSS and jQuery-like methods.
According to the embodiment, the webpage content information in the webpage is analyzed through the jsoup analysis tool, so that the extraction accuracy and efficiency of the webpage content information can be further improved.
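To give a feel for content extraction, the sketch below pulls the page title and link targets out of HTML text. It uses Python's standard `html.parser` module rather than the Java parser named in this embodiment, so it illustrates the idea and not that tool's API. The class name `PageExtractor` and the choice of extracted fields are assumptions.

```python
from html.parser import HTMLParser
from typing import List


class PageExtractor(HTMLParser):
    """Collect the <title> text and the href of every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:                      # ignore anchors without a target
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html: str) -> dict:
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}
```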
As shown in fig. 4, a specific embodiment of the method for acquiring webpage information of the present application includes the following steps:
step S401: storing the URLs to be crawled, together with any URLs newly added by the web crawler during crawling, in a redis in-memory database, so that the crawler system can run normally after being restarted;
step S402: when the web crawler collects webpages, creating a new spider (crawler) thread pool, reading the URLs to be crawled from redis, and crawling the corresponding webpages on the corresponding websites according to the URLs read;
step S403: judging whether the corresponding webpage was crawled successfully; if the crawling failed, returning to step S401 and storing the failed URL in redis so that it can be retried;
step S404: if the crawling succeeded, extracting webpage content information from the acquired webpage by adopting a content analysis tool, and packaging the webpage content information into a JSON format;
step S405: storing the webpage content information packaged in JSON format into an ElasticSearch cluster.
According to another aspect of the present application, there is also provided a device for acquiring webpage information, which may perform the methods shown in fig. 1 to 4. The apparatus may be implemented by software, hardware or a combination of software and hardware, for example, the apparatus may include modules or units for performing the steps of the methods shown in fig. 1 to 4.
For example, as shown in fig. 5, the apparatus includes:
a URL storage 501, configured to put a web crawler queue including a Uniform Resource Locator (URL) to be crawled into an in-memory database;
here, a Uniform Resource Locator (URL) is a compact representation of the location of a resource available on the internet and of the method used to access it, and is the standard address of a resource on the internet. Each file on the internet has a unique URL, which indicates where the file is located and how the browser should handle it; a basic URL consists of a scheme (or protocol), a server name or IP address (corresponding to the website), a path, and a file name;
the web crawler queue may also be referred to as a web crawler task queue. A web crawler (also called a web spider or a web robot, and within the FOAF community more often called a web chaser) is a program or script that automatically captures web information according to certain rules; the web crawler can capture a target according to a task URL in the task queue, access the corresponding web page and related links, and acquire the required information;
the in-memory database is a database that operates on data held directly in memory. Compared with a magnetic disk, the data read-write speed of memory is higher by several orders of magnitude, so storing data in memory can greatly improve application performance compared with accessing it from disk;
a URL extracting means 502, configured to extract the URL to be crawled from the web crawler queue in the in-memory database;
here, one or more URLs to be crawled may be fetched from the web crawler queue in the in-memory database at a time according to the device data processing capability;
the web crawler device 503 is configured to send an acquisition request to a corresponding website, where the acquisition request is used to request a web page corresponding to the URL to be crawled;
each URL comprises address information of a corresponding website and a corresponding webpage, and the web crawler sends a request for correspondingly acquiring the webpage in the website to the corresponding website according to the address information in one or more URLs to be crawled;
a content analysis device 504, configured to, if the web page requested by the acquisition request is acquired from the corresponding website, extract web page content information from the acquired web page by using a content analysis tool;
here, when the web page requested by the acquisition request is successfully acquired, a content analysis tool can be used to extract all or part of the web page content information from the acquired web page according to a preset rule;
a storage device 505 for storing the web page content information.
After all or part of the web page content information is extracted from the acquired web page, the web page content information can be stored, so that it can subsequently be counted or further analyzed.
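The cooperation of devices 502 to 505 can be illustrated with a minimal sketch. It is not the implementation of the application: the parser class `TitleAndTextParser`, the function `crawl_once`, the injected `fetch` callable, and the list used as storage are all hypothetical names chosen for illustration, and the "preset rule" is assumed here to be "keep the title and the visible text". The record is serialized as JSON, as in claim 9.

```python
import json
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TitleAndTextParser(HTMLParser):
    """Hypothetical content analysis tool: keeps the <title> and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def crawl_once(queue, fetch, store):
    """Take one URL from the queue, request the page, extract content, store it."""
    url = queue.popleft()              # URL extracting means 502
    html = fetch(url)                  # web crawler device 503 (fetch is injected)
    if html is None:                   # the page was not acquired
        return None
    parser = TitleAndTextParser()      # content analysis device 504
    parser.feed(html)
    parser.close()
    record = {
        "url": url,
        "host": urlsplit(url).hostname,  # server name taken from the URL
        "title": parser.title.strip(),
        "text": " ".join(parser.text_parts),
    }
    store.append(json.dumps(record, ensure_ascii=False))  # storage device 505
    return record
```

In this sketch `fetch` is any callable that returns the page source or `None`, so a real HTTP client or a dictionary of canned pages can be substituted without changing the rest of the flow.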
In this embodiment, before web pages are collected, the web crawler queue containing the Uniform Resource Locators (URLs) to be crawled is put into the in-memory database. This avoids the problem that URLs held in the crawler's own memory are lost when the web crawler system has to be restarted: after a restart, the URLs to be crawled can be quickly read back from the web crawler queue in the in-memory database, which ensures normal execution of the web crawler system. A content analysis tool extracts the web page content information from the acquired web page, which cleans the web page content, and the web page content information is finally stored, improving the efficiency and reliability of acquiring web page content information.
The application can be implemented based on the webmagic framework or on other frameworks.
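The restart behaviour described in this embodiment can be sketched as follows. The application does not name a particular in-memory database, so the class `InMemoryDatabase` below is only a stand-in for a database that runs outside the crawler process; its methods `push_tail`, `pop_head`, and `length`, the `Crawler` class, and the key name `crawler:queue` are hypothetical. The point illustrated is that the crawler object holds no pending URLs itself, so discarding it and creating a new one loses nothing.

```python
from collections import deque


class InMemoryDatabase:
    """Stand-in for an external in-memory database process (hypothetical API)."""

    def __init__(self):
        self._lists = {}

    def push_tail(self, key, value):
        self._lists.setdefault(key, deque()).append(value)

    def pop_head(self, key, count=1):
        # Return up to `count` items from the head of the list stored at `key`.
        items = self._lists.get(key, deque())
        return [items.popleft() for _ in range(min(count, len(items)))]

    def length(self, key):
        return len(self._lists.get(key, ()))


class Crawler:
    """Holds no URL state itself; every pending URL lives in the database."""

    QUEUE_KEY = "crawler:queue"  # hypothetical key name

    def __init__(self, db):
        self.db = db

    def enqueue(self, urls):
        for url in urls:
            self.db.push_tail(self.QUEUE_KEY, url)

    def take(self, batch_size=1):
        # One or more URLs per call, depending on processing capability.
        return self.db.pop_head(self.QUEUE_KEY, batch_size)
```

A short usage example under the same assumptions: a first crawler enqueues three URLs and takes one, is then discarded to simulate a restart, and a second crawler attached to the same database continues with the remaining two.

```python
db = InMemoryDatabase()
first = Crawler(db)
first.enqueue(["http://example.com/1", "http://example.com/2", "http://example.com/3"])
first.take(1)
del first                      # simulated restart of the crawler
second = Crawler(db)
remaining = second.take(5)     # the two URLs that were still queued
```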
According to another aspect of the present application, there is also provided a device for acquiring web page information, the device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform any of the methods described above.
According to another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the method of any one of the above.
For details of each computer-readable medium and apparatus embodiment of the present application, reference may be made to corresponding contents of each method embodiment, and details are not described herein again.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, part of the present application may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium, and/or transmitted via a data stream on a broadcast or other signal-bearing medium, and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (11)

1. A method for acquiring webpage information is characterized by comprising the following steps:
putting a web crawler queue containing a URL to be crawled into an in-memory database;
taking out the URL to be crawled from the web crawler queue in the in-memory database;
sending an acquisition request to a website corresponding to the URL, wherein the acquisition request is used for requesting a webpage corresponding to the URL to be crawled;
if the webpage is obtained from the website, extracting webpage content information from the webpage by adopting a content analysis tool;
and storing the webpage content information.
2. The method of claim 1, wherein prior to placing the web crawler queue containing URLs to be crawled into the in-memory database, further comprising:
sorting the URLs to be crawled according to a preset priority rule;
and placing the sorted URLs to be crawled into the web crawler queue.
3. The method according to claim 1 or 2, wherein after sending the acquisition request to the website corresponding to the URL, the method further comprises:
and if the webpage is not acquired from the website, putting the URL to be crawled back into the web crawler queue in the in-memory database.
4. The method of claim 3, wherein the placing the URL to be crawled back into the web crawler queue in the in-memory database comprises:
if the priority of the URL to be crawled is greater than or equal to a preset threshold, placing the URL to be crawled back at the head of the web crawler queue; or
if the priority of the URL to be crawled is less than the preset threshold, placing the URL to be crawled back at the tail of the web crawler queue.
5. The method according to any one of claims 1 to 4, further comprising, after fetching the URL to be crawled from the web crawler queue in the in-memory database:
starting a thread pool, and putting the URL to be crawled into the thread pool;
the sending of the acquisition request to the website corresponding to the URL includes:
and sending the acquisition request to the website through the thread pool.
6. The method according to any one of claims 1 to 4, wherein the sending the acquisition request to the website corresponding to the URL includes:
extracting an IP address from a preset proxy Internet Protocol (IP) queue;
and sending the acquisition request to the website through the extracted IP address.
7. The method according to any one of claims 1 to 6, wherein before or after sending the acquisition request to the website corresponding to the URL, the method further comprises:
acquiring a verification code image from the website;
recognizing the verification code from the verification code image by text recognition, and sending the verification code to the website.
8. The method according to any one of claims 1 to 7, wherein the acquisition request includes a cookie obtained when logging in to the website, and before sending the acquisition request to the website corresponding to the URL, the method further includes:
and acquiring the cookie from a browser used for logging in to the website.
9. The method according to any one of claims 1 to 8, wherein the storing the web page content information comprises:
and packaging the webpage content information into JSON format and then storing it.
10. An apparatus for obtaining web page information, the apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform the method of any of claims 1 to 9.
11. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 9.
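The priority handling of claims 2 to 4 can be illustrated with a short sketch. The claims leave the "preset priority rule" and the "preset threshold" open, so this example assumes that each URL carries a numeric priority and that a larger number means a higher priority; the function names `build_queue` and `requeue_failed` are hypothetical, and a plain `deque` stands in for the queue held in the in-memory database.

```python
from collections import deque


def build_queue(urls_with_priority):
    """Sort (url, priority) pairs by descending priority, then queue them (claim 2)."""
    ordered = sorted(urls_with_priority, key=lambda item: item[1], reverse=True)
    return deque(ordered)


def requeue_failed(queue, url, priority, threshold):
    """Put back a URL whose page was not acquired (claims 3 and 4)."""
    if priority >= threshold:
        queue.appendleft((url, priority))  # head of the queue: retried next
    else:
        queue.append((url, priority))      # tail of the queue: retried last
```

Under these assumptions, a failed URL with priority 9 and a threshold of 5 returns to the head of the queue, while a failed URL with priority 1 goes to the tail.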
CN201810688855.6A 2018-06-28 2018-06-28 The acquisition methods of webpage information obtain equipment and computer-readable medium Pending CN109033195A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810688855.6A CN109033195A (en) 2018-06-28 2018-06-28 The acquisition methods of webpage information obtain equipment and computer-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810688855.6A CN109033195A (en) 2018-06-28 2018-06-28 The acquisition methods of webpage information obtain equipment and computer-readable medium

Publications (1)

Publication Number Publication Date
CN109033195A true CN109033195A (en) 2018-12-18

Family

ID=65520811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810688855.6A Pending CN109033195A (en) 2018-06-28 2018-06-28 The acquisition methods of webpage information obtain equipment and computer-readable medium

Country Status (1)

Country Link
CN (1) CN109033195A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109831491A (en) * 2019-01-15 2019-05-31 科大国创软件股份有限公司 Intrusive social data acquisition method based on agency
CN109885744A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Web data crawling method, device, system, computer equipment and storage medium
CN109992707A (en) * 2019-03-18 2019-07-09 广州视源电子科技股份有限公司 Data crawling method and device, storage medium and server
CN110008393A (en) * 2018-12-29 2019-07-12 义语智能科技(上海)有限公司 It is a kind of for obtaining the method and apparatus of site information
CN110062025A (en) * 2019-03-14 2019-07-26 深圳绿米联创科技有限公司 Method, apparatus, server and the storage medium of data acquisition
CN110069686A (en) * 2019-03-15 2019-07-30 平安科技(深圳)有限公司 User behavior analysis method, apparatus, computer installation and storage medium
CN110134858A (en) * 2019-03-26 2019-08-16 国网重庆市电力公司 Method, system, storage medium and electronic device for converting unstructured data
CN110262888A (en) * 2019-06-26 2019-09-20 京东数字科技控股有限公司 The method and apparatus that method for scheduling task and device and calculate node execute task
CN111324797A (en) * 2020-02-20 2020-06-23 民生科技有限责任公司 Method and device for acquiring data accurately at high speed
CN111538550A (en) * 2020-04-17 2020-08-14 姜海强 Webpage information screening method based on image detection algorithm
CN111538883A (en) * 2020-03-25 2020-08-14 北京市科学技术情报研究所 Data crawling method, system and equipment
CN111753162A (en) * 2020-06-29 2020-10-09 平安国际智慧城市科技股份有限公司 Data crawling method, device, server and storage medium
CN112199567A (en) * 2020-09-27 2021-01-08 深圳市伊欧乐科技有限公司 Distributed data acquisition method, system, server and storage medium
CN112508362A (en) * 2020-11-24 2021-03-16 江苏省质量和标准化研究院 Product export information processing method and device, electronic equipment and storage medium
CN112579853A (en) * 2019-09-30 2021-03-30 顺丰科技有限公司 Method and device for sequencing crawling links and storage medium
CN112905867A (en) * 2019-03-14 2021-06-04 福建省天奕网络科技有限公司 Efficient historical data tracing and crawling method and terminal
CN112948654A (en) * 2019-11-26 2021-06-11 上海哔哩哔哩科技有限公司 Webpage crawling method and device and computer equipment
CN113761315A (en) * 2021-09-10 2021-12-07 未鲲(上海)科技服务有限公司 Webpage content crawling method and device, computer equipment and storage medium
CN114064998A (en) * 2021-11-17 2022-02-18 四川长虹电器股份有限公司 Data crawling method based on queue
CN115087969A (en) * 2020-05-14 2022-09-20 深圳市欢太科技有限公司 Information crawling method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
US7676553B1 (en) * 2003-12-31 2010-03-09 Microsoft Corporation Incremental web crawler using chunks
CN104408194A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Acquisition method and device of web crawler request
CN105207852A (en) * 2015-10-09 2015-12-30 西安未来国际信息股份有限公司 Method for directionally acquiring network data based on distributed mode

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676553B1 (en) * 2003-12-31 2010-03-09 Microsoft Corporation Incremental web crawler using chunks
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
CN104408194A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Acquisition method and device of web crawler request
CN105207852A (en) * 2015-10-09 2015-12-30 西安未来国际信息股份有限公司 Method for directionally acquiring network data based on distributed mode

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
夏征农: "《大辞海信息科学卷》", 31 December 2015, 上海辞书出版社 *
李小平: "《网络影视课程编导论》", 30 April 2016, 北京理工大学出版社 *
郑铁男: "《数字编辑实训教程》", 30 September 2017, 知识产权出版社 *
韦鹏程: "《大数据巨量分析与机器学习的整合与开发》", 31 May 2017, 电子科技大学出版社 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008393B (en) * 2018-12-29 2023-03-07 义语智能科技(上海)有限公司 Method and equipment for acquiring website information
CN110008393A (en) * 2018-12-29 2019-07-12 义语智能科技(上海)有限公司 It is a kind of for obtaining the method and apparatus of site information
CN109885744B (en) * 2019-01-07 2024-05-10 平安科技(深圳)有限公司 Webpage data crawling method, device, system, computer equipment and storage medium
CN109885744A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Web data crawling method, device, system, computer equipment and storage medium
CN109831491A (en) * 2019-01-15 2019-05-31 科大国创软件股份有限公司 Intrusive social data acquisition method based on agency
CN109831491B (en) * 2019-01-15 2022-03-15 科大国创软件股份有限公司 Invasive social data acquisition method based on agent
CN110062025A (en) * 2019-03-14 2019-07-26 深圳绿米联创科技有限公司 Method, apparatus, server and the storage medium of data acquisition
CN112905867A (en) * 2019-03-14 2021-06-04 福建省天奕网络科技有限公司 Efficient historical data tracing and crawling method and terminal
CN112905866B (en) * 2019-03-14 2022-06-07 福建省天奕网络科技有限公司 Historical data tracing and crawling method and terminal without manual participation
CN112905867B (en) * 2019-03-14 2022-06-07 福建省天奕网络科技有限公司 Efficient historical data tracing and crawling method and terminal
CN112905866A (en) * 2019-03-14 2021-06-04 福建省天奕网络科技有限公司 Historical data tracing and crawling method and terminal without manual participation
CN110069686A (en) * 2019-03-15 2019-07-30 平安科技(深圳)有限公司 User behavior analysis method, apparatus, computer installation and storage medium
CN109992707A (en) * 2019-03-18 2019-07-09 广州视源电子科技股份有限公司 Data crawling method and device, storage medium and server
CN110134858A (en) * 2019-03-26 2019-08-16 国网重庆市电力公司 Method, system, storage medium and electronic device for converting unstructured data
CN110262888A (en) * 2019-06-26 2019-09-20 京东数字科技控股有限公司 The method and apparatus that method for scheduling task and device and calculate node execute task
CN110262888B (en) * 2019-06-26 2020-11-20 京东数字科技控股有限公司 Task scheduling method and device and method and device for computing node to execute task
CN112579853A (en) * 2019-09-30 2021-03-30 顺丰科技有限公司 Method and device for sequencing crawling links and storage medium
CN112948654A (en) * 2019-11-26 2021-06-11 上海哔哩哔哩科技有限公司 Webpage crawling method and device and computer equipment
CN111324797B (en) * 2020-02-20 2023-08-11 民生科技有限责任公司 Method and device for precisely acquiring data at high speed
CN111324797A (en) * 2020-02-20 2020-06-23 民生科技有限责任公司 Method and device for acquiring data accurately at high speed
CN111538883A (en) * 2020-03-25 2020-08-14 北京市科学技术情报研究所 Data crawling method, system and equipment
CN111538883B (en) * 2020-03-25 2023-11-17 北京市科学技术情报研究所 Data crawling method, system and equipment
CN111538550A (en) * 2020-04-17 2020-08-14 姜海强 Webpage information screening method based on image detection algorithm
CN115087969A (en) * 2020-05-14 2022-09-20 深圳市欢太科技有限公司 Information crawling method and device, electronic equipment and storage medium
CN111753162A (en) * 2020-06-29 2020-10-09 平安国际智慧城市科技股份有限公司 Data crawling method, device, server and storage medium
CN112199567A (en) * 2020-09-27 2021-01-08 深圳市伊欧乐科技有限公司 Distributed data acquisition method, system, server and storage medium
CN112508362A (en) * 2020-11-24 2021-03-16 江苏省质量和标准化研究院 Product export information processing method and device, electronic equipment and storage medium
CN112508362B (en) * 2020-11-24 2024-04-23 江苏省质量和标准化研究院 Product outlet information processing method and device, electronic equipment and storage medium
CN113761315A (en) * 2021-09-10 2021-12-07 未鲲(上海)科技服务有限公司 Webpage content crawling method and device, computer equipment and storage medium
CN114064998A (en) * 2021-11-17 2022-02-18 四川长虹电器股份有限公司 Data crawling method based on queue

Similar Documents

Publication Publication Date Title
CN109033195A (en) The acquisition methods of webpage information obtain equipment and computer-readable medium
JP5990605B2 (en) Method and system for acquiring AJAX web page content
CN107895009B (en) Distributed internet data acquisition method and system
CN108304498B (en) Webpage data acquisition method and device, computer equipment and storage medium
US8799262B2 (en) Configurable web crawler
US9235640B2 (en) Logging browser data
US9614862B2 (en) System and method for webpage analysis
US7672938B2 (en) Creating search enabled web pages
US8245198B2 (en) Mapping breakpoints between web based documents
CN102982162B (en) The acquisition system of info web
CN102880607A (en) network dynamic content capturing method and network dynamic content crawler system
CN109033115A (en) A kind of dynamic web page crawler system
CN111177519B (en) Webpage content acquisition method, device, storage medium and equipment
CN110851681B (en) Crawler processing method, crawler processing device, server and computer readable storage medium
CN112532490A (en) Regression testing system and method and electronic equipment
CN110020062A (en) A kind of customized web crawlers method and system
CN106844486A (en) Crawl the method and device of dynamic web page
CN102262635A (en) Page crawler system and page crawler method
WO2016086784A1 (en) Method, apparatus and system for collecting webpage data
CN109246069B (en) Webpage login method and device and readable storage medium
US20140067854A1 (en) Crawling of generated server-side content
CN110719344B (en) Domain name acquisition method and device, electronic equipment and storage medium
CN101625692B (en) Method for rapidly collecting dynamic script website data
CN110764994A (en) Page element packaging method and device, electronic equipment and storage medium
CN112579947A (en) Webpage element graph intercepting method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218