
CN110209909A - Data crawling method, device, computer equipment and storage medium - Google Patents

Data crawling method, device, computer equipment and storage medium

Info

Publication number
CN110209909A
Authority
CN
China
Prior art keywords
data
code block
crawler
code
crawl
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910319429.XA
Other languages
Chinese (zh)
Inventor
张师琲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910319429.XA
Publication of CN110209909A
Priority to PCT/CN2019/118419 (WO2020211367A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present application relates to a data crawling method, a device, computer equipment and a storage medium. The method comprises: selecting required code blocks from a pre-built database according to a data crawling requirement; sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring a required crawler according to the code block sequence; and crawling data with the configured crawler to obtain crawled data. The database contains a plurality of code blocks, and the database is built in advance by crawling data from each of a plurality of preset websites and taking the computer code corresponding to each crawling step of the crawling process as one code block. The application can meet the different requirements of users.

Description

Data crawling method, device, computer equipment and storage medium
Technical field
The present invention relates to the field of crawler technology, and in particular to a data crawling method, a device, computer equipment and a storage medium.
Background
At present there are many kinds of open-source crawlers, but each has its own strengths and weaknesses, and none can satisfy all the requirements of data crawling. For example, with the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and extracting and using this information efficiently has become a major challenge. The Web carries data in many forms, such as pictures, databases, audio and video, and it also presents web pages of various forms and anti-crawling techniques of various forms. As a result, the crawlers currently available from the open-source community are no longer sufficient to support crawling requirements for different forms of data.
Summary of the invention
Embodiments of the present application provide a data crawling method, a device, computer equipment and a storage medium that can meet the different requirements of data crawling.
An embodiment of the present application provides a data crawling method, comprising:
selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence;
configuring a required crawler according to the code block sequence;
crawling data with the configured crawler to obtain crawled data;
wherein the database contains a plurality of code blocks, and the process of building the database in advance comprises:
crawling data from each of a plurality of preset websites, and taking the computer code corresponding to each crawling step of the data crawling process as one code block.
In some embodiments, the method further comprises: performing deduplication filtering on the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, configuring the required crawler according to the code block sequence comprises: determining a configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores explanatory information for generating the configuration file.
In some embodiments, crawling data from each of the plurality of preset websites comprises: writing corresponding computer code for each of the preset websites, and crawling data from each website using the computer code corresponding to that website.
In some embodiments, writing corresponding computer code for each of the preset websites comprises: writing the corresponding computer code for each of the preset websites using fine-grained isolation.
In some embodiments, crawling data with the configured crawler comprises: logging in to the corresponding website with the required crawler, which specifically comprises: sending, by the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address by the required crawler either periodically or when access is restricted or an access error occurs.
In some embodiments, configuring the required crawler according to the code block sequence comprises:
a1, configuring the seed;
a2, configuring the address of the seed;
a3, configuring whether the crawl is a full crawl;
a4, configuring whether JavaScript web page content or non-JavaScript web page content is crawled;
a5, configuring the keyword required for crawling;
a6, configuring the region in which the seed is located;
a7, configuring the page level at which capturing starts;
a8, configuring the page-turning mode;
a9, configuring the attributes of the fields to be captured.
An embodiment of the present application further provides a data crawling device, comprising:
a sequence determining module, configured to select required code blocks from a pre-built database according to a data crawling requirement, and to sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
a crawler configuration module, configured to configure a required crawler according to the code block sequence;
a data crawling module, configured to crawl data with the configured crawler to obtain crawled data;
a database building module, configured to build the database in advance, the database containing a plurality of code blocks, the database building module being specifically configured to: crawl data from each of a plurality of preset websites, and take the computer code corresponding to each crawling step of the data crawling process as one code block.
In some embodiments, the crawler configuration module is specifically configured to: determine the configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores explanatory information for generating the configuration file.
An embodiment of the present application further provides computer equipment comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above data crawling method.
An embodiment of the present application further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above data crawling method.
With the data crawling method, device, computer equipment and storage medium provided by the embodiments of the present application, the required code blocks are selected from the database according to the data crawling requirement, the selected code blocks are sorted according to the execution order of their steps to obtain a code block sequence, the required crawler is configured according to the code block sequence, and data is finally crawled with the configured crawler. Because the required code blocks are selected according to the data crawling requirement and then sorted, the embodiments in effect select a number of crawling steps according to the requirement and then combine and order them. A crawler configured in this way can meet the various requirements of users, for example whether to download an entire web page or capture precisely, and whether to crawl JavaScript web pages or non-JavaScript web pages. Moreover, the data crawling method provided by the embodiments is simple and easy to configure, and can crawl different websites and different forms of data.
Brief description of the drawings
Fig. 1 is a block diagram of the internal structure of computer equipment in one embodiment;
Fig. 2 is a flowchart of the data crawling method in one embodiment;
Fig. 3 is a structural block diagram of the data crawling device in one embodiment.
Detailed description
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to explain the present invention and not to limit it.
It can be understood that the terms "first", "second" and the like used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
Fig. 1 is a schematic structural diagram of computer equipment in one embodiment of the present application. As shown in Fig. 1, the computer equipment includes a processor, a non-volatile storage medium, a memory and a network interface connected through a system bus. The non-volatile storage medium of the computer equipment stores an operating system, a database and computer-readable instructions; the database may store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a data crawling method. The processor of the computer equipment provides computing and control capabilities and supports the operation of the entire computer equipment. The memory of the computer equipment may store computer-readable instructions which, when executed by the processor, cause the processor to perform a data crawling method. The network interface of the computer equipment is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in Fig. 1 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment to which the solution is applied; a specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
An embodiment of the present application provides a data crawling method, which can be executed by the computer equipment in Fig. 1. As shown in Fig. 2, the method comprises the following steps:
S21, selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence;
wherein the database contains a plurality of code blocks, and the process of building the database in advance comprises: crawling data from each of a plurality of preset websites, and taking the computer code corresponding to each crawling step of the data crawling process as one code block.
It can be understood that the above computer code is the code corresponding to a crawling step and may be referred to as crawling code.
In practical applications, the preset websites may be, for example, a shopping website, a dating website, a news website, a database website and so on. Websites of different types can be chosen as the preset websites so that the code blocks in the resulting database are more comprehensive and can be configured into a variety of crawlers.
It can be understood that, during database building, the code corresponding to each crawling step is taken as one code block. A code block may also be called a component; that is, one step corresponds to one code block, or one component. A step is, for example, the login step, the step of entering a list, the page-turning step or the pull-down scrolling step when crawling a web page. Saving the computer code of each step into the database as a code block is therefore equivalent to saving each step as a separate component.
In practical applications, crawling data from each of the preset websites may comprise: writing corresponding computer code for each of the preset websites, and crawling data from each website using the computer code corresponding to that website.
That is, computer code is first written for each preset website, which yields a crawler suited to that website. Data is then crawled using the computer code (i.e. the crawler) corresponding to each preset website, and the code corresponding to each step of the crawling process (a code block, which may also be called a component) is saved into the database. Writing computer code for each preset website in this way yields a crawler well suited to that website, so that each step of the crawling process can do its work effectively.
The process of writing corresponding computer code for each of the preset websites may comprise: writing, using fine-grained isolation, the corresponding computer code for crawling data for each of the preset websites. Put simply, fine-grained isolation subdivides the objects in the business model to obtain a more scientific and reasonable object model; intuitively, it means dividing the model into many objects. Specifically, when the computer code for crawling data is written for each preset website, computer code is written separately for each different crawl object, where the crawl objects include at least one of pictures, audio, video and text information. For example, when computer code is written for a news website, separate computer code is written with the pictures, the audio, the video and the text information on that news website as the crawl object. Subdividing each website into many crawl objects makes the code blocks in the database more comprehensive, so that various data crawling requirements can be met.
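One way to picture fine-grained isolation is a store that holds one code block per website and per crawl object. The following Python sketch is illustrative only and is not taken from the application: the names `CodeBlock`, `CodeBlockDatabase`, `extract_pictures`, `extract_text` and the site name `news.example` are hypothetical, and the regular expressions stand in for whatever site-specific code would actually be written.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Crawl objects named in the text; the string labels themselves are an assumption.
OBJECT_TYPES = ("picture", "audio", "video", "text")


@dataclass(frozen=True)
class CodeBlock:
    site: str                          # preset website the code was written for
    object_type: str                   # crawl object handled by this block
    run: Callable[[str], List[str]]    # hypothetical signature: page source -> items


class CodeBlockDatabase:
    """In-memory stand-in for the pre-built database of code blocks."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[str, str], CodeBlock] = {}

    def register(self, block: CodeBlock) -> None:
        if block.object_type not in OBJECT_TYPES:
            raise ValueError(f"unknown crawl object: {block.object_type}")
        # One block per (site, crawl object): this is the fine-grained split.
        self._blocks[(block.site, block.object_type)] = block

    def lookup(self, site: str, object_type: str) -> CodeBlock:
        return self._blocks[(site, object_type)]


def extract_pictures(html: str) -> List[str]:
    """Collect the src attribute of every <img> tag."""
    return re.findall(r'<img[^>]*\bsrc="([^"]+)"', html)


def extract_text(html: str) -> List[str]:
    """Collect the non-empty text of every <p> element."""
    paragraphs = re.findall(r"<p>(.*?)</p>", html, flags=re.S)
    return [p.strip() for p in paragraphs if p.strip()]


db = CodeBlockDatabase()
db.register(CodeBlock("news.example", "picture", extract_pictures))
db.register(CodeBlock("news.example", "text", extract_text))
```

Keeping pictures and text in separate blocks means a requirement such as "only pictures from this news site" can be served by looking up a single block, without touching the code written for the other crawl objects.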
For example, the steps corresponding to the code blocks in the database built by the above process may include: (1) logging in and recording the cookie; (2) entering the list page and crawling URLs; (3) entering the article page and crawling the article content; (4) clicking "next" to turn to the next page and continuing execution; (5) entering the article page and crawling the article content; (6) pulling down the scroll bar so that the next page of content appears; (7) entering content in the search box and searching.
It can be understood that data crawling requirements can be diverse, for example, which website to crawl and which kind of content (pictures, audio, video, text, etc.) on that website to crawl. Different data crawling requirements call for different code blocks.
It can be understood that, in the embodiment of the present application, the required code blocks are selected from the database according to the data crawling requirement. Because different code blocks correspond to different steps, the execution order of the code blocks is the execution order of the corresponding steps. The code blocks therefore need to be sorted, which is equivalent to sorting the steps by execution order.
For example, suppose the user wants to crawl content from Sina Weibo. From this data crawling requirement it follows that the crawling steps may include: logging in, searching for a hot word, crawling the microblog ID, microblog content, posting time and so on, and turning the page. With the steps listed above, the step order is roughly (1)-(7)-(3)-(4). The code blocks corresponding to steps (1), (3), (4) and (7) therefore need to be selected from the database and then sorted according to the execution order (1)-(7)-(3)-(4) to obtain the corresponding code block sequence.
As another example, suppose the user wants to crawl content from NetEase News. From this data crawling requirement it follows that the crawling steps include: entering the list page and crawling URLs, entering the article page, and scrolling down to turn the page. With the steps listed above, the step order is roughly (2)-(3)-(6). The code blocks corresponding to steps (2), (3) and (6) therefore need to be selected from the database and then sorted according to the execution order (2)-(3)-(6) to obtain the corresponding code block sequence.
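The selection and ordering in S21 can be sketched as follows. This Python example is a minimal illustration under stated assumptions: `STEP_LIBRARY` and `build_sequence` are hypothetical names, the database is reduced to a dictionary, and each code block is represented only by its step description rather than by real crawling code. The step numbers follow the example list in the text.

```python
from typing import Dict, List, Tuple

# Step numbers and descriptions follow the example list of seven steps.
STEP_LIBRARY: Dict[int, str] = {
    1: "log in and record the cookie",
    2: "enter the list page and crawl URLs",
    3: "enter the article page and crawl the article content",
    4: "click next to turn to the next page",
    5: "enter the article page and crawl the article content",
    6: "pull down the scroll bar to load the next page",
    7: "enter content in the search box and search",
}


def build_sequence(execution_order: List[int]) -> List[Tuple[int, str]]:
    """Select the needed code blocks, then arrange them in execution order."""
    needed = set(execution_order)                  # which blocks are required
    missing = needed.difference(STEP_LIBRARY)
    if missing:
        raise KeyError(f"no code block stored for steps {sorted(missing)}")
    selected = {step: STEP_LIBRARY[step] for step in needed}           # selection
    return [(step, selected[step]) for step in execution_order]        # ordering


# Sina Weibo example: steps (1), (3), (4), (7) executed as (1)-(7)-(3)-(4).
weibo_sequence = build_sequence([1, 7, 3, 4])
# NetEase News example: steps (2), (3), (6) executed as (2)-(3)-(6).
netease_sequence = build_sequence([2, 3, 6])
```

The execution order is passed in explicitly because it comes from the data crawling requirement, not from the step numbers: in the Weibo example step (7) runs before step (3).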
S22, configuring the required crawler according to the code block sequence;
It can be understood that configuring the required crawler is in effect the process of generating a configuration file; once the configuration file is obtained, configuration of the required crawler is complete. The specific process of step S22 may therefore comprise: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document. The instruction document may store explanatory information that helps the user generate the configuration file, for example the steps for generating the configuration file and the information needed in each step.
In practical applications, the configuration can be written in Extensible Markup Language (XML); that is, the code in the configuration file can take the form of XML, which improves the generality of the required crawler.
For example, for the above data crawling requirement in which the user wants to crawl content from Sina Weibo, the code block sequence is the one corresponding to steps (1)-(7)-(3)-(4), and the configuration file can be generated from this code block sequence.
It can be understood that a data crawling requirement covers not only which website and which kind of content to crawl, but also whether the crawl is full or incremental, whether JavaScript or non-JavaScript web page content is crawled, from which page level capturing starts, whether the page-turning mode is pull-down sliding, what attributes the fields to be captured have, and so on. These items therefore also need to be configured.
In a specific implementation, configuring the crawler according to the code block sequence may comprise configuring at least one of: the seed, the address of the seed, the region of the seed, whether the crawl is a full crawl, the keyword required for crawling, the page-turning mode, the attributes of the fields to be captured, the page level at which capturing starts, and whether JavaScript web page content is captured.
The specific process may include the following steps:
a1, configuring the seed; seed denotes the seed which, as the name suggests, is the starting point from which the crawl spreads out;
a2, configuring the address of the seed; url denotes the seed address, for example, url is configured as http://www.chinanews.com/business/gd.shtml;
a3, configuring whether the crawl is a full crawl; fully indicates whether the crawl is a full crawl, where fully set to 1 means yes and fully set to 0 means no;
a4, configuring whether JavaScript web page content or non-JavaScript web page content is crawled; for example, javascript indicates whether the page is a JavaScript web page, where javascript set to 1 means yes and javascript set to 0 means no;
a5, configuring the keyword; keyword denotes the keyword, and the keyword may also be left unset in the code;
a6, configuring the region of the seed; seedArea denotes the seed region, and if it is left empty, URLs are taken from the whole web page; in the above code fragment the seed region is <![CDATA[#content_right> div.content_list]]>;
a7, configuring from which level of web page capturing starts; start denotes the level of web page from which content is captured, for example, capturing starts from the level-2 web page;
a8, configuring the page-turning mode; turning denotes the page-turning mode, and turning configured as slider means that the page-turning mode is pull-down sliding;
a9, configuring the attributes of the fields to be captured; meta denotes the attributes of the fields to be captured, for example, field denotes the field, site the address, tag the label, index the index, pic the picture, and so on.
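The code fragment that steps a1 to a9 refer to is not reproduced in this copy of the text. As an illustration only, the following Python sketch builds a hypothetical XML configuration from the element names mentioned above (seed, url, fully, javascript, keyword, seedArea, start, turning, meta) and reads it with the standard `xml.etree.ElementTree` module. The `<crawler>` root element, the nesting, the representation of `meta` as attributes, and the `load_config` helper are all assumptions rather than the layout used in the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout; only the element and attribute names come from the text.
CONFIG_XML = """
<crawler>
  <seed>
    <url>http://www.chinanews.com/business/gd.shtml</url>
    <fully>0</fully>
    <javascript>0</javascript>
    <keyword></keyword>
    <seedArea><![CDATA[#content_right> div.content_list]]></seedArea>
    <start>2</start>
    <turning>slider</turning>
    <meta field="title" site="" tag="h1" index="0" pic="0"/>
  </seed>
</crawler>
"""


def load_config(xml_text: str) -> dict:
    """Parse the sketch configuration into a plain dictionary."""
    seed = ET.fromstring(xml_text.strip()).find("seed")
    if seed is None:
        raise ValueError("configuration has no <seed> element")

    def text(tag: str) -> str:
        # findtext returns None for a missing element; treat that as empty.
        return (seed.findtext(tag) or "").strip()

    return {
        "url": text("url"),                          # a2: seed address
        "fully": text("fully") == "1",               # a3: 1 = full crawl
        "javascript": text("javascript") == "1",     # a4: 1 = JavaScript page
        "keyword": text("keyword") or None,          # a5: may be left unset
        "seed_area": text("seedArea") or None,       # a6: empty = whole page
        "start": int(text("start") or "1"),          # a7: level to start from
        "turning": text("turning") or None,          # a8: e.g. "slider"
        "meta": [dict(m.attrib) for m in seed.findall("meta")],  # a9
    }


config = load_config(CONFIG_XML)
```

Wrapping the seed region in a CDATA section keeps the `>` of the CSS selector from being read as markup, which is presumably why the original fragment uses CDATA for that value.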
It can be seen from the above code fragment that either JavaScript web pages or non-JavaScript web pages can be selected; that is, both JavaScript web page capture and non-JavaScript page capture can be achieved. When a JavaScript web page is selected, the JavaScript code can be interpreted accurately and then converted into HTML code with normal tags. It can be understood that a JavaScript web page is a dynamically generated page, whereas a non-JavaScript web page is a statically generated page.
Because, in the embodiment of the present application, different code blocks can be combined and sorted according to the data crawling requirement (i.e. the various steps can be combined and configured freely), and the crawler is configured according to the resulting code block sequence, the configured crawler can download complete pages and can also capture precisely, for example capturing only pictures. Of course, by setting the data crawling requirement accordingly, cluster-distributed crawling can also be achieved to increase crawling speed.
It can be seen that, whatever the data crawling requirement is, the required crawler can be configured in the above manner.
Of course, in practical applications the configuration file can also be uploaded to and stored on a server so that it can later be obtained directly for the same data crawling requirement; that is, the configuration file is obtained from the server and data is crawled according to it, which is more convenient.
S23, crawling data with the configured crawler to obtain crawled data.
When crawling data, the crawler may encounter a website's anti-crawling mechanism. The anti-crawling mechanism here means that when one proxy IP address accesses a website frequently, the website restricts access from that proxy IP address. This problem can be mitigated in either of the following two ways:
(1) The crawler sends a login request to the server of the website to be logged in to. The login request carries the proxy address (i.e. the proxy IP address) used to log in to the server of the website, and the proxy address is modified periodically. This avoids being restricted for frequently accessing the website from the same proxy address. For example, the crawler modifies the proxy address every half hour and stores the modified proxy address; when the website needs to be accessed, the modified proxy address is retrieved.
(2) The crawler sends a login request to the server of the website to be logged in to. The login request carries the proxy address (i.e. the proxy IP address) used to log in to the server of the website, and the crawler modifies the proxy address when access is restricted or an access error occurs. When the server finds that a proxy address is accessing its website frequently, it intercepts the access and returns an access-restricted or access-error message to the sender of the login request, i.e. the crawler. After receiving this message, the crawler modifies the proxy address and sends the login request again, this time carrying the modified proxy address. Once the proxy address has been modified, the server of the website no longer intercepts the request. For example, when the crawler receives an access-restricted or access-error response after sending a login request to the server of the website, it modifies the proxy address in the login request and then sends a login request carrying the modified proxy address, so that it can log in to the website successfully.
In either way, the proxy address can be modified as needed. For example, if the proxy address used last time was 192.168.1.1, the proxy address used next time may be changed to 192.168.2.1.
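The two ways of changing the proxy address can be sketched together. The Python example below is illustrative and makes several assumptions: `ProxyRotator` and `login_with_retry` are hypothetical names, the pool of proxy addresses is supplied by the caller, and `send` is a placeholder for the code that actually issues the login request through a given proxy and reports whether it succeeded. No network access is performed.

```python
import time
from typing import Callable, List


class ProxyRotator:
    """Change the proxy address periodically, or on demand after a failure."""

    def __init__(self, proxies: List[str], interval_s: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not proxies:
            raise ValueError("at least one proxy address is required")
        self._proxies = list(proxies)
        self._interval_s = interval_s      # 1800 s = the half hour in the text
        self._clock = clock                # injectable so the timing is testable
        self._index = 0
        self._last_switch = clock()

    def rotate(self) -> str:
        """Switch to the next proxy address and remember when that happened."""
        self._index = (self._index + 1) % len(self._proxies)
        self._last_switch = self._clock()
        return self._proxies[self._index]

    def current(self) -> str:
        """Way (1): switch automatically once the interval has elapsed."""
        if self._clock() - self._last_switch >= self._interval_s:
            self.rotate()
        return self._proxies[self._index]


def login_with_retry(send: Callable[[str], bool], rotator: ProxyRotator,
                     max_attempts: int = 3) -> bool:
    """Way (2): on a restricted-access or error response, switch and resend.

    `send(proxy)` is a hypothetical callable that sends the login request
    through `proxy` and returns True on success, False otherwise.
    """
    for _ in range(max_attempts):
        if send(rotator.current()):
            return True
        rotator.rotate()
    return False
```

With the addresses from the text, `ProxyRotator(["192.168.1.1", "192.168.2.1"])` starts on 192.168.1.1 and moves to 192.168.2.1 either after half an hour or as soon as a login attempt is rejected.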
In practical applications, the data obtained by crawling may contain duplicate pages and/or advertisements. In this case, deduplication filtering can be performed on the crawled data using a locality-sensitive hashing algorithm.
The locality-sensitive hashing algorithm here is the simhash algorithm, whose principle is roughly as follows. The crawled text first undergoes basic preprocessing, such as stop-word removal (stop words being numbers, quantifiers, function words and other words without significant meaning), stemming and segmentation (i.e. chunking), which finally yields a number of vectors. Each vector is converted by a hash algorithm into a hash code of f bits. The 1/0 value at each bit is then converted into a positive or negative weight: for example, when bit f1 is 1 the weight is set to +weight, and when bit f1 is 0 the weight is set to -weight, so that each vector corresponds to an f-dimensional weight vector. The weight vectors of all vectors are summed position by position, giving an f-dimensional weight array. Positions with a positive value are set to 1 and positions with a negative value are set to 0, so the text is converted into a new f-bit 1/0 array, i.e. a new hash code, which serves as the hash fingerprint. The hash fingerprint is then used for deduplication and filtering, removing large numbers of duplicate pages, advertisements and the like.
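The fingerprint computation described above can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the application: preprocessing is reduced to lowercasing and word splitting (no stop-word removal or stemming), the token frequency is used as the weight, f is fixed at 64 bits, BLAKE2b from the standard `hashlib` module is the per-token hash, and the Hamming-distance threshold of 3 in `deduplicate` is an assumed value. The function names are hypothetical.

```python
import hashlib
import re
from collections import Counter
from typing import List

BITS = 64  # f, the fingerprint length in bits (assumed value)


def simhash(text: str) -> int:
    """Return a 64-bit simhash fingerprint of the text."""
    tokens = re.findall(r"\w+", text.lower())   # simplified preprocessing
    weights = Counter(tokens)                   # weight = token frequency (assumption)
    totals = [0] * BITS
    for token, weight in weights.items():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=BITS // 8).digest()
        code = int.from_bytes(digest, "big")    # f-bit hash code of this token
        for i in range(BITS):
            # Bit 1 contributes +weight, bit 0 contributes -weight.
            totals[i] += weight if (code >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:                           # positive position -> 1, else 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(pages: List[str], max_distance: int = 3) -> List[str]:
    """Keep a page only if no kept page has a fingerprint within max_distance."""
    kept: List[str] = []
    fingerprints: List[int] = []
    for page in pages:
        fp = simhash(page)
        if all(hamming_distance(fp, seen) > max_distance for seen in fingerprints):
            kept.append(page)
            fingerprints.append(fp)
    return kept
```

Because similar texts share most of their tokens, their summed weight arrays mostly agree in sign, so their fingerprints differ in only a few bits; comparing fingerprints by Hamming distance therefore catches near-duplicates as well as exact copies.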
With the data crawling method provided by the embodiments of the present application, the required code blocks are selected from the database according to the data crawling requirement, the selected code blocks are sorted according to the execution order of their steps to obtain a code block sequence, the required crawler is configured according to the code block sequence, and data is finally crawled with the configured crawler. Because the required code blocks are selected according to the data crawling requirement and then sorted, the embodiments in effect select a number of crawling steps according to the requirement and then combine and order them. A crawler configured in this way can meet the various requirements of users, for example whether to download an entire web page or capture precisely, and whether to capture JavaScript web pages or non-JavaScript web pages. Moreover, the data crawling method provided by the embodiments is simple and easy to configure, and can crawl different websites and different forms of data.
As shown in Fig. 3, in one embodiment a data crawling device 30 is provided. The device 30 can be integrated in the above computer equipment and may specifically include:
a sequence determining module 32, configured to select required code blocks from a pre-built database according to a data crawling requirement, and to sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
a crawler configuration module 33, configured to configure a required crawler according to the code block sequence;
a data crawling module 34, configured to crawl data with the configured crawler to obtain crawled data;
a database building module 31, configured to build the database in advance, the database containing a plurality of code blocks, the database building module being specifically configured to: crawl data from each of a plurality of preset websites, and take the computer code corresponding to each crawling step of the data crawling process as one code block.
In some embodiments, described device further include: duplicate removal filtering module, for using local sensitivity hash algorithm pair The data that crawl carry out heavy filtration.
In some embodiments, the crawler configuration module is specifically configured to determine the configuration file of the required crawler according to the code block sequence and a preset instruction document, where the instruction document stores the description information used to generate the configuration file.
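The format of the instruction document and of the configuration file is not specified in the application. Purely as an assumed example, the sketch below treats the instruction document as a mapping from code block name to default settings and emits the configuration file as JSON; `INSTRUCTION_DOC`, `build_config`, and all setting names are hypothetical.

```python
import json
from typing import Dict, List

# Hypothetical instruction document: for each code block, the settings
# that must appear in the generated configuration file.
INSTRUCTION_DOC: Dict[str, dict] = {
    "fetch": {"timeout_seconds": 10, "render_javascript": False},
    "parse": {"fields": ["title"]},
    "store": {"output": "items.jsonl"},
}


def build_config(block_sequence: List[str], instructions: Dict[str, dict]) -> str:
    """Generate a JSON configuration that lists the blocks in execution order."""
    steps = []
    for name in block_sequence:
        if name not in instructions:
            raise KeyError(f"no instruction entry for code block {name!r}")
        steps.append({"block": name, "settings": dict(instructions[name])})
    return json.dumps({"steps": steps}, indent=2)


config_text = build_config(["fetch", "parse", "store"], INSTRUCTION_DOC)
```

Keeping the per-block descriptions in a separate document means that the same code block sequence can yield a valid configuration file without the user knowing the settings each block expects.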
In some embodiments, crawling data from each of the multiple preset websites in the database construction module includes: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
In some embodiments, writing the corresponding computer code for each of the multiple preset websites in the database construction module includes: writing the corresponding computer code for each of the multiple preset websites using a fine-grained isolation technique.
In some embodiments, crawling data in the data crawling module using the configured crawler includes logging in to the corresponding website with the required crawler, which specifically includes: sending, through the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address through the required crawler either periodically or when access is restricted or an access error occurs.
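The proxy handling described above can be illustrated with a small Python sketch. It is an assumption-laden example rather than the applicant's implementation: `ProxyRotator`, `login`, the 300-second default interval, and the convention that `send` returns an HTTP-like status code are all hypothetical, and the actual network request is left to the caller-supplied `send` function.

```python
import itertools
import time
from typing import Callable, Iterable


class ProxyRotator:
    """Hands out a proxy address and changes it periodically or on demand."""

    def __init__(self, proxies: Iterable[str], rotate_every: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pool = itertools.cycle(list(proxies))
        self._rotate_every = rotate_every
        self._clock = clock
        self.current = next(self._pool)      # raises StopIteration if the pool is empty
        self._last_change = clock()

    def rotate(self) -> str:
        """Switch to the next proxy address immediately."""
        self.current = next(self._pool)
        self._last_change = self._clock()
        return self.current

    def get(self) -> str:
        """Return the current proxy, rotating first if the interval has elapsed."""
        if self._clock() - self._last_change >= self._rotate_every:
            self.rotate()
        return self.current


def login(send: Callable[[str], int], rotator: ProxyRotator,
          max_attempts: int = 3) -> bool:
    """Send a login request through the current proxy; change proxy on failure.

    `send(proxy)` is assumed to perform the request and return a status code.
    Any status other than 200 is treated here as restricted access or an error.
    """
    for _ in range(max_attempts):
        status = send(rotator.get())
        if status == 200:
            return True
        rotator.rotate()
    return False
```

Injecting the clock and the `send` function keeps the rotation logic testable without a network connection or real waiting.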
In some embodiments, the crawler configuration module is specifically configured to: a1, configure the seed; a2, configure the address of the seed; a3, configure whether the crawl is a full crawl; a4, configure whether JavaScript webpage content or non-JavaScript webpage content is crawled; a5, configure the keywords required for crawling; a6, configure the region where the seed is located; a7, configure the level at which webpage crawling starts; a8, configure the page-turning mode; a9, configure the attributes of the fields to be crawled.
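Configuration items a1 to a9 can be pictured as the fields of a single configuration record. The Python sketch below is only one possible reading: the field names, default values, the `PAGINATION_MODES` set, and the validation rules are assumptions, and the interpretation of a6 as a free-text region and a7 as a non-negative start level is likewise hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Hypothetical set of supported page-turning modes.
PAGINATION_MODES = {"next_link", "page_number", "scroll"}


@dataclass
class CrawlerConfig:
    seed_name: str                                          # a1: the seed
    seed_url: str                                           # a2: address of the seed
    full_crawl: bool = True                                 # a3: full crawl or not
    render_javascript: bool = False                         # a4: JavaScript or non-JavaScript pages
    keywords: List[str] = field(default_factory=list)       # a5: keywords required for crawling
    seed_region: str = ""                                   # a6: region where the seed is located
    start_level: int = 0                                    # a7: level at which crawling starts
    pagination: str = "next_link"                           # a8: page-turning mode
    fields: Dict[str, str] = field(default_factory=dict)    # a9: field name -> attribute

    def validate(self) -> None:
        """Reject obviously inconsistent settings before the crawler is built."""
        if not self.seed_url.startswith(("http://", "https://")):
            raise ValueError("seed_url must be an http(s) URL")
        if self.start_level < 0:
            raise ValueError("start_level must be non-negative")
        if self.pagination not in PAGINATION_MODES:
            raise ValueError(f"unknown pagination mode: {self.pagination}")


config = CrawlerConfig(
    seed_name="news",
    seed_url="https://example.com/news",
    render_javascript=True,
    keywords=["crawler"],
    fields={"title": "text"},
)
config.validate()
config_dict = asdict(config)   # plain dict, ready to be written to a file
```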
In the data crawling device provided by the embodiments of the present application, the sequence determining module selects the required code blocks from the database according to the data crawling demand and sorts the selected code blocks according to the execution order of their steps to obtain a code block sequence; the crawler configuration module then configures the required crawler according to the code block sequence; and finally the data crawling module crawls data using the configured crawler. Because the required code blocks are selected according to the data crawling demand and then sorted, this is equivalent to selecting multiple crawl steps according to the demand and then combining and ordering those steps. A crawler configured in this way can meet a variety of user demands, for example whether to download entire webpages or to capture content precisely, and whether to crawl JavaScript webpages or non-JavaScript webpages. Moreover, the data crawling method provided by the embodiments of the present application is simple and easy to configure, and can crawl data in various forms from different websites.
In some embodiments, a computer equipment is proposed. The computer equipment includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the following steps: selecting the required code blocks from a database constructed in advance according to the data crawling demand; sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring the required crawler according to the code block sequence; and crawling data using the configured crawler to obtain crawled data. The database includes multiple code blocks, and the process of constructing the database in advance includes: crawling data from each of multiple preset websites, and taking the computer code corresponding to each crawl step in the data crawling process as one code block.
In some embodiments, when executing the computer program, the processor further implements the following step: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, the configuring of the required crawler according to the code block sequence, as executed by the processor, includes: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document, where the instruction document stores the description information used to generate the configuration file.
In some embodiments, the crawling of data from each of the multiple preset websites, as executed by the processor, includes: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
In some embodiments, the writing of the corresponding computer code for each of the multiple preset websites, as executed by the processor, includes: writing the corresponding computer code for each of the multiple preset websites using a fine-grained isolation technique.
In some embodiments, the crawling of data using the configured crawler, as executed by the processor, includes logging in to the corresponding website with the required crawler, which specifically includes: sending, through the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address through the required crawler either periodically or when access is restricted or an access error occurs.
In some embodiments, the configuring of the required crawler according to the code block sequence, as executed by the processor, includes: a1, configuring the seed; a2, configuring the address of the seed; a3, configuring whether the crawl is a full crawl; a4, configuring whether JavaScript webpage content or non-JavaScript webpage content is crawled; a5, configuring the keywords required for crawling; a6, configuring the region where the seed is located; a7, configuring the level at which webpage crawling starts; a8, configuring the page-turning mode; a9, configuring the attributes of the fields to be crawled.
The beneficial effects of the computer equipment provided by the present application are the same as those of the above data crawling method and device, and are not repeated here.
In one embodiment, a storage medium storing computer-readable instructions is proposed. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: selecting the required code blocks from a database constructed in advance according to the data crawling demand; sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring the required crawler according to the code block sequence; and crawling data using the configured crawler to obtain crawled data. The database includes multiple code blocks, and the process of constructing the database in advance includes: crawling data from each of multiple preset websites, and taking the computer code corresponding to each crawl step in the data crawling process as one code block.
In some embodiments, when executing the computer program, the one or more processors further implement the following step: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, the configuring of the required crawler according to the code block sequence, as executed by the one or more processors, includes: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document, where the instruction document stores the description information used to generate the configuration file.
In some embodiments, the crawling of data from each of the multiple preset websites, as executed by the one or more processors, includes: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
In some embodiments, the writing of the corresponding computer code for each of the multiple preset websites, as executed by the one or more processors, includes: writing the corresponding computer code for each of the multiple preset websites using a fine-grained isolation technique.
In some embodiments, the crawling of data using the configured crawler, as executed by the one or more processors, includes logging in to the corresponding website with the required crawler, which specifically includes: sending, through the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address through the required crawler either periodically or when access is restricted or an access error occurs.
In some embodiments, the configuring of the required crawler according to the code block sequence, as executed by the one or more processors, includes: a1, configuring the seed; a2, configuring the address of the seed; a3, configuring whether the crawl is a full crawl; a4, configuring whether JavaScript webpage content or non-JavaScript webpage content is crawled; a5, configuring the keywords required for crawling; a6, configuring the region where the seed is located; a7, configuring the level at which webpage crawling starts; a8, configuring the page-turning mode; a9, configuring the attributes of the fields to be crawled.
The beneficial effects of the storage medium provided by the present application are the same as those of the data crawling method and device, and are not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or may be a random access memory (RAM), etc.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A data crawling method, characterized by comprising:
selecting the required code blocks from a database constructed in advance according to the data crawling demand; and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence;
configuring the required crawler according to the code block sequence;
crawling data using the configured crawler to obtain crawled data;
wherein the database includes multiple code blocks, and the process of constructing the database in advance comprises:
crawling data from each of multiple preset websites, and taking the computer code corresponding to each crawl step in the data crawling process as one code block.
2. The method according to claim 1, characterized by further comprising: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
3. The method according to claim 1, characterized in that configuring the required crawler according to the code block sequence comprises: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores the description information used to generate the configuration file.
4. The method according to claim 1, characterized in that crawling data from each of the multiple preset websites comprises: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
5. The method according to claim 4, characterized in that writing the corresponding computer code for each of the multiple preset websites comprises: writing the corresponding computer code for each of the multiple preset websites using a fine-grained isolation technique.
6. The method according to any one of claims 1 to 5, characterized in that crawling data using the configured crawler comprises logging in to the corresponding website with the required crawler, which specifically comprises: sending, through the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address through the required crawler either periodically or when access is restricted or an access error occurs.
7. The method according to any one of claims 1 to 5, characterized in that configuring the required crawler according to the code block sequence comprises:
a1, configuring the seed;
a2, configuring the address of the seed;
a3, configuring whether the crawl is a full crawl;
a4, configuring whether JavaScript webpage content or non-JavaScript webpage content is crawled;
a5, configuring the keywords required for crawling;
a6, configuring the region where the seed is located;
a7, configuring the level at which webpage crawling starts;
a8, configuring the page-turning mode;
a9, configuring the attributes of the fields to be crawled.
8. A data crawling device, characterized in that the device comprises:
a sequence determining module, configured to select the required code blocks from a database constructed in advance according to the data crawling demand, and to sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
a crawler configuration module, configured to configure the required crawler according to the code block sequence;
a data crawling module, configured to crawl data using the configured crawler to obtain crawled data;
a database construction module, configured to construct the database in advance, the database including multiple code blocks, wherein the database construction module is specifically configured to: crawl data from each of multiple preset websites, and take the computer code corresponding to each crawl step in the data crawling process as one code block.
9. A computer equipment, characterized by comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the data crawling method according to any one of claims 1 to 7.
10. A storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the data crawling method according to any one of claims 1 to 7.
CN201910319429.XA 2019-04-19 2019-04-19 Data crawling method, device, computer equipment and storage medium Pending CN110209909A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910319429.XA CN110209909A (en) 2019-04-19 2019-04-19 Data crawling method, device, computer equipment and storage medium
PCT/CN2019/118419 WO2020211367A1 (en) 2019-04-19 2019-11-14 Data crawling method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910319429.XA CN110209909A (en) 2019-04-19 2019-04-19 Data crawling method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110209909A true CN110209909A (en) 2019-09-06

Family

ID=67786028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910319429.XA Pending CN110209909A (en) 2019-04-19 2019-04-19 Data crawling method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110209909A (en)
WO (1) WO2020211367A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766014A (en) * 2015-04-30 2015-07-08 安一恒通(北京)科技有限公司 Method and system used for detecting malicious website
CN107729508A (en) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 Information crawler method and apparatus
CN108153880A (en) * 2017-12-26 2018-06-12 北京非斗数据科技发展有限公司 A kind of more tactful self-adapting crawling technologies about network picture
CN109063144A (en) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 Visual network crawler method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567513B (en) * 2011-12-27 2014-09-17 北京神州绿盟信息安全科技股份有限公司 Method and equipment for collecting phishing websites
US9792629B2 (en) * 2013-08-05 2017-10-17 Yahoo Holdings, Inc. Keyword recommendation
CN110209909A (en) * 2019-04-19 2019-09-06 平安科技(深圳)有限公司 Data crawling method, device, computer equipment and storage medium
CN110189189A (en) * 2019-04-19 2019-08-30 平安科技(深圳)有限公司 One-stop shopping at network bootstrap technique, device, computer equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020211367A1 (en) * 2019-04-19 2020-10-22 平安科技(深圳)有限公司 Data crawling method and apparatus, computer device and storage medium
CN112541104A (en) * 2019-09-20 2021-03-23 浙江大搜车软件技术有限公司 Data capturing method and device
CN111597421A (en) * 2020-04-30 2020-08-28 武汉思普崚技术有限公司 Method, device, equipment and storage medium for realizing website picture crawler
CN111597421B (en) * 2020-04-30 2022-08-30 武汉思普崚技术有限公司 Method, device, equipment and storage medium for realizing website picture crawler
CN112732996A (en) * 2021-01-11 2021-04-30 深圳市洪堡智慧餐饮科技有限公司 Multi-platform distributed data crawling method based on asynchronous aiohttp
CN113542223A (en) * 2021-06-16 2021-10-22 杭州拼便宜网络科技有限公司 Equipment fingerprint-based crawler-resisting method

Also Published As

Publication number Publication date
WO2020211367A1 (en) 2020-10-22

Similar Documents

Publication Publication Date Title
CN110209909A (en) Data crawling method, device, computer equipment and storage medium
Khalil et al. RCrawler: An R package for parallel web crawling and scraping
US10572863B2 (en) Systems and methods for managing allocation of machine data storage
CN102985921B (en) There is the client terminal device high speed caching electronic document resources of e-sourcing data base
CN110189189A (en) One-stop shopping at network bootstrap technique, device, computer equipment and storage medium
CN104424199A (en) Search method and device
CN107807937B (en) Website SEO processing method, device and system
CN106575298A (en) Fast rendering of websites containing dynamic content and stale content
CN108932332A (en) The loading method and device of static resource
CN106126693A (en) The sending method of the related data of a kind of webpage and device
CN107688568A (en) Acquisition method and device based on web page access behavior record
CN106407371A (en) User comment data displaying method and system, server and client
CN103699674A (en) Webpage storing method, webpage opening method, webpage storing device, webpage opening device and webpage browsing system
CN105283843B (en) Embeddable media content search widget
US9398068B2 (en) Bulk uploading of multiple self-referencing objects
CN106201562A (en) A kind of page switching method and device
CN102591916A (en) Webpage opening method and website system
CN103455547B (en) A kind of method and device for webpage loading
CN108334619A (en) A kind of collecting method, device, computing device and storage medium
CN106886547A (en) A kind of scenario generation method and device
Chang A survey of modern crawler methods
US9824151B2 (en) Providing a portion of requested data based upon historical user interaction with the data
CN110020273A (en) For generating the method, apparatus and system of thermodynamic chart
CN107147645A (en) The acquisition methods and device of network security data
US9846605B2 (en) Server-side minimal download and error failover

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190906