CN110209909A - Data crawling method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN110209909A (application CN201910319429.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- code block
- crawler
- code
- crawl
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
This application relates to a data crawling method, device, computer equipment, and storage medium. The method comprises: selecting the required code blocks from a pre-built database according to a data crawling requirement; sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring the required crawler according to the code block sequence; and crawling data with the configured crawler to obtain crawled data. The database contains multiple code blocks and is built in advance by crawling data from each of a plurality of preset websites and taking the computer code corresponding to each crawling step of the crawling process as one code block. The application can satisfy the differing requirements of users.
Description
Technical field
The present invention relates to the field of crawler technology, and more particularly to a data crawling method, device, computer equipment, and storage medium.
Background art
Many open-source crawlers are currently available, but each has its own strengths and weaknesses, and none meets the full range of data crawling requirements. With the rapid development of the network, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information efficiently has become a major challenge. Data on the Web takes many forms, such as pictures, databases, audio, and video, and there are also many forms of web pages and many forms of anti-crawling techniques. As a result, the crawlers currently available from the open-source community are no longer sufficient to crawl data in all of these forms.
Summary of the invention
The embodiments of the present application provide a data crawling method, device, computer equipment, and storage medium that can meet the differing requirements of data crawling.
An embodiment of the present application provides a data crawling method, comprising:
selecting the required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence;
configuring the required crawler according to the code block sequence;
crawling data with the configured crawler to obtain crawled data;
wherein the database contains multiple code blocks, and the database is built in advance by:
crawling data from each of a plurality of preset websites, and taking the computer code corresponding to each crawling step of the data crawling process as one code block.
In some embodiments, the method further comprises: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, configuring the required crawler according to the code block sequence comprises: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores explanatory information used to generate the configuration file.
In some embodiments, crawling data from each of the preset websites comprises: writing corresponding computer code for each of the preset websites, and crawling data from each website using the computer code written for that website.
In some embodiments, writing the corresponding computer code for each of the preset websites comprises: writing the corresponding computer code for each of the preset websites using fine-grained isolation.
In some embodiments, crawling data with the configured crawler comprises logging in to the corresponding website with the required crawler, which specifically comprises: sending, by the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address through the required crawler either periodically or when access is restricted or an access error occurs.
In some embodiments, configuring the required crawler according to the code block sequence comprises:
a1. configuring the seed;
a2. configuring the address of the seed;
a3. configuring whether the crawl is a full crawl;
a4. configuring whether JavaScript or non-JavaScript web page content is crawled;
a5. configuring the keywords required for the crawl;
a6. configuring the region of the seed;
a7. configuring the page level at which crawling starts;
a8. configuring the page-turning mode;
a9. configuring the attributes of the fields to be captured.
An embodiment of the present application further provides a data crawling device, comprising:
a sequence determining module, configured to select the required code blocks from a pre-built database according to a data crawling requirement, and to sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
a crawler configuration module, configured to configure the required crawler according to the code block sequence;
a data crawling module, configured to crawl data with the configured crawler to obtain crawled data;
a database construction module, configured to build the database in advance, the database containing multiple code blocks, wherein the database construction module is specifically configured to: crawl data from each of a plurality of preset websites, and take the computer code corresponding to each crawling step of the data crawling process as one code block.
In some embodiments, the crawler configuration module is specifically configured to: determine the configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores explanatory information used to generate the configuration file.
An embodiment of the present application further provides computer equipment comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above data crawling method.
An embodiment of the present application further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above data crawling method.
With the data crawling method, device, computer equipment, and storage medium provided by the embodiments of the present application, the required code blocks are selected from the database according to the data crawling requirement, the selected code blocks are sorted according to the execution order of their steps to obtain a code block sequence, the required crawler is configured according to the code block sequence, and data is finally crawled with the configured crawler. Because the required code blocks are selected according to the data crawling requirement and then sorted, the embodiments in effect select multiple crawling steps according to the requirement and then combine and order them. A crawler configured in this way can meet a variety of user requirements, for example downloading an entire web page or capturing precisely, and crawling JavaScript or non-JavaScript web pages. Moreover, the data crawling method provided by the embodiments is simple and easy to configure, and can crawl data of different forms from different websites.
Brief description of the drawings
Fig. 1 is a block diagram of the internal structure of the computer equipment in one embodiment;
Fig. 2 is a flowchart of the data crawling method in one embodiment;
Fig. 3 is a structural block diagram of the data crawling device in one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
It will be understood that the terms "first", "second", and the like used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
Fig. 1 is a schematic structural diagram of the computer equipment in one embodiment of the present application. As shown in Fig. 1, the computer equipment includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer equipment stores an operating system, a database, and computer-readable instructions. The database may store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a data crawling method. The processor of the computer equipment provides computing and control capabilities and supports the operation of the entire computer equipment. The memory of the computer equipment may store computer-readable instructions which, when executed by the processor, cause the processor to perform a data crawling method. The network interface of the computer equipment is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in Fig. 1 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment to which the solution is applied; specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
An embodiment of the present application provides a data crawling method, which may be executed by the computer equipment of Fig. 1. As shown in Fig. 2, the method includes the following steps:
S21. According to a data crawling requirement, select the required code blocks from a pre-built database, and sort the selected code blocks according to their execution order to obtain a corresponding code block sequence.
The database contains multiple code blocks and is built in advance by crawling data from each of a plurality of preset websites and taking the computer code corresponding to each crawling step of the data crawling process as one code block.
It will be understood that the above computer code is the code corresponding to a crawling step, and may be referred to as crawling code.
In practice, the preset websites may be, for example, a shopping website, a dating website, a news website, a database website, and so on. Websites of different types may be chosen as the preset websites so that the code blocks in the resulting database are more comprehensive and can be configured into a variety of crawlers.
It will be understood that, during database construction, the code corresponding to each crawling step is taken as one code block, and a code block may also be called a component; that is, each step corresponds to one code block, or one component. Steps include, for example, the login step, the step of entering a list, the page-turning step, and the pull-down scrolling step when crawling a web page. Saving the computer code corresponding to each step into the database as a code block is therefore equivalent to saving each step as a separate component.
In practice, crawling data from each of the preset websites may include: writing corresponding computer code for each of the preset websites, and crawling data from each website using the computer code written for that website.
That is, computer code is first written for each preset website, which yields a crawler suited to that website. Data is then crawled using the computer code (that is, the crawler) corresponding to each preset website, and the code corresponding to each step of the crawling process is saved into the database as a code block (which may also be called a component). Writing computer code for each preset website yields a crawler well suited to that website, so that each step of the data crawling process completes its crawling work effectively.
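As a minimal sketch of the database construction just described, the following Python fragment treats each site-specific crawler as an ordered list of named step functions and stores every step as its own code block. The site names, step names, step bodies, and the choice to key blocks by (site, step) are all assumptions made for illustration; the application does not specify a storage layout.

```python
from typing import Callable, Dict, List, Tuple

# A crawl step is modelled as a named function that updates a context dict.
StepFunc = Callable[[dict], dict]


def login(ctx: dict) -> dict:
    # Placeholder for a real login step; records a dummy cookie.
    return {**ctx, "cookie": "session-cookie"}


def enter_list_page(ctx: dict) -> dict:
    # Placeholder for a step that collects article URLs from a list page.
    return {**ctx, "urls": ["/article/1", "/article/2"]}


def next_page(ctx: dict) -> dict:
    # Placeholder for a page-turning step.
    return {**ctx, "page": ctx.get("page", 1) + 1}


# Hypothetical site-specific crawlers, each written as an ordered list of steps.
SITE_CRAWLERS: Dict[str, List[Tuple[str, StepFunc]]] = {
    "news-site": [("enter_list_page", enter_list_page), ("next_page", next_page)],
    "social-site": [("login", login), ("next_page", next_page)],
}


def build_database(
    site_crawlers: Dict[str, List[Tuple[str, StepFunc]]]
) -> Dict[Tuple[str, str], StepFunc]:
    """Store the code of every crawl step of every site as one code block."""
    database: Dict[Tuple[str, str], StepFunc] = {}
    for site, steps in site_crawlers.items():
        for step_name, func in steps:
            database[(site, step_name)] = func
    return database


DATABASE = build_database(SITE_CRAWLERS)
print(sorted(DATABASE))
```

Keying by site and step keeps each site's version of a step separate; a real system might instead merge equivalent steps across sites.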
Writing the corresponding computer code for each of the preset websites may include: using fine-grained isolation to write, for each of the preset websites, the corresponding computer code used for data crawling. Put simply, fine-grained isolation subdivides the objects in the business model to obtain a more reasonable object model; intuitively, it means dividing the model into many objects. Specifically, when writing the data crawling code for each preset website, computer code is written separately for each crawl object, where the crawl objects include at least one of pictures, audio, video, and text information. For example, when writing computer code for a news website, computer code is written with the pictures of the news website as the crawl object, with its audio as the crawl object, with its video as the crawl object, with its text information as the crawl object, and so on. Subdividing each website into many crawl objects makes the code blocks in the database more comprehensive, so that a variety of data crawling requirements can be met.
For example, the steps corresponding to the code blocks in a database built by the above process may include: (1) log in and record the cookie; (2) enter the list page and crawl the URLs; (3) enter the article page and crawl the article content; (4) click "next" to turn to the next page and continue; (5) enter the article page and crawl the article content; (6) pull down the scroll bar to reveal the content of the next page; (7) enter content in the search box and search.
It will be understood that data crawling requirements can be diverse, for example which website data is to be crawled from, and which kind of content (pictures, audio, video, text, and so on) on that website is to be crawled. Different data crawling requirements call for different code blocks.
It will be understood that the embodiments of the present application select from the database each code block required by the data crawling requirement. Because different code blocks correspond to different steps, the execution order of the code blocks corresponds to the execution order of the steps. The code blocks therefore need to be sorted, which is equivalent to sorting the steps according to their execution order.
For example, suppose a user wants to crawl content from Sina Weibo. From this data crawling requirement, the crawling steps may include: log in, search for a trending term, crawl the post ID, post content, publication time, and so on, and turn the page. In terms of the steps listed above, the step order is roughly (1)-(7)-(3)-(4). The code blocks corresponding to steps (1), (3), (4), and (7) are therefore selected from the database and sorted according to the execution order (1)-(7)-(3)-(4) to obtain the corresponding code block sequence.
As another example, suppose a user wants to crawl content from NetEase News. From this data crawling requirement, the crawling steps include: enter the list page and crawl the URLs, enter the article page, and scroll down to turn the page. In terms of the steps listed above, the step order is roughly (2)-(3)-(6). The code blocks corresponding to steps (2), (3), and (6) are therefore selected from the database and sorted according to the execution order (2)-(3)-(6) to obtain the corresponding code block sequence.
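The selection and ordering in step S21 can be sketched as follows. This is an illustrative Python fragment, not the application's implementation: the `CodeBlock` type, the `DATABASE` dictionary, and the `build_sequence` helper are hypothetical names, and each block holds only a description in place of real crawling code. The step numbers and descriptions are those of the list (1) to (7) above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CodeBlock:
    step_id: int       # number of the step in the list (1) to (7)
    description: str   # stands in for the stored crawling code


DATABASE: Dict[int, CodeBlock] = {
    1: CodeBlock(1, "log in and record the cookie"),
    2: CodeBlock(2, "enter the list page and crawl the URLs"),
    3: CodeBlock(3, "enter the article page and crawl the article content"),
    4: CodeBlock(4, "click next to turn to the next page"),
    5: CodeBlock(5, "enter the article page and crawl the article content"),
    6: CodeBlock(6, "pull down the scroll bar to reveal the next page"),
    7: CodeBlock(7, "enter content in the search box and search"),
}


def build_sequence(execution_order: List[int],
                   database: Dict[int, CodeBlock]) -> List[CodeBlock]:
    """Select the required blocks, then sort them by execution order.

    Assumes each step appears at most once in execution_order.
    """
    # Selection: pick the blocks the requirement needs (order not yet fixed).
    selected = [database[step_id] for step_id in sorted(set(execution_order))]
    # Sorting: arrange the selected blocks by their position in the order.
    position = {step_id: i for i, step_id in enumerate(execution_order)}
    return sorted(selected, key=lambda block: position[block.step_id])


weibo_sequence = build_sequence([1, 7, 3, 4], DATABASE)
netease_sequence = build_sequence([2, 3, 6], DATABASE)
print([block.step_id for block in weibo_sequence])
print([block.step_id for block in netease_sequence])
```

The two calls correspond to the Sina Weibo and NetEase News examples; selection and sorting are kept as separate operations to mirror the wording of S21.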
S22. Configure the required crawler according to the code block sequence.
It will be understood that configuring the required crawler is in fact the process of generating a configuration file; once the configuration file is obtained, the configuration of the required crawler is complete. Step S22 may therefore include: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document. The instruction document may store explanatory information that helps the user generate the configuration file, for example the steps for generating the configuration file and the information needed in each step.
In practice, the configuration may be expressed in Extensible Markup Language (XML); that is, the content of the configuration file may take the form of XML, which can improve the versatility of the required crawler.
For example, for the data crawling requirement above in which the user wants to crawl content from Sina Weibo, the code block sequence is the one corresponding to steps (1)-(7)-(3)-(4), and the configuration file can be generated from this code block sequence.
It will be understood that a data crawling requirement covers not only which website is crawled and which kind of content is crawled, but may also cover whether the crawl is full or incremental, whether JavaScript or non-JavaScript web page content is crawled, the page level from which content capture starts, whether the page-turning mode is pull-down sliding, what attributes the captured fields have, and so on. These items therefore also need to be configured.
In a specific implementation, configuring the crawler according to the code block sequence may include configuring at least one of: the seed, the address of the seed, the region of the seed, whether the crawl is a full crawl, the keywords required for the crawl, the page-turning mode, the attributes of the fields to be captured, the page level at which crawling starts, and whether JavaScript web page content is captured. The process may include the following steps:
a1. Configure the seed. seed is the seed; as the name suggests, the crawl spreads outward from the seed.
a2. Configure the address of the seed. url is the seed address; for example, url is configured as http://www.chinanews.com/business/gd.shtml.
a3. Configure whether the crawl is a full crawl. fully indicates whether the crawl is full: 1 means yes and 0 means no.
a4. Configure whether JavaScript or non-JavaScript web page content is crawled. javascript indicates whether the page is a JavaScript web page: 1 means yes and 0 means no.
a5. Configure the keyword. keyword is the keyword; the keyword may also be left unset in the code.
a6. Configure the seed region. seedArea is the seed region; if it is left empty, all URLs in the whole web page are taken as addresses. In the above code fragment the seed region is <![CDATA[#content_right>div.content_list]]>.
a7. Configure the page level at which capture starts. start is the page level from which content is captured, for example starting from the second-level web page.
a8. Configure the page-turning mode. turning is the page-turning mode; when turning is set to slider, the page-turning mode is pull-down sliding.
a9. Configure the attributes of the fields to be captured. meta holds the attributes of the fields to be captured, for example field (the field), site (the address), tag (the label), index (the index), and pic (the picture).
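The items a1 to a9 could be collected into an XML configuration along the following lines. This is a sketch using Python's standard `xml.etree.ElementTree` module. The element names (seed, url, fully, javascript, keyword, seedArea, start, turning, meta) are the ones named in the text, but the nesting under a `<seed>` root, the representation of meta as attributes, and the sample meta values are assumptions, since the application's own configuration fragment is not reproduced here. `build_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_config(settings: dict, meta: dict) -> str:
    """Serialize crawler settings to an XML string.

    Element names follow items a1 to a9; the nesting under <seed> is assumed.
    """
    seed = ET.Element("seed")
    for name in ("url", "fully", "javascript", "keyword",
                 "seedArea", "start", "turning"):
        ET.SubElement(seed, name).text = str(settings.get(name, ""))
    # Attributes of the fields to capture (a9); attribute values must be strings.
    ET.SubElement(seed, "meta", attrib={k: str(v) for k, v in meta.items()})
    return ET.tostring(seed, encoding="unicode")


config_xml = build_config(
    {
        "url": "http://www.chinanews.com/business/gd.shtml",
        "fully": 1,            # 1 = full crawl, 0 = not a full crawl
        "javascript": 0,       # 1 = JavaScript page, 0 = non-JavaScript page
        "keyword": "",         # the keyword may be left unset
        "seedArea": "#content_right>div.content_list",
        "start": 2,            # start capturing from the second-level page
        "turning": "slider",   # pull-down sliding
    },
    meta={"field": "title", "tag": "h1", "index": 0},  # hypothetical values
)
print(config_xml)
```

ElementTree escapes the `>` in the seed region as `&gt;` rather than emitting a CDATA section; the two forms carry the same text once parsed.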
As the above code fragment shows, either JavaScript or non-JavaScript web pages can be selected; that is, both JavaScript web page capture and non-JavaScript page capture can be implemented. When a JavaScript web page is selected, the JavaScript code can be interpreted accurately and then converted into ordinary tagged HTML code. It will be understood that a JavaScript web page is a dynamically generated page, while a non-JavaScript web page is a statically generated page.
In the embodiments of the present application, different code blocks can be combined and ordered according to the data crawling requirement (that is, the steps can be combined and configured in any way), and the crawler is configured according to the resulting code block sequence. The configured crawler can therefore download complete pages or capture precisely, for example capturing only pictures. Through the settings of the data crawling requirement, distributed crawling on a cluster can also be implemented to improve crawling speed.
It can be seen that, whatever the data crawling requirement, the required crawler can be configured in the above manner.
In practice, the configuration file may also be uploaded to a server and stored, so that it can later be obtained directly for the same data crawling requirement; that is, the configuration file is obtained from the server and data is crawled according to it, which is more convenient.
S23. Crawl data with the configured crawler to obtain crawled data.
When crawling data, the crawler may encounter a website's anti-crawling mechanism. An anti-crawling mechanism here means that, when one proxy IP address accesses a website frequently, the website restricts access from that proxy IP address. This problem can be mitigated in either of the following two ways:
(1) The crawler sends a login request to the server of the website to be logged in to. The login request carries the proxy address (that is, the proxy IP address) used to log in to the server of the website, and the proxy address is modified periodically, which avoids the restriction that results from accessing the website frequently with the same proxy address. For example, the crawler modifies the proxy address every half hour and stores the modified proxy address; when the website needs to be accessed, the modified proxy address is retrieved.
(2) The crawler sends a login request to the server of the website to be logged in to. The login request carries the proxy address (that is, the proxy IP address) used to log in to the server of the website, and the crawler modifies the proxy address when access is restricted or an access error occurs. When the server finds that a proxy address is accessing its website frequently, it intercepts the request and returns an access-restricted or access-error message to the sender of the login request, that is, the crawler. On receiving this message, the crawler modifies the proxy address and sends the login request again, this time carrying the modified proxy address. Once the proxy address has been modified, the server of the website no longer intercepts the request. For example, when the crawler receives an access-restricted or access-error response after sending a login request to the server of the website, it modifies the proxy address in the login request and sends a login request carrying the modified proxy address, and can then log in to the website successfully.
In either case, the proxy address can be modified as needed. For example, if the proxy address used last time was 192.168.1.1, the proxy address used next time can be changed to 192.168.2.1.
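The two ways of modifying the proxy address can be sketched together in Python. The `ProxyRotator` class, the `login_with_retry` helper, and the `send_login` callable are hypothetical names introduced for illustration; no real network request is made, and a real crawler would pass the chosen proxy to its HTTP client. The half-hour interval and the two addresses are taken from the examples above.

```python
import itertools
import time
from typing import Callable, Iterable


class ProxyRotator:
    """Hands out a proxy address and changes it periodically or on demand."""

    def __init__(self, proxies: Iterable[str],
                 interval_seconds: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pool = itertools.cycle(list(proxies))  # assumes a non-empty pool
        self._interval = interval_seconds
        self._clock = clock
        self._current = next(self._pool)
        self._changed_at = clock()

    def current(self) -> str:
        # Way (1): change the proxy once the interval has elapsed.
        if self._clock() - self._changed_at >= self._interval:
            self.rotate()
        return self._current

    def rotate(self) -> str:
        # Way (2) calls this directly when access is restricted or fails.
        self._current = next(self._pool)
        self._changed_at = self._clock()
        return self._current


def login_with_retry(send_login: Callable[[str], bool],
                     rotator: ProxyRotator, max_attempts: int = 3) -> str:
    """Send the login request, changing the proxy after each rejected attempt.

    send_login takes a proxy address and returns True when login succeeds.
    """
    for _ in range(max_attempts):
        proxy = rotator.current()
        if send_login(proxy):
            return proxy
        rotator.rotate()
    raise RuntimeError("login failed with every proxy address tried")
```

For example, with `rotator = ProxyRotator(["192.168.1.1", "192.168.2.1"])`, a call to `login_with_retry(send_login, rotator)` first tries 192.168.1.1 and moves on to 192.168.2.1 if the server rejects the request. The injectable `clock` parameter is a design choice that lets the periodic behaviour be exercised without waiting half an hour.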
In practice, the data obtained from crawling may contain duplicate pages and/or advertisements. In that case, the crawled data can be deduplicated and filtered using a locality-sensitive hashing algorithm.
Here the locality-sensitive hashing algorithm is the simhash algorithm, whose principle is roughly as follows. The crawled text is first preprocessed, for example by removing stop words (numerals, quantifiers, function words, and other words that carry little meaning), stemming, and segmentation (chunking), which yields multiple vectors. Each vector is converted by a hash algorithm into a hash code of length f, and each 1-0 bit of the hash code is then converted into a positive or negative weight: for example, when bit f1 is 1 the weight is set to +weight, and when bit f1 is 0 the weight is set to -weight, so that each vector corresponds to an f-dimensional weight vector. The weight vectors of all vectors are added position by position to give an f-dimensional weight array. Positions with a positive value are set to 1 and positions with a negative value are set to 0, which converts the text into a new f-bit 1-0 array, that is, a new hash code that serves as the hash fingerprint. The hash fingerprint is then used for deduplication and filtering to remove large numbers of duplicate pages, advertisements, and so on.
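The simhash procedure described above can be sketched in Python as follows. Several choices here are assumptions rather than part of the application: preprocessing is reduced to lower-casing and whitespace tokenization, each token's weight is its frequency in the text, MD5 supplies the per-token hash, f defaults to 64, and two pages are treated as duplicates when their fingerprints differ in at most 3 bits.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def simhash(tokens: Iterable[str], f: int = 64) -> int:
    """Return an f-bit simhash fingerprint of the tokens (f <= 128)."""
    if not 0 < f <= 128:
        raise ValueError("f must be between 1 and 128 when MD5 is used")
    weights = [0] * f
    for token, weight in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        code = int.from_bytes(digest, "big")
        for i in range(f):
            # A 1 bit contributes +weight, a 0 bit contributes -weight.
            weights[i] += weight if (code >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(weights):
        if total > 0:  # positive sums become 1; others (including 0) become 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(pages: List[str], max_distance: int = 3) -> List[str]:
    """Keep a page only if no already kept page has a near-identical fingerprint."""
    kept: List[str] = []
    fingerprints: List[int] = []
    for text in pages:
        fp = simhash(text.lower().split())
        if all(hamming_distance(fp, seen) > max_distance for seen in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```

Comparing every new fingerprint against all kept ones is quadratic in the number of pages; it is used here only to keep the sketch short.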
With the data crawling method provided by the embodiments of the present application, the required code blocks are selected from the database according to the data crawling requirement, the selected code blocks are sorted according to the execution order of their steps to obtain a code block sequence, the required crawler is configured according to the code block sequence, and data is finally crawled with the configured crawler. Because the required code blocks are selected according to the data crawling requirement and then sorted, the embodiments in effect select multiple crawling steps according to the requirement and then combine and order them. A crawler configured in this way can meet a variety of user requirements, for example downloading an entire web page or capturing precisely, and capturing JavaScript or non-JavaScript web pages. Moreover, the data crawling method provided by the embodiments is simple and easy to configure, and can crawl data of different forms from different websites.
As shown in Fig. 3, in one embodiment, a data crawling device 30 is provided. The device 30 can be integrated into the above computer equipment and may specifically include:
A sequence determining module 32, configured to select the required code blocks from a pre-built database according to the data crawling requirement, and to sort the selected code blocks according to their execution order to obtain the corresponding code block sequence;
A crawler configuration module 33, configured to configure the required crawler according to the code block sequence;
A data crawling module 34, configured to crawl data using the configured required crawler to obtain crawled data;
A database building module 31, configured to build the database in advance, the database containing multiple code blocks. The database building module is specifically configured to: crawl data from each of multiple preset websites, and take the computer code corresponding to each crawling step in the data crawling process as one code block.
In some embodiments, the device further includes a deduplication filtering module, configured to deduplicate and filter the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, the crawler configuration module is specifically configured to: determine the configuration file of the required crawler according to the code block sequence and a preset description document, where the description document stores the description information used to generate the configuration file.
In some embodiments, crawling data from each of the multiple preset websites in the database building module includes: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
In some embodiments, writing the corresponding computer code for each of the multiple preset websites in the database building module includes: writing the corresponding computer code for each of the multiple preset websites in a fine-grained isolation manner.
In some embodiments, crawling data in the data crawling module using the configured required crawler includes logging in to the corresponding website using the required crawler, which specifically includes: sending, by the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address by the required crawler either periodically or when restricted access or an access error is encountered.
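The proxy handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the crawler's actual login code: `ProxyRotator`, `login`, and the `send` callback are hypothetical names, the proxy pool is assumed to be non-empty, and any response other than HTTP 200 is treated as restricted access. The `send` callback stands in for the real login request so that the sketch makes no network calls.

```python
import itertools
import time
from typing import Callable, Iterable


class ProxyRotator:
    """Holds the current proxy address and changes it periodically or on demand."""

    def __init__(self, proxies: Iterable[str], interval_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._pool = itertools.cycle(list(proxies))  # assumes a non-empty pool
        self._interval_s = interval_s
        self._clock = clock
        self._current = next(self._pool)
        self._changed_at = clock()

    def rotate(self) -> str:
        """Switch to the next proxy address immediately."""
        self._current = next(self._pool)
        self._changed_at = self._clock()
        return self._current

    def current(self) -> str:
        """Return the proxy to use, rotating first if the interval has elapsed."""
        if self._clock() - self._changed_at >= self._interval_s:
            self.rotate()
        return self._current


def login(send: Callable[[str], int], rotator: ProxyRotator,
          max_attempts: int = 3) -> bool:
    """Send the login request through the current proxy.

    send(proxy) performs the request and returns an HTTP status code.
    The proxy is changed after an access error or a non-200 response.
    """
    for _ in range(max_attempts):
        try:
            status = send(rotator.current())
        except OSError:
            status = None  # access error
        if status == 200:
            return True
        rotator.rotate()
    return False
```

Injecting the clock and the `send` callback keeps both rotation triggers, the timer and the failure path, easy to exercise without a real server.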
In some embodiments, the crawler configuration module is specifically configured to: a1, configure the seed; a2, configure the address of the seed; a3, configure whether the crawl is a full crawl; a4, configure whether javascript web page content or non-javascript web page content is crawled; a5, configure the keywords required for crawling; a6, configure the region where the seed is located; a7, configure the level at which web page capture starts; a8, configure the page-turning mode; a9, configure the attributes of the fields to be captured.
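One way to picture items a1 to a9 is as the fields of a single configuration record. The Python sketch below is only an assumed layout: the field names, the JSON serialization, and all example values (including the URL and the selector syntax) are hypothetical and do not come from the application.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldSpec:
    name: str            # name of a field to be captured
    selector: str        # hypothetical locator for the field in the page
    kind: str = "text"   # hypothetical attribute of the field


@dataclass
class CrawlerConfig:
    seed: str                  # a1: the seed
    seed_url: str              # a2: the address of the seed
    full_crawl: bool           # a3: full crawl or not
    render_javascript: bool    # a4: javascript or non-javascript page content
    keywords: list[str]        # a5: keywords required for crawling
    seed_region: str           # a6: region where the seed is located
    start_level: int           # a7: level at which page capture starts
    pagination: str            # a8: page-turning mode
    fields: list[FieldSpec] = field(default_factory=list)  # a9: field attributes

    def to_json(self) -> str:
        """Serialize the configuration, for example to write a configuration file."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


config = CrawlerConfig(
    seed="example-news",
    seed_url="https://example.com/news",
    full_crawl=False,
    render_javascript=True,
    keywords=["crawler"],
    seed_region="CN",
    start_level=1,
    pagination="next-link",
    fields=[FieldSpec(name="title", selector="h1")],
)
```

Calling `config.to_json()` yields a document with one key per configuration item, which is one possible shape for the crawler's configuration file.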
In the data crawling device provided by the embodiments of the present application, the sequence determining module selects the required code blocks from the database according to the data crawling requirement and sorts the selected code blocks according to the step execution order to obtain a code block sequence; the crawler configuration module then configures the required crawler according to the code block sequence; and finally the data crawling module crawls data using the configured crawler. Because the embodiments can select the required code blocks according to the data crawling requirement and then sort the selected code blocks, this is equivalent to selecting multiple crawling steps according to the data crawling requirement and then combining those steps in order. A crawler configured in this way can meet a variety of user requirements, for example whether to download the entire web page or to capture precisely, and whether to crawl javascript web pages or non-javascript web pages. Moreover, the data crawling method provided by the embodiments is simple and easy to configure, and can crawl different websites and data in various forms.
In some embodiments, a computer equipment is proposed. The computer equipment includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the following steps: selecting the required code blocks from a pre-built database according to the data crawling requirement; sorting the selected code blocks according to their execution order to obtain the corresponding code block sequence; configuring the required crawler according to the code block sequence; and crawling data using the configured required crawler to obtain crawled data. The database contains multiple code blocks, and the process of building the database in advance includes: crawling data from each of multiple preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
In some embodiments, the processor further implements the following step when executing the computer program: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, the configuring of the required crawler according to the code block sequence, as executed by the processor, includes: determining the configuration file of the required crawler according to the code block sequence and a preset description document, where the description document stores the description information used to generate the configuration file.
In some embodiments, the crawling of data from each of the multiple preset websites, as executed by the processor, includes: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
In some embodiments, the writing of the corresponding computer code for each of the multiple preset websites, as executed by the processor, includes: writing the corresponding computer code for each of the multiple preset websites in a fine-grained isolation manner.
In some embodiments, the crawling of data using the configured required crawler, as executed by the processor, includes logging in to the corresponding website using the required crawler, which specifically includes: sending, by the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address by the required crawler either periodically or when restricted access or an access error is encountered.
In some embodiments, what the processor executed is described according to the code block sequence, carries out to required crawler
Configuration includes: a1, configures to seed;A2, the address of the seed is configured;A3, to whether be full dose grab into
Row configuration;It a4, is that non-javascript web page contents configure to javascript web page contents are crawled also;A5, to crawling
Required keyword is configured;A6, the region of the seed is configured;A7, start grab webpage series into
Row configuration;A8, page turning mode is configured;A9, the attribute for the field that needs grab is configured.
The beneficial effect of computer equipment provided by the present application is identical as above-mentioned data crawling method and device, here no longer
It repeats.
In one embodiment, a storage medium storing computer-readable instructions is proposed. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps: selecting the required code blocks from a pre-built database according to the data crawling requirement; sorting the selected code blocks according to their execution order to obtain the corresponding code block sequence; configuring the required crawler according to the code block sequence; and crawling data using the configured required crawler to obtain crawled data. The database contains multiple code blocks, and the process of building the database in advance includes: crawling data from each of multiple preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
In some embodiments, the one or more processors further implement the following step when executing the computer program: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, the configuring of the required crawler according to the code block sequence, as executed by the one or more processors, includes: determining the configuration file of the required crawler according to the code block sequence and a preset description document, where the description document stores the description information used to generate the configuration file.
In some embodiments, the crawling of data from each of the multiple preset websites, as executed by the one or more processors, includes: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
In some embodiments, the writing of the corresponding computer code for each of the multiple preset websites, as executed by the one or more processors, includes: writing the corresponding computer code for each of the multiple preset websites in a fine-grained isolation manner.
In some embodiments, the crawling of data using the configured required crawler, as executed by the one or more processors, includes logging in to the corresponding website using the required crawler, which specifically includes: sending, by the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address by the required crawler either periodically or when restricted access or an access error is encountered.
In some embodiments, the configuring of the required crawler according to the code block sequence, as executed by the one or more processors, includes: a1, configuring the seed; a2, configuring the address of the seed; a3, configuring whether the crawl is a full crawl; a4, configuring whether javascript web page content or non-javascript web page content is crawled; a5, configuring the keywords required for crawling; a6, configuring the region where the seed is located; a7, configuring the level at which web page capture starts; a8, configuring the page-turning mode; a9, configuring the attributes of the fields to be captured.
The beneficial effects of the storage medium provided by the present application are the same as those of the data crawling method and device, and are not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (Read-Only Memory, ROM), or may be a random access memory (Random Access Memory, RAM) or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A data crawling method, characterized by comprising:
selecting the required code blocks from a pre-built database according to the data crawling requirement; and sorting the selected code blocks according to their execution order to obtain the corresponding code block sequence;
configuring the required crawler according to the code block sequence;
crawling data using the configured required crawler to obtain crawled data;
wherein the database contains multiple code blocks, and the process of building the database in advance comprises:
crawling data from each of multiple preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
2. The method according to claim 1, characterized by further comprising: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
3. The method according to claim 1, characterized in that the configuring of the required crawler according to the code block sequence comprises: determining the configuration file of the required crawler according to the code block sequence and a preset description document; wherein the description document stores the description information used to generate the configuration file.
4. The method according to claim 1, characterized in that the crawling of data from each of the multiple preset websites comprises: writing corresponding computer code for each of the multiple preset websites, and crawling data from each website using the computer code corresponding to that website.
5. The method according to claim 4, characterized in that the writing of the corresponding computer code for each of the multiple preset websites comprises: writing the corresponding computer code for each of the multiple preset websites in a fine-grained isolation manner.
6. The method according to any one of claims 1 to 5, characterized in that the crawling of data using the configured required crawler comprises logging in to the corresponding website using the required crawler, which specifically comprises: sending, by the required crawler, a login request carrying a proxy address to the server of the corresponding website; and modifying the proxy address by the required crawler either periodically or when restricted access or an access error is encountered.
7. The method according to any one of claims 1 to 5, characterized in that the configuring of the required crawler according to the code block sequence comprises:
a1, configuring the seed;
a2, configuring the address of the seed;
a3, configuring whether the crawl is a full crawl;
a4, configuring whether javascript web page content or non-javascript web page content is crawled;
a5, configuring the keywords required for crawling;
a6, configuring the region where the seed is located;
a7, configuring the level at which web page capture starts;
a8, configuring the page-turning mode;
a9, configuring the attributes of the fields to be captured.
8. A data crawling device, characterized in that the device comprises:
a sequence determining module, configured to select the required code blocks from a pre-built database according to the data crawling requirement, and to sort the selected code blocks according to their execution order to obtain the corresponding code block sequence;
a crawler configuration module, configured to configure the required crawler according to the code block sequence;
a data crawling module, configured to crawl data using the configured required crawler to obtain crawled data;
a database building module, configured to build the database in advance, the database containing multiple code blocks, the database building module being specifically configured to: crawl data from each of multiple preset websites, and take the computer code corresponding to each crawling step in the data crawling process as one code block.
9. A computer equipment, characterized by comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the data crawling method according to any one of claims 1 to 7.
10. A storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to execute the steps of the data crawling method according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910319429.XA CN110209909A (en) | 2019-04-19 | 2019-04-19 | Data crawling method, device, computer equipment and storage medium |
PCT/CN2019/118419 WO2020211367A1 (en) | 2019-04-19 | 2019-11-14 | Data crawling method and apparatus, computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910319429.XA CN110209909A (en) | 2019-04-19 | 2019-04-19 | Data crawling method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110209909A (en) | 2019-09-06 |
Family
ID=67786028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910319429.XA Pending CN110209909A (en) | 2019-04-19 | 2019-04-19 | Data crawling method, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110209909A (en) |
WO (1) | WO2020211367A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111597421A (en) * | 2020-04-30 | 2020-08-28 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
WO2020211367A1 (en) * | 2019-04-19 | 2020-10-22 | 平安科技(深圳)有限公司 | Data crawling method and apparatus, computer device and storage medium |
CN112541104A (en) * | 2019-09-20 | 2021-03-23 | 浙江大搜车软件技术有限公司 | Data capturing method and device |
CN112732996A (en) * | 2021-01-11 | 2021-04-30 | 深圳市洪堡智慧餐饮科技有限公司 | Multi-platform distributed data crawling method based on asynchronous aiohttp |
CN113542223A (en) * | 2021-06-16 | 2021-10-22 | 杭州拼便宜网络科技有限公司 | Equipment fingerprint-based crawler-resisting method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104766014A (en) * | 2015-04-30 | 2015-07-08 | 安一恒通(北京)科技有限公司 | Method and system used for detecting malicious website |
CN107729508A (en) * | 2017-10-23 | 2018-02-23 | 北京京东金融科技控股有限公司 | Information crawler method and apparatus |
CN108153880A (en) * | 2017-12-26 | 2018-06-12 | 北京非斗数据科技发展有限公司 | A kind of more tactful self-adapting crawling technologies about network picture |
CN109063144A (en) * | 2018-08-07 | 2018-12-21 | 广州金猫信息技术服务有限公司 | Visual network crawler method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567513B (en) * | 2011-12-27 | 2014-09-17 | 北京神州绿盟信息安全科技股份有限公司 | Method and equipment for collecting phishing websites |
US9792629B2 (en) * | 2013-08-05 | 2017-10-17 | Yahoo Holdings, Inc. | Keyword recommendation |
CN110209909A (en) * | 2019-04-19 | 2019-09-06 | 平安科技(深圳)有限公司 | Data crawling method, device, computer equipment and storage medium |
CN110189189A (en) * | 2019-04-19 | 2019-08-30 | 平安科技(深圳)有限公司 | One-stop shopping at network bootstrap technique, device, computer equipment and storage medium |
-
2019
- 2019-04-19 CN CN201910319429.XA patent/CN110209909A/en active Pending
- 2019-11-14 WO PCT/CN2019/118419 patent/WO2020211367A1/en active Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020211367A1 (en) * | 2019-04-19 | 2020-10-22 | 平安科技(深圳)有限公司 | Data crawling method and apparatus, computer device and storage medium |
CN112541104A (en) * | 2019-09-20 | 2021-03-23 | 浙江大搜车软件技术有限公司 | Data capturing method and device |
CN111597421A (en) * | 2020-04-30 | 2020-08-28 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN111597421B (en) * | 2020-04-30 | 2022-08-30 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN112732996A (en) * | 2021-01-11 | 2021-04-30 | 深圳市洪堡智慧餐饮科技有限公司 | Multi-platform distributed data crawling method based on asynchronous aiohttp |
CN113542223A (en) * | 2021-06-16 | 2021-10-22 | 杭州拼便宜网络科技有限公司 | Equipment fingerprint-based crawler-resisting method |
Also Published As
Publication number | Publication date |
---|---|
WO2020211367A1 (en) | 2020-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110209909A (en) | Data crawling method, device, computer equipment and storage medium | |
Khalil et al. | RCrawler: An R package for parallel web crawling and scraping | |
US10572863B2 (en) | Systems and methods for managing allocation of machine data storage | |
CN102985921B (en) | There is the client terminal device high speed caching electronic document resources of e-sourcing data base | |
CN110189189A (en) | One-stop shopping at network bootstrap technique, device, computer equipment and storage medium | |
CN104424199A (en) | Search method and device | |
CN107807937B (en) | Website SEO processing method, device and system | |
CN106575298A (en) | Fast rendering of websites containing dynamic content and stale content | |
CN108932332A (en) | The loading method and device of static resource | |
CN106126693A (en) | The sending method of the related data of a kind of webpage and device | |
CN107688568A (en) | Acquisition method and device based on web page access behavior record | |
CN106407371A (en) | User comment data displaying method and system, server and client | |
CN103699674A (en) | Webpage storing method, webpage opening method, webpage storing device, webpage opening device and webpage browsing system | |
CN105283843B (en) | Embeddable media content search widget | |
US9398068B2 (en) | Bulk uploading of multiple self-referencing objects | |
CN106201562A (en) | A kind of page switching method and device | |
CN102591916A (en) | Webpage opening method and website system | |
CN103455547B (en) | A kind of method and device for webpage loading | |
CN108334619A (en) | A kind of collecting method, device, computing device and storage medium | |
CN106886547A (en) | A kind of scenario generation method and device | |
Chang | A survey of modern crawler methods | |
US9824151B2 (en) | Providing a portion of requested data based upon historical user interaction with the data | |
CN110020273A (en) | For generating the method, apparatus and system of thermodynamic chart | |
CN107147645A (en) | The acquisition methods and device of network security data | |
US9846605B2 (en) | Server-side minimal download and error failover |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190906 |