CN106844553A - Data detection and expansion method and device based on sample data
- Publication number
- CN106844553A, CN201611264829.8A
- Authority
- CN
- China
- Prior art keywords
- data
- matched
- sample
- sample data
- matched rule
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A data detection and expansion method and device based on sample data. The method comprises the following steps: determining the sample data based on at least one piece of data in a database, wherein the database stores multiple pieces of data detected from massive data; searching the massive data based on the sample data to obtain matching data in the massive data that matches the sample data; processing the matching data to obtain matching rules and updating a fingerprint library, wherein the fingerprint library stores matching rules obtained historically; and performing matching extraction on the massive data based on the updated fingerprint library to obtain the data in the massive data that matches the matching rules in the updated fingerprint library, and adding the matched data to the database. The technical solution provided by the present invention enables global, systematic analysis and processing of massive data more accurately and efficiently.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a data detection and expansion method and device based on sample data.
Background art
With the rapid development of Internet technology, the number of Internet websites and Internet users in China has risen rapidly. As the number of users soars and network resources grow richer, the access log data generated on the Internet has also expanded rapidly into massive data, so that detecting, finding, and expanding the required data from massive data has become a top priority of current information processing work.
At present, methods for finding and expanding the required data from massive data fall mainly into two categories. The first is manual inspection: the uniform resource locators (URLs) accessed by users of websites or application programs (apps for short, for example application software installed on a mobile phone) on the Internet are analyzed and summarized manually to obtain a series of matching rules, and these rules are then matched against the massive data resources of the Internet to extract the data to be expanded. The second is application programming interface (API) querying: following the documentation supplied by the API provider, the provider's interface is called as needed to obtain the required data.
Although both methods can, to some extent, satisfy a user's wish to find and expand data of a specific type from massive data, each has unavoidable drawbacks. Manual inspection requires a great deal of manpower for manual analysis and statistics in practice, so detection and expansion are inefficient; the API approach depends on the documentation supplied by the API provider and is therefore subject to uncertainty.
Moreover, existing data finding and expansion methods, including the two above, ultimately obtain only data about certain specific websites. Because the number of websites on the Internet has grown enormously, and many websites and apps follow no unified standard or rule when constructing URLs, the data obtained by existing methods is only a small fraction of the massive data. This hinders global, systematic analysis and processing of the massive data and reduces the accuracy of the data the user detects and expands.
Summary of the invention
The technical problem solved by the present invention is that the prior art cannot perform global, systematic analysis and processing of massive data in a more accurate and efficient manner.
To solve the above technical problem, the embodiments of the present invention provide a data detection and expansion method based on sample data, comprising the following steps: determining the sample data based on at least one piece of data in a database, wherein the database stores multiple pieces of data detected from massive data; searching the massive data based on the sample data to obtain matching data in the massive data that matches the sample data; processing the matching data to obtain matching rules and updating a fingerprint library, wherein the fingerprint library stores matching rules obtained historically; and performing matching extraction on the massive data based on the updated fingerprint library to obtain the data in the massive data that matches the matching rules in the updated fingerprint library, and adding the matched data to the database.
Optionally, determining the sample data based on at least one piece of data in the database comprises: selecting a preset number of pieces of data from the database, and using the feature information of the preset number of pieces of data as the sample data.
Optionally, the feature information comprises: a feature identification code of the preset number of pieces of data; or a regular expression determined from the preset number of pieces of data.
Optionally, searching the massive data based on the sample data to obtain the matching data in the massive data that matches the sample data comprises: searching the massive data for data having the same feature information as the sample data, and using the data having the same feature information as the matching data.
Optionally, when the massive data is searched based on the sample data, if a preset restriction condition exists, the search is performed within the portion of the massive data delimited by the preset restriction condition to obtain the matching data.
Optionally, processing the matching data to obtain matching rules and updating the fingerprint library comprises: structuring the matching data to obtain standard data arranged in a preset format; generating the matching rules from the standard data and de-duplicating them; and updating the fingerprint library based on the de-duplicated matching rules.
Optionally, generating the matching rules from the standard data and de-duplicating them comprises: converting the standard data into the matching rules according to the preset format; and removing duplicates from the converted matching rules to obtain the de-duplicated matching rules.
Optionally, updating the fingerprint library based on the de-duplicated matching rules comprises: comparing the de-duplicated matching rules with the matching rules in the fingerprint library to remove duplicates a second time; and adding the matching rules that remain after the second de-duplication to the fingerprint library.
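The structuring, rule generation, and two-stage de-duplication described above can be sketched as follows. This is only an illustrative sketch: the source does not specify the preset format or the shape of a matching rule, so the (host, path) tuple used as a rule, the function names, and the example.com URLs are all assumptions.

```python
from urllib.parse import urlsplit


def to_standard(url):
    """Structure one matched URL into an assumed (host, path, query) record."""
    parts = urlsplit(url)
    return (parts.netloc, parts.path, parts.query)


def generate_rules(matched_urls):
    """Convert records to hypothetical (host, path) rules, then de-duplicate.

    dict.fromkeys keeps the first occurrence of each rule in order.
    """
    rules = [to_standard(u)[:2] for u in matched_urls]
    return list(dict.fromkeys(rules))


def update_fingerprint_library(library, new_rules):
    """Second de-duplication: add only rules the library does not hold yet."""
    added = [r for r in new_rules if r not in library]
    library.update(added)
    return added


matched = [
    "http://shop.example.com/item/1?x=1",
    "http://shop.example.com/item/1?x=2",
    "http://a.example.com/p",
]
library = {("a.example.com", "/p")}  # rules obtained in earlier runs
rules = generate_rules(matched)
added = update_fingerprint_library(library, rules)
```

Here the first pass removes the rule that two matched URLs share, and the second pass drops the rule the library already contains, so only one new rule is added.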
Optionally, the data is Internet access records.
The embodiments of the present invention also provide a data detection and expansion device based on sample data, comprising: a determining module for determining the sample data based on at least one piece of data in a database, wherein the database stores multiple pieces of data detected from massive data; a search module for searching the massive data based on the sample data to obtain matching data in the massive data that matches the sample data; an update module for processing the matching data to obtain matching rules and updating a fingerprint library, wherein the fingerprint library stores matching rules obtained historically; and an extraction module for performing matching extraction on the massive data based on the updated fingerprint library to obtain the data in the massive data that matches the matching rules in the updated fingerprint library, and adding the matched data to the database.
Optionally, the determining module comprises: a selection submodule for selecting a preset number of pieces of data from the database and using the feature information of the preset number of pieces of data as the sample data.
Optionally, the feature information comprises: a feature identification code of the preset number of pieces of data; or a regular expression determined from the preset number of pieces of data.
Optionally, the search module comprises: a first search submodule for searching the massive data for data having the same feature information as the sample data, and using the data having the same feature information as the matching data.
Optionally, the search module further comprises a second search submodule which, when the massive data is searched based on the sample data and a preset restriction condition exists, searches within the portion of the massive data delimited by the preset restriction condition to obtain the matching data.
Optionally, the update module comprises: a processing submodule for structuring the matching data to obtain standard data arranged in a preset format; a generation submodule for generating the matching rules from the standard data and de-duplicating them; and an update submodule for updating the fingerprint library based on the de-duplicated matching rules.
Optionally, the generation submodule comprises: a conversion unit for converting the standard data into the matching rules according to the preset format; and a de-duplication unit for removing duplicates from the converted matching rules to obtain the de-duplicated matching rules.
Optionally, the update submodule comprises: a comparison unit for comparing the de-duplicated matching rules with the matching rules in the fingerprint library to remove duplicates a second time; and an update unit for adding the matching rules that remain after the second de-duplication to the fingerprint library.
Optionally, the data is Internet access records.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
First, sample data is determined from at least one piece of data in the database, and the massive data is searched based on the sample data to detect the matching data in the massive data that matches the sample data. The matching data is then processed to obtain matching rules, with which the fingerprint library is updated. Finally, matching extraction is performed on the massive data based on the updated fingerprint library to obtain the data in the massive data that matches the matching rules in the updated fingerprint library, and the matched data is added to the database, thereby achieving data detection and expansion based on sample data. Compared with existing data finding and expansion schemes, which rely mainly on manual work or API queries, the technical solution of the embodiments of the present invention generates matching rules from sample data, performs matching extraction on the original data source (i.e. the massive data) according to the matching rules to expand the database, and then determines sample data again from the expanded database and repeats the preceding steps, ultimately forming a closed-loop cycle. The technical solution provided by the present invention enables global, systematic analysis and processing of massive data more accurately and efficiently.
Further, a preset number of pieces of data are selected from the database, and the feature information of that data is used as the sample data, which serves as a template for detection in the massive data; the data matching the sample data is then used to expand the database. This ensures that the data stored in the database all have the same feature information, satisfying the user's need to find and collect data of a specific type from massive data.
Brief description of the drawings
Fig. 1 is a flowchart of a data detection and expansion method based on sample data according to the first embodiment of the present invention;
Fig. 2 is a flowchart of a data detection and expansion method based on sample data according to the second embodiment of the present invention;
Fig. 3 is a flowchart of a data detection and expansion method based on sample data according to the third embodiment of the present invention;
Fig. 4 is a schematic diagram of a character matching tree built using the data detection and expansion method based on sample data of the embodiments of the present invention;
Fig. 5 is a schematic structural diagram of a data detection and expansion device based on sample data according to the fourth embodiment of the present invention.
Detailed description of the embodiments
As stated in the background, existing methods of finding and expanding the data a user needs from massive data are still limited to manual retrieval or API querying. The former consumes a great deal of manpower for manual analysis and statistics; the latter cannot accommodate global analysis and processing of the data.
To solve this technical problem, the technical solution of the present invention first determines sample data from at least one piece of data in the database and searches the massive data based on the sample data to detect the matching data in the massive data that matches the sample data. The matching data is then processed to obtain matching rules, with which the fingerprint library is updated. Finally, matching extraction is performed on the massive data based on the updated fingerprint library to obtain the data in the massive data that matches the matching rules in the updated fingerprint library, and the matched data is added to the database, thereby achieving data detection and expansion based on sample data.
Those skilled in the art will appreciate that, with the explosive growth in the number of Internet users, the large increase in Internet websites, and the rapid rise in Internet bandwidth, more and more users generate more and more Internet user behavior (i.e. Internet access records) on more and more websites. These behaviors are recorded by various data collectors in the form of logs and stored as data (i.e. the massive data). The technical solution of the embodiments of the present invention generates matching rules from sample data, performs matching extraction on the original data source (i.e. the massive data) according to the matching rules to expand the database, and then determines sample data again from the expanded database and repeats the preceding steps, ultimately forming a closed-loop cycle. The technical solution provided by the present invention enables global, systematic analysis and processing of massive data more accurately and efficiently.
To make the above objects, features, and beneficial effects of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a data detection and expansion method based on sample data according to the first embodiment of the present invention. The data may be Internet access records.
Specifically, in this embodiment, step S101 is performed first: the sample data is determined based on at least one piece of data in the database, where the database stores multiple pieces of data detected from massive data. More specifically, the massive data may be data obtained historically from the Internet, for example the historical Internet access records of all users, or the Internet access records of selected users within a selected period. In a preferred example, the quantity of sample data can be set individually according to the data-processing capability of the hardware or software executing the embodiment of the present invention; for example, the quantity of sample data may typically be between 10,000 and 100,000. Preferably, the data may be represented in the form of a uniform resource locator (URL); alternatively, the data may be represented in forms such as the referrer of the URL (Refer of URL), the user agent, or a cookie. Those skilled in the art may also derive further embodiments according to actual needs, which are not described here.
Step S102 is then performed: the massive data is searched based on the sample data to obtain the matching data in the massive data that matches the sample data. Specifically, matching may mean that the matching data follows the same pattern as the sample data. Preferably, this step may be carried out simultaneously or successively on at least one device cluster, where a device cluster may be formed by coupling one or more computers. In a preferred example, the massive data may be distributed across computers organized into multiple clusters for processing, and the matching data found by the computers in each cluster is then aggregated; for example, the distributed processing and aggregation of the massive data may be implemented by a MapReduce task based on the distributed system infrastructure (Hadoop Distributed File System).
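The split-then-aggregate pattern described above can be illustrated in a single process. The sketch below is not a Hadoop job; it only mimics the map and reduce phases with plain Python, and the partitioning scheme, function names, and sample records are assumptions made for illustration.

```python
from functools import reduce


def partition(records, n):
    """Spread the records over n hypothetical workers in round-robin order."""
    return [records[i::n] for i in range(n)]


def map_find(chunk, sample_codes):
    """Map phase: each worker keeps the records containing any sample code."""
    return [r for r in chunk if any(code in r for code in sample_codes)]


def find_matches(records, sample_codes, workers=3):
    """Reduce phase: concatenate the per-worker results into one list."""
    mapped = (map_find(c, sample_codes) for c in partition(records, workers))
    return reduce(lambda acc, part: acc + part, mapped, [])


records = ["a/item_1", "b/x", "c/item_2", "d/item_1"]
found = find_matches(records, {"item_1"}, workers=2)
```

In a real deployment each chunk would be processed on a different machine; the aggregation step stays the same in spirit.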
Step S103 is performed next: the matching data is processed to obtain matching rules, and the fingerprint library is updated, where the fingerprint library stores matching rules obtained historically. Specifically, the matching rules describe the pattern shared by the sample data and the matching data. More specifically, the fingerprint library stores the matching rules extracted from the matching data in previous executions of the technical solution of the embodiments of the present invention. Those skilled in the art will appreciate that continuously enriching the fingerprint library better supports subsequent iterations, so that the technical solution of the embodiments of the present invention can match more data in the massive data based on the updated fingerprint library.
Step S104 is performed last: matching extraction is performed on the massive data based on the updated fingerprint library to obtain the data in the massive data that matches the matching rules in the updated fingerprint library, and the matched data is added to the database. In a preferred example, the massive data is processed record by record based on the updated fingerprint library, the matching results are collated record by record, and the matched data is added to the database, thereby effectively expanding the volume of the database.
In a variation of this embodiment, after step S104 has been performed, step S101 may be performed again based on the expanded database, so that more sample data is generated from the expanded database, more matching data is detected in the massive data, and the database is ultimately expanded further.
In summary, the scheme of the first embodiment generates matching rules from sample data, performs matching extraction on the original data source (i.e. the massive data) according to the matching rules to expand the database, and then determines sample data again from the expanded database and repeats the preceding steps. The technical solution of the embodiments of the present invention thus forms a closed-loop iterative processing mechanism, which helps the user perform global, systematic analysis and processing of massive data more accurately and efficiently.
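The closed loop of steps S101 to S104 can be sketched as a small driver. This is a toy illustration under stated assumptions: records are "host/path" strings, a matching rule is simply a host name, and the stopping condition (no new data found) and the `rounds` limit are choices made here, not taken from the source.

```python
def host_of(record):
    """Toy feature: the part of a 'host/path' record before the first slash."""
    return record.split("/", 1)[0]


def expand(database, mass_data, rounds=3):
    """Repeat S101 to S104 until a round adds nothing or rounds run out."""
    fingerprints = set()
    for _ in range(rounds):
        samples = {host_of(d) for d in database}                     # S101
        matched = [d for d in mass_data if host_of(d) in samples]    # S102
        fingerprints |= {host_of(d) for d in matched}                # S103
        extracted = {d for d in mass_data
                     if host_of(d) in fingerprints}                  # S104
        if extracted <= database:
            break  # nothing new was found, so the loop has converged
        database |= extracted
    return database, fingerprints


mass = ["a.com/1", "a.com/2", "b.com/1"]
db, fps = expand({"a.com/1"}, mass)
```

Starting from one known record on a.com, the first round pulls in the other a.com record and the second round finds nothing new, so the loop stops.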
Fig. 2 is a flowchart of a data detection and expansion method based on sample data according to the second embodiment of the present invention. Specifically, in this embodiment, step S201 is performed first: a preset number of pieces of data are selected from the database, and the feature information of the preset number of pieces of data is used as the sample data. More specifically, the preset number is determined by the user according to the data-processing capability of the hardware or software executing the embodiment of the present invention. Preferably, the feature information may be the feature identification code of the preset number of pieces of data. For example, when the data is the URL information of a commodity, the feature identification code may be the identification code (ID) of the commodity, which can be extracted from the URL information corresponding to the commodity.
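As a small illustration of extracting a commodity ID from URL information, the sketch below pulls an identifier out of each URL with a regular expression. The `item_<digits>` shape of the ID and the example.com URLs are assumptions; the source does not say how IDs are embedded in URLs.

```python
import re

# Hypothetical ID shape; a real site would need its own pattern.
ID_PATTERN = re.compile(r"item_\d+")


def feature_codes(urls):
    """Return the first ID found in each URL, skipping URLs without one."""
    codes = []
    for url in urls:
        found = ID_PATTERN.search(url)
        if found:
            codes.append(found.group(0))
    return codes


codes = feature_codes([
    "http://shop.example.com/item_44123.html",
    "http://shop.example.com/about",
])
```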
Step S202 is then performed: the massive data is searched for data having the same feature information as the sample data, and that data is used as the matching data. Preferably, for massive data that is likewise represented as URLs, the URL of each piece of massive data can be split by structure into three matching positions (host, path, and query), and the selected matching positions, whether one, two, or all three, are compared with the sample data to find data in the massive data having the same feature information as the sample data. Preferably, for sample data whose feature information is a feature identification code, different matching rules can be used to search the massive data for data having the same feature information as the sample data.
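Splitting a URL into the three matching positions can be done with Python's standard `urllib.parse.urlsplit`. In the sketch below, the function names, the substring comparison, and the example URL are assumptions; the source only states that one, two, or all three positions may be selected.

```python
from urllib.parse import urlsplit


def matching_positions(url):
    """Split a URL into its host, path, and query matching positions."""
    parts = urlsplit(url)
    return {"host": parts.netloc, "path": parts.path, "query": parts.query}


def has_feature(url, code, positions=("host", "path", "query")):
    """Check whether the code occurs in any of the selected positions."""
    fields = matching_positions(url)
    return any(code in fields[p] for p in positions)


url = "http://shop.example.com/list/item_44123?ref=home"
in_path = has_feature(url, "item_44123", positions=("path",))
elsewhere = has_feature(url, "item_44123", positions=("host", "query"))
```

Restricting the comparison to chosen positions avoids accidental hits, for example an ID that happens to appear in a query string.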
In a preferred example, the host position in the URLs of the massive data can be matched, and a left-side match can be used to find which pieces of data in the massive data have a URL host position with the same feature identification code as the sample data. Preferably, a left-side match means that the left side of the string at the position to be matched (the host position in this example) completely matches the feature identification code of the sample data. For example, if the host position of the URL of a piece of data in the massive data contains the string item_44123_abcde, that string can be considered to completely match the sample data represented by the feature identification code item_44123, so the piece of data is determined to have the same feature information as the sample data.
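A left-side match on the host position amounts to a prefix test. The sketch below uses `str.startswith` on the host extracted by `urlsplit`; the function name and the full URL wrapped around the item_44123_abcde string are assumptions for illustration.

```python
from urllib.parse import urlsplit


def host_left_match(url, code):
    """True when the host part of the URL begins with the feature code."""
    return urlsplit(url).netloc.startswith(code)


hit = host_left_match("http://item_44123_abcde.example.com/", "item_44123")
miss = host_left_match("http://www.example.com/item_44123", "item_44123")
```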
Step S203 is performed next: the matching data is processed to obtain matching rules, and the fingerprint library is updated, where the fingerprint library stores matching rules obtained historically. For details, those skilled in the art may refer to step S103 in the embodiment shown in Fig. 1 above, which is not repeated here. Preferably, the matching rules are used to filter and extract the common features of multiple pieces of the matching data.
Step S204 is performed last: matching extraction is performed on the massive data based on the updated fingerprint library to obtain the data in the massive data that matches the matching rules in the updated fingerprint library, and the matched data is added to the database. For details, those skilled in the art may refer to step S104 in the embodiment shown in Fig. 1 above, which is not repeated here. In a preferred example, all data included in the massive data is matched record by record according to the matching positions, for example in the order host first, then path, and finally query. Specifically, it is first determined whether the host part of the URL of the data matches the host part of a matching rule in the fingerprint library. If the host parts do not match, the data is skipped and matching moves on to other data in the massive data. If the host parts match, the path part of the data is then matched against the path part of the matching rule. When the path parts also match, the query part of the data is matched against the query part of the matching rule, which finally determines whether the data matches the matching rule in the updated fingerprint library.
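The host, then path, then query order lets a mismatch on an early position skip the remaining checks. The sketch below assumes a matching rule is a dictionary of three substrings, one per position; that rule format, the function names, and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def matches_rule(url, rule):
    """Check host first, then path, then query, stopping at the first miss."""
    parts = urlsplit(url)
    if rule["host"] not in parts.netloc:
        return False  # host mismatch: skip this record immediately
    if rule["path"] not in parts.path:
        return False
    return rule["query"] in parts.query


def extract(mass_urls, fingerprint_library):
    """Keep the URLs that satisfy at least one rule in the library."""
    return [u for u in mass_urls
            if any(matches_rule(u, r) for r in fingerprint_library)]


rule = {"host": "shop.example.com", "path": "/item/", "query": "id="}
urls = [
    "http://shop.example.com/item/view?id=7",
    "http://news.example.com/item/view?id=7",
    "http://shop.example.com/cart?id=7",
]
kept = extract(urls, [rule])
```

Only the first URL passes all three checks; the second fails on the host and the third on the path.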
Further, if the data is determined to satisfy the matching condition of the matching rule, the portion matching the matching rule is extracted from the data and added to the database.
Further, the massive data is matched record by record based on the updated fingerprint library to determine the data in the massive data that matches the matching rules in the updated fingerprint library, and the content of the matching portion is extracted and collated into the database, thereby greatly expanding the volume of the database.
Further, dirty data that may be obtained during detection and expansion while implementing the embodiments of the present invention can be screened by manual review and/or automatic computer recognition, to ensure the validity and accuracy of the data finally added to the database.
In a variation of step S201, the feature information may also be a regular expression determined from the preset number of pieces of data. Those skilled in the art will appreciate that the regular expression may be used to match the feature information of all data randomly selected from the database; alternatively, the regular expression may be used to match the feature information of all data that the user wishes to detect and expand from the massive data.
For example, suppose device feature identification codes are to be used as sample data for detection and expansion in the massive data, and the sample data randomly selected from the database includes device feature identification codes of telecom devices and of mobile devices. The device feature identification code of a telecom device is represented as an International Mobile Equipment Identity (IMEI), while that of a mobile device is represented as a Mobile Equipment Identifier (MEID). What the two kinds of codes have in common is that both are 11-digit numbers beginning with 1, so the regular expression can be determined with reference to this common feature.
As another example, if all data randomly selected from the database are media access control (MAC) addresses, the regular expression can be expressed as "/^([a-zA-Z0-9]{8}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{12})$/".
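The two patterns described above can be tried out with Python's `re` module. In this sketch, `ELEVEN_DIGITS` encodes "an 11-digit number beginning with 1" and `GROUPED_ID` encodes the 8-4-4-4-12 grouping of the quoted expression; the constant names, the `select` helper, and the sample strings are assumptions, and `fullmatch` stands in for the `^...$` anchors.

```python
import re

# "11-digit number beginning with 1", as stated in the text.
ELEVEN_DIGITS = re.compile(r"1\d{10}")

# The 8-4-4-4-12 alphanumeric grouping of the quoted expression.
GROUPED_ID = re.compile(
    r"[a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{4}"
    r"-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{12}"
)


def select(records, pattern):
    """Keep the records that the pattern matches from start to end."""
    return [r for r in records if pattern.fullmatch(r)]


numbers = select(["13812345678", "23812345678", "1381234567"], ELEVEN_DIGITS)
grouped = select(
    ["123e4567-e89b-12d3-a456-426614174000", "00:1A:2B:3C:4D:5E"],
    GROUPED_ID,
)
```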
As a further example, if the user wishes to detect and expand data obtained within a specific geographical area from the massive data, the regular expression can also be used, with the specific geographical area expressed by restricting longitude and latitude.
Further, searching the massive data with different matching rules for data having the same feature information as the sample data may include directly performing a regular-expression match between the portion to be matched of the data (the selected matching position) and the regular expression of the sample data; if the portion to be matched satisfies the matching condition of the regular expression, the data can be determined to have the same feature information as the sample data. For example, the regular expression of the sample data may be shop-(\d+)-; for a piece of data whose portion of the URL to be matched is shop-33415-23-test, that portion satisfies the logic of the regular expression, so the data can be determined to have the same feature information as the sample data.
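The regular-expression match in this example can be checked directly. The sketch below applies the pattern `shop-(\d+)-` to the string from the text with `re.search` and reads the captured shop number; the function name `shop_id` is an assumption.

```python
import re

SHOP_PATTERN = re.compile(r"shop-(\d+)-")


def shop_id(text):
    """Return the digits captured by the pattern, or None when no match."""
    found = SHOP_PATTERN.search(text)
    return found.group(1) if found else None


matched_id = shop_id("shop-33415-23-test")
unmatched = shop_id("store-33415-23-test")
```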
In a variation of step S202, the matching rules also include a right-side match: if the right side of the string at the position to be matched of a piece of data in the massive data completely matches the feature identification code of the sample data, the data is determined to have the same feature information as the sample data. For example, if the path position of the URL of a piece of data in the massive data contains the string car_shanghai_ser33456 and the feature identification code of the sample data is ser33456, the piece of data can be determined to have the same feature information as the sample data.
In another variation of the step S202, the matched rule also includes exactly-equal matching: if the character string at the position to be matched of a certain piece of data in the mass data is exactly equal to the signature identification code of the sample data, it is determined that the data has the same characteristic information as the sample data. For example, the character string shop=33415&category=23&item=test can be considered exactly equal to the signature identification code 33415, since the value of its shop parameter is 33415.
In another variation of the step S202, the matched rule also includes containing matching: if the character string at the position to be matched of a certain piece of data in the mass data contains the signature identification code of the sample data, it is determined that the data has the same characteristic information as the sample data. For example, the character string shop-33415-23-test can be considered to contain the signature identification code 33415.
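The three string-based matching ways can be sketched side by side. This is only an illustrative reading: right-side containing matching is assumed here to mean that the string ends with the identification code, and exactly-equal matching is assumed to be applied to each parameter value of a query string. The function names are hypothetical, and the code value used for the right-side case is an illustrative one chosen so that it is a literal suffix of the string.

```python
from urllib.parse import parse_qsl


def right_side_match(text: str, code: str) -> bool:
    """Assumed reading of right-side containing matching: text ends with the code."""
    return text.endswith(code)


def equal_match(query: str, code: str) -> bool:
    """Exactly-equal matching, applied to each parameter value of a query string."""
    return any(value == code for _, value in parse_qsl(query))


def contains_match(text: str, code: str) -> bool:
    """Containing matching: the code appears anywhere in the text."""
    return code in text


if __name__ == "__main__":
    print(right_side_match("car_shanghai_ser33456", "ser33456"))
    print(equal_match("shop=33415&category=23&item=test", "33415"))
    print(contains_match("shop-33415-23-test", "33415"))
```

Keeping the three ways as separate predicates lets a matched rule record which one produced a hit, which the later structuring step needs.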
In a variation of the step S204, when the characteristic information is the regular expression determined according to the predetermined number of pieces of data, and the currently scanned data in the mass data has the same characteristic information as the sample data, the regular expression can be extracted directly from the currently scanned data and updated to the fingerprint base.
In a variation of the present embodiment, when the step S202 searches the mass data based on the sample data, if a preset limit condition exists, the search is performed in the part of the mass data defined by the preset limit condition, to obtain the matched data. Preferably, for data and sample data represented by URLs, the preset limit condition can be the top-level domain name tld in the URL. For example, the user can define the top-level domain name tld for some or all of the sample data selected in the step S201; then, when performing the step S202 to the step S204 for the sample data limited by the top-level domain name tld, the technical scheme of the embodiment of the present invention preferably detects, and expands to the database, only the data on the website where the top-level domain name tld is located.
Further, the preset limit condition can be set according to user demand, or according to the data-processing capacity of the equipment that performs the technical scheme of the embodiment of the present invention.
Further, the top-level domain names tld of the sample data can be the same or different. For example, among all the sample data selected from the database, the top-level domain name tld of one half of the sample data and that of the other half can be set to different websites, so that data detection and retrieval are performed on the two websites simultaneously based on the technical scheme of the embodiment of the present invention.
In a typical application scenario, when a computer performs the technical scheme of the embodiment of the present invention, the sample data is first loaded into the local memory of the computer. When some or all of the sample data has the preset top-level domain name tld, a mapping table can be built in the memory; the mapping table is used to classify and store the characteristic information or regular expressions of the one or more pieces of sample data that share the same top-level domain name tld.
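A minimal sketch of such an in-memory mapping table is given below, assuming each sample is available as a (tld, characteristic information) pair. The function name, the pair layout and the sample values are illustrative assumptions, not taken from the original text.

```python
from collections import defaultdict


def build_mapping_table(samples):
    """Group characteristic information (codes or regular expressions) by tld."""
    table = defaultdict(list)
    for tld, feature in samples:
        table[tld].append(feature)
    return dict(table)


if __name__ == "__main__":
    # Hypothetical samples: two identification codes on one site, one regex on another.
    samples = [
        ("host.com", "item_1234"),
        ("host.com", "item_1368"),
        ("example.org", r"shop-(\d+)-"),
    ]
    print(build_mapping_table(samples))
```

Keying the table by tld means that a scanned record only needs to be compared against the samples of its own website.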
Preferably, for the application scenario in which the characteristic information is the signature identification code, a character match tree can be built for the one or more pieces of sample data having the same top-level domain name tld, to improve matching efficiency during subsequent detection and expansion of data.
Preferably, for the application scenario in which the characteristic information is the regular expression, the respective regular expressions of the one or more pieces of sample data having the same top-level domain name tld can also be stored as a list, for use in the subsequent detection and expansion steps. As a variation, the regular expression can also be determined for multiple pieces of sample data having the same top-level domain name tld.
Further, when the sample data and the mass data are represented by URLs, the step S202 preferably first processes the URL of the sample data to obtain the top-level domain name tld corresponding to the sample data. Then, when scanning the mass data one by one, it judges whether the URL of the currently scanned data contains the top-level domain name tld. If the judgment shows that the URL of the currently scanned data does not contain the top-level domain name tld, the data is skipped directly; otherwise, if the URL of the currently scanned data contains the top-level domain name tld, the step S202 continues: based on the chosen matched position, the URL of the data is compared with the characteristic information of the sample data, so as to find in the mass data the data having the same characteristic information as the sample data.
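The skip-then-compare scan can be sketched as follows. This is a simplified illustration: the tld is assumed to be a registered domain such as host.com that the URL host must equal or end with, and the comparison step is reduced to a containing match over the whole URL. The function names and the URLs are hypothetical.

```python
from urllib.parse import urlsplit


def url_has_tld(url: str, tld: str) -> bool:
    """True if the URL host equals the tld or is a subdomain of it."""
    host = urlsplit(url).hostname or ""
    return host == tld or host.endswith("." + tld)


def scan(urls, tld, feature_code):
    """Scan records one by one, skipping those on other websites."""
    matched = []
    for url in urls:
        if not url_has_tld(url, tld):
            continue  # not on the target website: skip directly
        if feature_code in url:  # simplified stand-in for the chosen matching way
            matched.append(url)
    return matched


if __name__ == "__main__":
    urls = [
        "http://a.host.com/path/test.html?item_id=item_1234",
        "http://other.net/item_1234",
        "http://b.host.com/test?item_id=item_9999",
    ]
    print(scan(urls, "host.com", "item_1234"))
```

Checking the host first is cheap, so most records from unrelated websites never reach the more expensive characteristic comparison.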
From the above, with the scheme of the second embodiment, the data in the mass data that has the same characteristic information as the sample data can be detected according to the sample data, so that the data finally expanded into the database all has the same characteristic information, which meets the user's practical need to discover and expand data of a specific type in the mass data.
It will be appreciated by those skilled in the art that the step S201 and the step S202 of the present embodiment, together with their variations, can be understood as a specific implementation of the step S101 and the step S102 of the embodiment shown in Fig. 1. The preset limit condition reduces the matching workload when matching in the mass data, while allowing the user to perform data snooping and expansion for a specific website. Further, the user can choose according to actual needs whether to set the preset limit condition. When the user does not set the preset limit condition, the embodiment of the present invention takes all access records on the internet as the mass data for data snooping and expansion (i.e., whole-network search); when the user sets the preset limit condition, the embodiment of the present invention takes the access records on the one or more websites limited by the preset limit condition as the mass data, to obtain the data needed by the user (i.e., specific-website search).
As a variation, when the user chooses to perform a whole-network search, the embodiment of the present invention can first execute the technical scheme of the embodiment of the present invention on multiple websites to obtain the matched rule from each website; after the respective matched rules of the multiple websites are integrated into a general matching symbol, the whole-network search is performed with the general matching symbol as the characteristic information of the sample data.
Fig. 3 is a flow chart of a data snooping and extending method based on sample data according to the third embodiment of the present invention. Specifically, in the present embodiment, step S301 is performed first: a predetermined number of pieces of data are selected from the database, and the characteristic information of the predetermined number of pieces of data is used as the sample data. More specifically, those skilled in the art may refer to the step S201 of the embodiment shown in Fig. 2, which is not repeated here.
Step S302 is then performed: the data having the same characteristic information as the sample data is searched for in the mass data, and the data having the same characteristic information is used as the matched data. Specifically, those skilled in the art may refer to the step S202 of the embodiment shown in Fig. 2, which is not repeated here.
Next, step S303 is performed: structuring treatment is carried out on the matched data to obtain normal data arranged in a preset format. Specifically, the result of the structuring treatment can be represented as a table, in which all or part of the content of the matched data is recorded by category. More specifically, the normal data can be the result obtained after the content of the table is arranged in the preset format. In a preferred example, the matched data is also represented in URL form, and the categories recorded in the table include the top-level domain name tld, the port (port), the match parameter (querykey), the matched position, the matching content and the matching way. This step can split the URL of the matched data according to the categories recorded in the table, and then re-order and integrate the split results according to the preset format; the result of this re-ordering and integration is the normal data.
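The structuring treatment can be sketched as splitting a matched URL into the six categories named above. This is a minimal illustration under stated assumptions: the position labels ("query", "host", "path"), the matching-way labels ("equal", "contains"), the field names, the function name and the example URL are hypothetical, and an absent port is recorded as an empty string.

```python
from urllib.parse import urlsplit, parse_qsl


def structure(url: str, tld: str, code: str):
    """Split a matched URL into tld, port, querykey, position, content, way."""
    parts = urlsplit(url)
    record = {
        "tld": tld,
        "port": "" if parts.port is None else str(parts.port),
        "querykey": "",
        "position": "",
        "content": code,
        "way": "",
    }
    # Query position: a parameter value exactly equal to the identification code.
    for key, value in parse_qsl(parts.query):
        if value == code:
            record.update(querykey=key, position="query", way="equal")
            return record
    # Host or path position: the code is contained in that part of the URL.
    if code in (parts.hostname or ""):
        record.update(position="host", way="contains")
        return record
    if code in parts.path:
        record.update(position="path", way="contains")
        return record
    return None  # the code does not occur at any supported position


if __name__ == "__main__":
    print(structure("http://c.host.com:1234/test?id=item_1234", "host.com", "item_1234"))
```

Because every record has the same keys in the same order, records found through different matching ways can be compared and de-duplicated directly.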
Step S304 is then performed: the matched rule is generated based on the normal data, and duplicates are removed. In a preferred example, the normal data can first be converted into the matched rule according to the preset format, and the duplicate items in the matched rules obtained by conversion are then removed, to obtain the matched rule after duplicate removal. It will be appreciated by those skilled in the art that, after the treatment of the step S303, the normal data may include only the key information needed for the subsequent matching work and cannot be applied directly to the subsequent steps; the treatment of this step is therefore needed to convert the normal data into the matched rule according to the preset format, for use by the subsequent steps. On the other hand, because the URLs of the same website are generally designed in a similar way, this step can perform duplicate removal on all of the matched rules after the conversion has produced them, to reject the duplicate items among the matched rules obtained by the conversion of this step.
Next, step S305 is performed: the fingerprint base is updated based on the matched rule after duplicate removal. Specifically, the update includes storing the matched rule after duplicate removal in the fingerprint base. More specifically, the update also includes rejecting, from the matched rules after duplicate removal, those that repeat a matched rule already existing in the fingerprint base. In a preferred example, the matched rules after duplicate removal are compared with the matched rules in the fingerprint base to remove duplicate items a second time, and the matched rules remaining after this second removal are then updated to the fingerprint base.
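The two rounds of duplicate removal can be sketched as follows, assuming each matched rule is represented as a hashable tuple and the fingerprint base as a plain list. The tuple layout, the function names and the sample rules are illustrative assumptions.

```python
def dedupe(rules):
    """First round: remove duplicate rules, keeping first-seen order."""
    seen = set()
    unique = []
    for rule in rules:
        if rule not in seen:
            seen.add(rule)
            unique.append(rule)
    return unique


def update_fingerprint_base(fingerprint_base, new_rules):
    """Second round: drop rules already stored, then append the rest in place."""
    existing = set(fingerprint_base)
    added = [rule for rule in dedupe(new_rules) if rule not in existing]
    fingerprint_base.extend(added)
    return added


if __name__ == "__main__":
    # Hypothetical rules as (tld, port, querykey, position, way) tuples.
    base = [("host.com", "1234", "id", "query", "equal")]
    new = [
        ("host.com", "", "item_id", "query", "equal"),
        ("host.com", "", "item_id", "query", "equal"),
        ("host.com", "1234", "id", "query", "equal"),
    ]
    print(update_fingerprint_base(base, new))
    print(base)
```

Returning only the rules that were actually added makes it easy to see which rules are new in a given run.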
Finally, step S306 is performed: matching extraction is carried out in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matched rules in the updated fingerprint base, and the data obtained by matching is expanded to the database. Specifically, those skilled in the art may refer to the step S104 of the embodiment shown in Fig. 1, which is not repeated here.
Further, the matched rule can be understood as a combination of filtering data and extracting data.
In a preferred application scenario, the top-level domain name tld and the match parameter in the matched rule can be used to filter data. For example, when the step S305 is performed, it can first be judged preliminarily, based on the top-level domain name tld and the match parameter, whether the currently scanned data in the mass data is worth further matching work. If the top-level domain name tld of the currently scanned data does not correspond to the top-level domain name tld recorded in the matched rule, the currently scanned data can be rejected directly, which reduces the amount of matching performed by the embodiment of the present invention and improves matching efficiency.
In another preferred application scenario, the matching way, the matched position and the matching content or regular expression in the matched rule can be used to extract data, to finally determine whether the currently scanned data has the same characteristic information as the sample data.
Further, the fingerprint base and the database can be stored in the computer that performs the embodiment of the present invention, in other storage devices coupled with the computer, or in the cloud.
From the above, with the scheme of the third embodiment, the step S303, the step S304 and the step S305 of the present embodiment can be understood as a specific implementation of the step S103 of the embodiment shown in Fig. 1, or of the step S203 of the embodiment shown in Fig. 2. Through the structuring treatment, the multiple pieces of matched data obtained by different matching ways have a highly unified form, which facilitates subsequent treatment. On the other hand, the duplicate removal in the step S304 and the second duplicate removal in the step S305 ensure that no duplicate matched rules appear in the fingerprint base, so that storage resources are not wasted needlessly.
In a typical application scenario, the data is the commodities sold on a certain website and is represented in URL form; the database stores part of the commodities sold on the website, and the user wishes to obtain information on the other commodities sold on the website. The user can then use the technical scheme of the embodiment of the present invention to randomly select a predetermined number of commodities from the multiple commodities existing in the database, and use the number of each selected commodity on the website as the signature identification code of that commodity. For example, the domain name of the website is host.com (i.e., the preset limit condition is set with the top-level domain name tld), and the user has selected 2 commodities in the database as the sample data, where the number of commodity A on the website is item_1234 and the number of commodity B on the website is item_1368; the sample data is then item_1234 and item_1368.
When the technical scheme of the embodiment of the present invention searches the mass data based on the sample data, the sample data can first be loaded into the local memory of the computer that performs the embodiment of the present invention, and a dictionary is built. The key (key) of the dictionary is the top-level domain name tld of the sample data (host.com in this application scenario), and the value (value) of the dictionary is the character match tree under that top-level domain name tld. Preferably, the character match tree is built by splitting the character strings of all the sample data into single characters. Preferably, in this application scenario, the character match tree shown in Fig. 4 can be built from the sample data item_1234 and item_1368.
Then, the mass data is scanned one by one and searched based on the character match tree. Specifically, it is first judged whether the top-level domain name tld in the URL of the currently scanned data in the mass data is equal to host.com. If it is not equal, the currently scanned data is skipped; if it is equal, the subsequent matching work is carried out on the currently scanned data.
In this application scenario, for currently scanned data whose top-level domain name tld is equal to host.com, exactly-equal matching needs to be carried out on the query part of the URL of the currently scanned data (i.e., the matched position is the query, and the matched rule is exactly-equal matching). Taking the currently scanned data represented by the URL http://a.host.com/path/test.html?qk1=i123&qk2=item_1246&item_id=item_1234 as an example, the URL can first be split to obtain the query part of the URL of the currently scanned data, and the query part is then further split by the separators "&" and "=", which yields a dictionary in key-value form: {"qk1": "i123", "qk2": "item_1246", "item_id": "item_1234"}. The dictionary is then traversed, and each value in the dictionary is looked up character by character on the character match tree shown in Fig. 4.
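The splitting of the query part can be sketched as follows. This is a minimal illustration, assuming the URL carries the "?" query delimiter that appears to have been lost in the URL as printed, and taking the qk1 value as it is listed in the resulting dictionary; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def query_to_dict(url: str) -> dict:
    """Split the query part of a URL by '&' and '=' into a key-value dictionary."""
    query = urlsplit(url).query
    result = {}
    for pair in query.split("&"):
        if not pair:
            continue  # ignore empty segments such as a trailing '&'
        key, _, value = pair.partition("=")
        result[key] = value
    return result


if __name__ == "__main__":
    url = "http://a.host.com/path/test.html?qk1=i123&qk2=item_1246&item_id=item_1234"
    print(query_to_dict(url))
```

Each value of the returned dictionary is then a candidate to be looked up on the character match tree.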
For example, when the value i123 is matched, the first character i is matched successfully; the second character 1 of the value i123 is matched next, but in the character match tree shown in Fig. 4 the child list of character i contains only character t and does not contain 1, so the matching of the value i123 fails.
As another example, when the value item_1246 is matched, the first character i is matched successfully; the second character t is also contained in the child list of character i in the character match tree shown in Fig. 4; the third character e is likewise in the child list of character t in the character match tree shown in Fig. 4; similarly, the characters m, _ and 1 match the character match tree shown in Fig. 4. Character 2 is matched next: character 1 in the character match tree shown in Fig. 4 has two child nodes [2, 3], which contain the character 2 to be matched, so matching can continue with character 4. When character 4 is matched, since it was determined while matching the preceding character 2 that the value item_1246 may match the branch of character 2 among the child nodes [2, 3] below character 1 in the character match tree shown in Fig. 4, character 4 is matched along the branch of character 2. However, the child node below character 2 in that branch of the character match tree shown in Fig. 4 is character 3, which does not include the character 4 to be matched, so the matching of the value item_1246 also fails.
As another example, when the value item_1234 is matched, following the foregoing steps of matching against the character match tree shown in Fig. 4, it can be determined that the value item_1234 matches the character match tree shown in Fig. 4 completely. It is therefore determined that the URL of the scanned data contains the sample data, and that the match parameter is the commodity ID.
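The character match tree and the three lookups walked through above can be sketched with nested dictionaries. This is a minimal illustration, not a reproduction of Fig. 4: the tree is assumed to be built from the sample codes item_1234 and item_1368 (written with an underscore, as in the URLs), and the "$end" marker and the function names are hypothetical.

```python
END = "$end"  # marker key; it cannot collide with single-character child keys


def build_tree(codes):
    """Build a character match tree (trie): one nested dict level per character."""
    root = {}
    for code in codes:
        node = root
        for ch in code:
            node = node.setdefault(ch, {})
        node[END] = True  # a complete sample code ends at this node
    return root


def matches(tree, value: str) -> bool:
    """Walk the tree character by character; succeed only on a complete code."""
    node = tree
    for ch in value:
        if ch not in node:
            return False  # the character is not in the current child list
        node = node[ch]
    return END in node


if __name__ == "__main__":
    tree = build_tree(["item_1234", "item_1368"])
    for value in ("i123", "item_1246", "item_1234"):
        print(value, matches(tree, value))
```

The lookup cost depends on the length of the value being tested rather than on the number of sample codes, which is the efficiency gain the tree is meant to provide.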
Table 1: List of matched data represented by URLs
http://a.host.com/path/test.html?qk1=i123&qk2=item_1246&item_id=item_1234 |
http://b.host.com/test?item_id=item_1368&a=c |
http://c.host.com:1234/test?id=item_1234 |
http://item_1368.host.com/detai_info.html |
http://a.host.com:3345/category-1234-item_1234-t12 |
http://a.host.com:3567/item/item_1234/detail.html |
Continuing to scan the mass data yields further matched data represented by URLs. The matched data can include the URLs shown in Table 1.
Table 2: List of normal data obtained after the structuring treatment of Table 1
As shown in Table 2, after the one-by-one scanning of the mass data based on the sample data is completed, structuring treatment can be carried out on the matched data obtained by this search, to obtain the normal data represented in the preset format. Preferably, the normal data is arranged in the order of top-level domain name tld, port (port), match parameter (querykey), matched position, matching content and matching way, where content that is absent is represented as empty. For example, when the port is the default value (i.e., 80), it can generally be omitted from the URL, and it is then also represented as empty in the normal data. As another example, for the matched data that the embodiment of the present invention finds in the mass data with the path as the matched position, the match parameter of the resulting normal data is empty after the structuring treatment.
Table 3: List of matched rules obtained by converting the normal data of Table 2
The normal data listed in Table 2 is converted into the matched rules according to the preset format, as shown in Table 3. Here, (item_\d+) is a regular expression that represents a character string beginning with item_ and followed by digits.
Further, according to the matched rule and the match parameter, the second row of the matched rules listed in Table 3 can be removed as a duplicate. The rules are then compared with the matched rules already existing in the fingerprint base, and any matched rule in Table 3 that repeats a matched rule already existing in the fingerprint base is removed; the matched rules that have gone through both rounds of duplicate removal are finally updated to the fingerprint base.
Further, the updated fingerprint base is re-applied to the mass data, which is rescanned in the order of host, path and query; the newly added matched rules can then match the URLs of more commodities. By updating the URLs of the commodities obtained by matching (or the part of each commodity URL that meets the matched rule) to the database, the expansion of the database is finally realized.
For example, the newly added matched rule http://*.host.com/*?item_id=* may match a commodity whose URL is http://test1.host.com/a?item_id=test1&b=c, or a commodity whose URL is http://test1.host.com/path/subpath/subpath/a.html?q1=v1&q2=v2&item_id=11111. In the two newly matched commodity URLs above, the parts that meet the matched rule are test1 and 11111.
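Applying such a rule to a scanned URL can be sketched as follows. This is an illustrative reading only: the rule is assumed to mean "any subdomain of host.com, any path, and a query parameter named item_id", with the wildcard on the host handled by shell-style pattern matching. The function name, the default arguments and the third URL are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, parse_qs


def extract(url: str, host_pattern: str = "*.host.com", querykey: str = "item_id"):
    """Return the value of the match parameter if the URL meets the rule, else None."""
    parts = urlsplit(url)
    # Filter step: the host must fit the wildcard pattern of the rule.
    if not fnmatchcase(parts.hostname or "", host_pattern):
        return None
    # Extract step: read the first value of the match parameter from the query.
    values = parse_qs(parts.query).get(querykey)
    return values[0] if values else None


if __name__ == "__main__":
    print(extract("http://test1.host.com/a?item_id=test1&b=c"))
    print(extract("http://test1.host.com/path/subpath/subpath/a.html?q1=v1&q2=v2&item_id=11111"))
    print(extract("http://test1.other.com/a?item_id=x"))  # hypothetical URL on another site
```

The host check plays the filtering role of the rule and the parameter lookup plays the extracting role, mirroring the description of a matched rule as a combination of the two.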
It will be appreciated by those skilled in the art that, through the technical scheme of the embodiment of the present invention, the two pieces of data test1 and 11111 have finally been expanded based on the sample data item_1234 in the database. In actual applications, the technical scheme of the embodiment of the present invention can discover potential data in a large number of long-tail URLs, thereby greatly expanding the database and realizing deep mining of the mass data.
Fig. 5 is a structural diagram of a data snooping and expanding device based on sample data according to the fourth embodiment of the present invention. It will be appreciated by those skilled in the art that the data snooping and expanding device 4 of the present embodiment is used to implement the method and technical scheme of the embodiments shown in Fig. 1 to Fig. 4. Specifically, in the present embodiment, the data snooping and expanding device 4 includes: a determining module 41, for determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data detected and obtained from mass data; a searching module 42, for searching in the mass data based on the sample data, to obtain the matched data in the mass data that matches the sample data; an update module 43, for processing the matched data to obtain a matched rule and updating a fingerprint base, the fingerprint base storing the matched rules obtained historically; and an extraction module 44, for performing matching extraction in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matched rules in the updated fingerprint base, and expanding the data obtained by matching to the database.
Further, the determining module 41 includes a selection submodule 411, for selecting a predetermined number of pieces of data from the database and using the characteristic information of the predetermined number of pieces of data as the sample data. Preferably, the characteristic information includes the signature identification code of the predetermined number of pieces of data, or the regular expression determined according to the predetermined number of pieces of data.
Further, the searching module 42 includes a first searching submodule 421, for searching the mass data for the data having the same characteristic information as the sample data, and using the data having the same characteristic information as the matched data.
Further, the searching module 42 also includes a second searching submodule 422; the second searching submodule 422 is used, when searching in the mass data based on the sample data, to search in the part of the mass data defined by a preset limit condition if the preset limit condition exists, to obtain the matched data.
Further, the update module 43 includes: a treatment submodule 431, for carrying out structuring treatment on the matched data to obtain normal data arranged in a preset format; a generation submodule 432, for generating the matched rule based on the normal data and removing duplicates; and an update submodule 433, for updating the fingerprint base based on the matched rule after duplicate removal.
Further, the generation submodule 432 includes: a converting unit 4321, for converting the normal data into the matched rule according to the preset format; and a duplicate removal unit 4322, for removing the duplicate items in the matched rules obtained by conversion, to obtain the matched rule after duplicate removal.
Further, the update submodule 433 includes: a comparing unit 4331, for comparing the matched rules after duplicate removal with the matched rules in the fingerprint base, to remove duplicate items a second time; and an updating unit 4332, for updating the matched rules remaining after the second removal of duplicate items to the fingerprint base.
Preferably, the data is internet access records.
For more on the operating principle and working method of the data snooping and expanding device 4, reference may be made to the associated description of Fig. 1 to Fig. 4, which is not repeated here.
One of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium can include ROM, RAM, a magnetic disk, an optical disc, or the like.
Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art can make various changes or modifications without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall therefore be defined by the scope of the claims.
Claims (18)
1. A data snooping and extending method based on sample data, characterized by comprising the following steps:
determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data detected and obtained from mass data;
searching in the mass data based on the sample data, to obtain matched data in the mass data that matches the sample data;
processing the matched data to obtain a matched rule, and updating a fingerprint base, the fingerprint base storing the matched rules obtained historically;
performing matching extraction in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matched rules in the updated fingerprint base, and expanding the data obtained by matching to the database.
2. The data snooping and extending method based on sample data according to claim 1, characterized in that determining the sample data based on at least one piece of data in the database comprises the following step:
selecting a predetermined number of pieces of data from the database, and using the characteristic information of the predetermined number of pieces of data as the sample data.
3. The data snooping and extending method based on sample data according to claim 2, characterized in that the characteristic information includes:
the signature identification code of the predetermined number of pieces of data; or
the regular expression determined according to the predetermined number of pieces of data.
4. The data snooping and extending method based on sample data according to claim 2, characterized in that searching in the mass data based on the sample data, to obtain the matched data in the mass data that matches the sample data, comprises the following step:
searching the mass data for the data having the same characteristic information as the sample data, and using the data having the same characteristic information as the matched data.
5. The data snooping and extending method based on sample data according to claim 4, characterized in that, when searching in the mass data based on the sample data, if a preset limit condition exists, the search is performed in the part of the mass data defined by the preset limit condition, to obtain the matched data.
6. The data snooping and extending method based on sample data according to claim 1, characterized in that processing the matched data to obtain the matched rule, and updating the fingerprint base, comprises the following steps:
carrying out structuring treatment on the matched data, to obtain normal data arranged in a preset format;
generating the matched rule based on the normal data and removing duplicates;
updating the fingerprint base based on the matched rule after duplicate removal.
7. The data snooping and extending method based on sample data according to claim 6, characterized in that generating the matched rule based on the normal data and removing duplicates comprises the following steps:
converting the normal data into the matched rule according to the preset format;
removing the duplicate items in the matched rules obtained by conversion, to obtain the matched rule after duplicate removal.
8. The data snooping and extending method based on sample data according to claim 6, characterized in that updating the fingerprint base based on the fingerprint after duplicate removal comprises the following steps:
comparing the matched rules after duplicate removal with the matched rules in the fingerprint base, to remove duplicate items a second time;
updating the matched rules remaining after the second removal of duplicate items to the fingerprint base.
9. The data snooping and extending method based on sample data according to any one of claims 1 to 8, characterized in that the data is internet access records.
10. A data snooping and expanding device based on sample data, characterized by including:
a determining module, for determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data detected and obtained from mass data;
a searching module, for searching in the mass data based on the sample data, to obtain matched data in the mass data that matches the sample data;
an update module, for processing the matched data to obtain a matched rule, and updating a fingerprint base, the fingerprint base storing the matched rules obtained historically;
an extraction module, for performing matching extraction in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matched rules in the updated fingerprint base, and expanding the data obtained by matching to the database.
11. The data detection and expansion device based on sample data according to claim 10, characterized in that the determining module comprises:
a selection submodule, configured to select a predetermined number of pieces of data from the database and to use the feature information of the predetermined number of pieces of data as the sample data.
12. The data detection and expansion device based on sample data according to claim 11, characterized in that the feature information comprises:
a feature identification code of the predetermined number of pieces of data; or
a regular expression determined from the predetermined number of pieces of data.
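The two forms of feature information named in claim 12 could look like the following. The claims do not say how the identification code or the regular expression is computed, so both functions are assumptions made for illustration: `feature_code` uses a truncated SHA-1 digest, and `feature_regex` builds a pattern from the longest common prefix of the samples.

```python
import hashlib
import os.path
import re


def feature_code(record):
    # Hypothetical identification code: a short, stable digest of the record.
    return hashlib.sha1(record.encode("utf-8")).hexdigest()[:12]


def feature_regex(records):
    # Hypothetical rule: anything that starts with the samples' common prefix.
    prefix = os.path.commonprefix(records)
    return re.compile(re.escape(prefix) + r".*")


samples = ["http://a.example/video/1", "http://a.example/video/27"]
pattern = feature_regex(samples)
print(pattern.pattern)
print(bool(pattern.fullmatch("http://a.example/video/99")))
print(feature_code(samples[0]))
```

An identification code only supports exact matching, while a regular expression generalizes from the samples to unseen records, which is presumably why the claim offers both alternatives.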
13. The data detection and expansion device based on sample data according to claim 11, characterized in that the search module comprises:
a first search submodule, configured to search the massive data for data having the same feature information as the sample data, and to use the data having the same feature information as the matched data.
14. The data detection and expansion device based on sample data according to claim 13, characterized in that the search module further comprises a second search submodule, the second search submodule being configured to, when searching the massive data based on the sample data and a preset restriction condition exists, search only within the portion of the massive data defined by the preset restriction condition, so as to obtain the matched data.
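The restricted search of claim 14 amounts to filtering the massive data before matching. In this sketch the record layout (dictionaries with `host` and `day` keys) and the function name `search_with_limit` are hypothetical; the restriction condition is passed as an optional predicate.

```python
def search_with_limit(mass_data, sample_hosts, limit=None):
    """Search for records matching the sample, optionally within a subset.

    `limit` is an optional predicate standing in for the preset restriction
    condition; when it is None the whole data set is searched.
    """
    candidates = mass_data if limit is None else [r for r in mass_data if limit(r)]
    return [r for r in candidates if r["host"] in sample_hosts]


records = [
    {"host": "a.example", "day": "2016-12-29"},
    {"host": "a.example", "day": "2016-12-30"},
    {"host": "b.example", "day": "2016-12-30"},
]
only_last_day = search_with_limit(
    records, {"a.example"}, limit=lambda r: r["day"] == "2016-12-30"
)
print(only_last_day)
```

Applying the restriction first keeps the more expensive matching step confined to the part of the data that is of interest, for example a single day of access records.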
15. The data detection and expansion device based on sample data according to claim 10, characterized in that the update module comprises:
a processing submodule, configured to perform structuring processing on the matched data to obtain standard data arranged in a preset format;
a generation submodule, configured to generate the matching rules based on the standard data and to deduplicate them;
an update submodule, configured to update the fingerprint library based on the deduplicated matching rules.
16. The data detection and expansion device based on sample data according to claim 15, characterized in that the generation submodule comprises:
a conversion unit, configured to convert the standard data into the matching rules according to the preset format;
a deduplication unit, configured to remove duplicate entries from the converted matching rules to obtain the deduplicated matching rules.
17. The data detection and expansion device based on sample data according to claim 16, characterized in that the update submodule comprises:
a comparison unit, configured to compare the deduplicated matching rules with the matching rules in the fingerprint library, so as to remove duplicate entries a second time;
an updating unit, configured to update the fingerprint library with the matching rules that remain after the second deduplication.
18. The data detection and expansion device based on sample data according to any one of claims 10 to 17, characterized in that the data are internet access records.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611264829.8A CN106844553B (en) | 2016-12-30 | 2016-12-30 | Data detection and expansion method and device based on sample data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611264829.8A CN106844553B (en) | 2016-12-30 | 2016-12-30 | Data detection and expansion method and device based on sample data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106844553A (en) | 2017-06-13 |
CN106844553B (en) | 2020-05-01 |
Family
ID=59117193
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611264829.8A Active CN106844553B (en) | 2016-12-30 | 2016-12-30 | Data detection and expansion method and device based on sample data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106844553B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108647272B (en) * | 2018-04-28 | 2020-12-29 | 江南大学 | Method for predicting concentration of butane at bottom of debutanizer by expanding small samples based on data distribution |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1952929A (en) * | 2005-10-20 | 2007-04-25 | 关涛 | Extraction method and system of structured data of internet based on sample & faced to regime |
CN103942282A (en) * | 2014-04-02 | 2014-07-23 | 新浪网技术(中国)有限公司 | Sample data obtaining method, device and system |
CN104063474A (en) * | 2014-06-30 | 2014-09-24 | 五八同城信息技术有限公司 | Sample data collection system |
CN105095240A (en) * | 2014-05-04 | 2015-11-25 | 中国银联股份有限公司 | Database data sample acquisition |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815488A (en) * | 2018-12-26 | 2019-05-28 | 出门问问信息科技有限公司 | Natural language understanding training data generation method, device, equipment and storage medium |
CN111680286A (en) * | 2020-02-27 | 2020-09-18 | 中国科学院信息工程研究所 | Refinement method of Internet of things equipment fingerprint database |
CN111680286B (en) * | 2020-02-27 | 2022-06-10 | 中国科学院信息工程研究所 | Refinement method of Internet of things equipment fingerprint library |
CN111797085A (en) * | 2020-06-22 | 2020-10-20 | 中国平安财产保险股份有限公司 | Request data processing method and device, computer equipment and storage medium |
CN114511476A (en) * | 2021-12-21 | 2022-05-17 | 中科环森智慧科技(苏州)有限公司 | Intelligent analysis application system for image data |
Also Published As
Publication number | Publication date |
---|---|
CN106844553B (en) | 2020-05-01 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US7818303B2 (en) | Web graph compression through scalable pattern mining | |
CN111782965B (en) | Intention recommendation method, device, equipment and storage medium | |
JP5092165B2 (en) | Data construction method and system | |
CN106844553A (en) | Data snooping and extending method and device based on sample data | |
CN102855309B (en) | A kind of information recommendation method based on user behavior association analysis and device | |
CN112165462A (en) | Attack prediction method and device based on portrait, electronic equipment and storage medium | |
US20080270549A1 (en) | Extracting link spam using random walks and spam seeds | |
JP2015512095A (en) | Method, apparatus and computer readable recording medium for image management in an image database | |
CN102122291A (en) | Blog friend recommendation method based on tree log pattern analysis | |
CN113254630B (en) | Domain knowledge map recommendation method for global comprehensive observation results | |
US20120143844A1 (en) | Multi-level coverage for crawling selection | |
CN110727663A (en) | Data cleaning method, device, equipment and medium | |
CN112231700B (en) | Behavior recognition method and apparatus, storage medium, and electronic device | |
KR102601545B1 (en) | Geographic position point ranking method, ranking model training method and corresponding device | |
CN105389328B (en) | A kind of extensive open source software searching order optimization method | |
CN103226601A (en) | Method and device for image search | |
CN112231481A (en) | Website classification method and device, computer equipment and storage medium | |
CN103761298A (en) | Distributed-architecture-based entity matching method | |
CN109614521A (en) | A kind of efficient secret protection subgraph inquiry processing method | |
CN105357118A (en) | Rule based flow classifying method and system | |
CN110457600B (en) | Method, device, storage medium and computer equipment for searching target group | |
KR101592670B1 (en) | Apparatus for searching data using index and method for using the apparatus | |
JP2017530477A (en) | System and method for processing graphs | |
Shaikh | Web Usage Mining Using Apriori and FP Growth Alogrithm | |
CN110222156B (en) | Method and device for discovering entity, electronic equipment and computer readable medium | |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |