CN106844553A - Data snooping and extending method and device based on sample data - Google Patents

Data snooping and extending method and device based on sample data

Info

Publication number
CN106844553A
Authority
CN
China
Prior art keywords
data
matched
sample
sample data
matched rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611264829.8A
Other languages
Chinese (zh)
Other versions
CN106844553B (en
Inventor
汤奇峰
李炳辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Original Assignee
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd filed Critical ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority to CN201611264829.8A priority Critical patent/CN106844553B/en
Publication of CN106844553A publication Critical patent/CN106844553A/en
Application granted granted Critical
Publication of CN106844553B publication Critical patent/CN106844553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data snooping and extending method and device based on sample data. The method comprises the following steps: determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data detected from mass data; searching the mass data based on the sample data to obtain matched data in the mass data that matches the sample data; processing the matched data to obtain matching rules and updating a fingerprint base, the fingerprint base storing the matching rules obtained historically; and performing matching extraction on the mass data based on the updated fingerprint base to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and expanding the database with the data so obtained. The technical solution provided by the invention allows global, systematic analysis and processing of mass data to be carried out more accurately and efficiently.

Description

Data snooping and extending method and device based on sample data
Technical field
The present invention relates to the field of Internet technology, and more particularly to a data snooping and extending method and device based on sample data.
Background
With the rapid development of Internet technology, the number of websites and Internet users in China has risen rapidly. As the number of users soars and network resources grow ever richer, the access log data generated on the Internet has also expanded rapidly into mass data, so that detecting, finding, and expanding the required data within this mass data has become a top priority in current information processing work.
At present, methods for finding and expanding required data from mass data fall mainly into two categories. The first is manual data inspection: the Uniform Resource Locators (URLs) accessed by users of websites or application programs (Applications, abbreviated APPs, for example application software installed on a mobile phone) are analyzed and summarized manually to obtain a series of matching rules, and the mass data resources of the Internet are then matched again against these rules to extract the data needed for expansion. The second is the application programming interface (API) query: following the documentation published by the API provider, the provider's interface is called as needed to obtain the required data.
Although both approaches can, to some extent, let users find and expand data of a specific type from mass data, each has an unavoidable defect. Manual inspection requires a great deal of manpower for manual analysis and statistics in practice, so detection and expansion are inefficient. The API query approach depends on the documentation supplied by the API provider and is therefore subject to uncertainty.
Moreover, existing data discovery and expansion methods, including the two above, ultimately obtain only data from certain specific websites. Because the number of websites on the Internet has grown enormously, and many websites and APPs follow no unified standard or rule when constructing URLs, the data obtained by existing methods is only a small fraction of the mass data. This hinders global, systematic analysis and processing of the mass data and reduces the accuracy of the data that users detect and expand.
Summary of the invention
The technical problem solved by the present invention is that the prior art cannot carry out global, systematic analysis and processing of mass data in a more accurate and efficient manner.
To solve the above technical problem, an embodiment of the present invention provides a data snooping and extending method based on sample data, comprising the following steps: determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data detected from mass data; searching the mass data based on the sample data to obtain matched data in the mass data that matches the sample data; processing the matched data to obtain matching rules and updating a fingerprint base, the fingerprint base storing the matching rules obtained historically; and performing matching extraction on the mass data based on the updated fingerprint base to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and expanding the database with the data so obtained.
Optionally, determining the sample data based on at least one piece of data in the database comprises the following step: selecting a preset number of pieces of data from the database, and using the characteristic information of the preset number of pieces of data as the sample data.
Optionally, the characteristic information includes: the signature identification code of the preset number of pieces of data; or a regular expression determined from the preset number of pieces of data.
Optionally, searching the mass data based on the sample data to obtain the matched data that matches the sample data comprises the following step: searching the mass data for data having the same characteristic information as the sample data, and using the data having the same characteristic information as the matched data.
Optionally, when searching the mass data based on the sample data, if a preset limit condition exists, the search is performed within the part of the mass data defined by the preset limit condition to obtain the matched data.
Optionally, processing the matched data to obtain matching rules and updating the fingerprint base comprises the following steps: performing structuring processing on the matched data to obtain normalized data arranged in a preset format; generating the matching rules from the normalized data and removing duplicates; and updating the fingerprint base based on the deduplicated matching rules.
Optionally, generating the matching rules from the normalized data and removing duplicates comprises the following steps: converting the normalized data into the matching rules according to the preset format; and removing duplicate entries from the converted matching rules to obtain the deduplicated matching rules.
Optionally, updating the fingerprint base based on the deduplicated matching rules comprises the following steps: comparing the deduplicated matching rules with the matching rules already in the fingerprint base to remove duplicate entries a second time; and updating the matching rules that remain after the second deduplication into the fingerprint base.
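By way of illustration only, the two-stage deduplication described above may be sketched as follows. The record layout (host, path, and query fields), the function names, and the use of a Python set as the fingerprint base are assumptions made for this sketch and are not specified by the embodiments.

```python
def to_rule(record: dict) -> tuple:
    """Convert one structured record into a hashable matching rule.

    The (host, path, query) layout is a hypothetical preset format.
    """
    return (record["host"], record["path"], record["query"])


def update_fingerprint_base(fingerprint_base: set, records: list) -> set:
    """Add rules derived from `records` to `fingerprint_base`; return the new ones."""
    # First deduplication: identical rules produced by different records collapse.
    new_rules = {to_rule(r) for r in records}
    # Second deduplication: drop rules the fingerprint base already stores.
    fresh_rules = new_rules - fingerprint_base
    fingerprint_base |= fresh_rules
    return fresh_rules
```

Returning only the newly added rules lets a caller see whether a round actually enriched the fingerprint base.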
Optionally, the data is an Internet access record.
An embodiment of the present invention further provides a data snooping and extending device based on sample data, comprising: a determining module, configured to determine the sample data based on at least one piece of data in a database, the database storing multiple pieces of data detected from mass data; a searching module, configured to search the mass data based on the sample data to obtain matched data in the mass data that matches the sample data; an update module, configured to process the matched data to obtain matching rules and to update a fingerprint base, the fingerprint base storing the matching rules obtained historically; and an extraction module, configured to perform matching extraction on the mass data based on the updated fingerprint base to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and to expand the database with the data so obtained.
Optionally, the determining module includes: a selection submodule, configured to select a preset number of pieces of data from the database and to use the characteristic information of the preset number of pieces of data as the sample data.
Optionally, the characteristic information includes: the signature identification code of the preset number of pieces of data; or a regular expression determined from the preset number of pieces of data.
Optionally, the searching module includes: a first search submodule, configured to search the mass data for data having the same characteristic information as the sample data, and to use the data having the same characteristic information as the matched data.
Optionally, the searching module further includes a second search submodule, configured to, when the mass data is searched based on the sample data and a preset limit condition exists, search within the part of the mass data defined by the preset limit condition to obtain the matched data.
Optionally, the update module includes: a processing submodule, configured to perform structuring processing on the matched data to obtain normalized data arranged in a preset format; a generation submodule, configured to generate the matching rules from the normalized data and remove duplicates; and an update submodule, configured to update the fingerprint base based on the deduplicated matching rules.
Optionally, the generation submodule includes: a converting unit, configured to convert the normalized data into the matching rules according to the preset format; and a deduplication unit, configured to remove duplicate entries from the converted matching rules to obtain the deduplicated matching rules.
Optionally, the update submodule includes: a comparing unit, configured to compare the deduplicated matching rules with the matching rules in the fingerprint base to remove duplicate entries a second time; and an updating unit, configured to update the matching rules that remain after the second deduplication into the fingerprint base.
Optionally, the data is an Internet access record.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following advantages:
Sample data is first determined from at least one piece of data in a database, and the mass data is searched based on the sample data to detect the matched data that matches the sample data. The matched data is then processed to obtain matching rules, with which a fingerprint base is updated. Finally, matching extraction is performed on the mass data again based on the updated fingerprint base to obtain the data that matches the matching rules in the updated fingerprint base, and the database is expanded with the data so obtained, thereby realizing data detection and expansion based on sample data. Compared with existing data discovery and expansion schemes, which rely mainly on manual work or API queries, the technical solution of the embodiments of the present invention generates matching rules from sample data, performs matching extraction on the original data source (i.e. the mass data) according to those rules to expand the database, and then determines sample data again from the expanded database and repeats the preceding steps, ultimately forming a closed-loop cyclic flow. The technical solution provided by the invention allows global, systematic analysis and processing of mass data to be carried out more accurately and efficiently.
Further, a preset number of pieces of data are selected from the database and their characteristic information is used as the sample data. With the sample data as a template, detection is performed in the mass data to obtain data matching the sample data and thereby expand the database. This ensures that the data stored in the database all share the same characteristic information, meeting users' need to find and collect data of a specific type from mass data.
Brief description of the drawings
Fig. 1 is a flowchart of a data snooping and extending method based on sample data according to the first embodiment of the present invention;
Fig. 2 is a flowchart of a data snooping and extending method based on sample data according to the second embodiment of the present invention;
Fig. 3 is a flowchart of a data snooping and extending method based on sample data according to the third embodiment of the present invention;
Fig. 4 is a schematic diagram of a character matching tree built by the data snooping and extending method based on sample data according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a data snooping and extending device based on sample data according to the fourth embodiment of the present invention.
Detailed description of the embodiments
As stated in the background, existing methods for finding and expanding the data users need from mass data are still limited to manual retrieval or API queries. The former consumes a great deal of manpower for manual analysis and statistics; the latter cannot accommodate global analysis and processing of the data.
To solve this technical problem, the technical solution of the present invention first determines sample data from at least one piece of data in a database and searches the mass data based on the sample data, so as to detect the matched data in the mass data that matches the sample data. The matched data is then processed to obtain matching rules, with which a fingerprint base is updated. Finally, matching extraction is performed on the mass data again based on the updated fingerprint base to obtain the data that matches the matching rules in the updated fingerprint base, and the database is expanded with the data so obtained, realizing data detection and expansion based on sample data.
Those skilled in the art will appreciate that, with the explosive growth in the number of Internet users, the substantial increase in Internet websites, and the rapid rise of Internet bandwidth, more and more users generate more and more Internet user behavior (i.e. Internet access records) on more and more websites. These behaviors are recorded by various data collectors in the form of logs and stored as data (i.e. the mass data). The technical solution of the embodiments of the present invention generates matching rules from sample data, performs matching extraction on the original data source (i.e. the mass data) according to those rules to expand the database, and then determines sample data again from the expanded database and repeats the preceding steps, ultimately forming a closed-loop cyclic flow. The technical solution provided by the invention allows global, systematic analysis and processing of mass data to be carried out more accurately and efficiently.
To make the above objects, features, and advantages of the present invention easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a data snooping and extending method based on sample data according to the first embodiment of the present invention. The data may be Internet access records.
Specifically, in this embodiment, step S101 is performed first: the sample data is determined based on at least one piece of data in a database, the database storing multiple pieces of data detected from mass data. More specifically, the mass data may be data obtained historically from the Internet, for example the Internet access records of all users in history, or the Internet access records of selected users during a selected period. In a preferred example, the quantity of sample data may be set individually according to the data-handling capacity of the hardware or software that executes the embodiment of the present invention; for example, the quantity of sample data may generally be between 10,000 and 100,000. Preferably, the data may be represented in the form of a Uniform Resource Locator (URL); alternatively, the data may be represented in the form of the previous URL (the referer of the URL), a user agent, or a cookie. Those skilled in the art may also derive further embodiments according to actual needs, which are not described here.
Step S102 is then performed: the mass data is searched based on the sample data to obtain the matched data in the mass data that matches the sample data. Specifically, matching may mean that the matched data follows the same rule as the sample data. Preferably, this step may be carried out simultaneously or successively on at least one device cluster, where a device cluster may be formed by coupling one or more computers. In a preferred example, the mass data may be distributed across the computers of multiple clusters for processing, and the matched data found by the computers in each cluster is then collected; for example, the distributed processing and collection of the mass data may be realized by a MapReduce task based on the Hadoop Distributed File System (HDFS).
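The distribute-then-collect pattern of step S102 can be illustrated with a single-process stand-in. The sketch below is not a Hadoop job; it only mimics the map and reduce phases with plain Python functions, and the shard count, function names, and substring test are all assumptions made for illustration.

```python
def split_into_shards(records: list, n_shards: int) -> list:
    """Deal records round-robin into n_shards lists (stand-in for cluster nodes)."""
    return [records[i::n_shards] for i in range(n_shards)]


def map_shard(shard: list, sample_codes: set) -> list:
    """Map phase: keep the records of one shard that contain any sample code."""
    return [r for r in shard if any(code in r for code in sample_codes)]


def reduce_matches(per_shard_results) -> list:
    """Reduce phase: merge the matched data found on every shard."""
    merged = []
    for part in per_shard_results:
        merged.extend(part)
    return merged


def distributed_search(records: list, sample_codes: set, n_shards: int = 3) -> list:
    shards = split_into_shards(records, n_shards)
    return reduce_matches(map_shard(s, sample_codes) for s in shards)
```

In a real deployment each shard would be processed on a separate node; here the shards are processed one after another purely to show the data flow.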
Next, step S103 is performed: the matched data is processed to obtain matching rules, and the fingerprint base is updated, the fingerprint base storing the matching rules obtained historically. Specifically, a matching rule describes the rule that the sample data and the matched data have in common. More specifically, the fingerprint base stores the matching rules extracted from the matched data in previous executions of the technical solution of the embodiment of the present invention. Those skilled in the art will appreciate that continuously enriching the fingerprint base better supports subsequent iterations, so that the technical solution of the embodiment of the present invention can obtain more data from the mass data by matching against the updated fingerprint base.
Finally, step S104 is performed: matching extraction is carried out on the mass data based on the updated fingerprint base to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and the database is expanded with the data so obtained. In a preferred example, the mass data is processed item by item against the updated fingerprint base, the matching results are collated and recorded, and the matched data is updated into the database, thereby effectively expanding the volume of the database.
In a variation of this embodiment, after step S104 has been performed, step S101 may be started again based on the expanded database, so that more sample data is generated from the expanded database, more matched data is detected in the mass data, and the database is ultimately expanded further.
In summary, with the scheme of the first embodiment, matching rules are generated from sample data, matching extraction is performed on the original data source (i.e. the mass data) according to those rules to expand the database, and sample data is then determined again from the expanded database and the preceding steps are repeated. The technical solution of the embodiment of the present invention thus forms a closed-loop iterative processing mechanism, which helps users carry out global, systematic analysis and processing of mass data more accurately and efficiently.
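The closed loop of steps S101 to S104 may be sketched as below. This is a deliberately simplified toy model and not the embodiment itself: each record is assumed to be a (host, item_id) pair, the sample data is taken to be the item IDs in the database, and a matching rule is taken to be the host of a matched record. All of these choices, and the function name, are hypothetical.

```python
def run_closed_loop(database: set, mass_data: list, max_rounds: int = 10) -> set:
    """Expand `database` from `mass_data` until a round adds nothing new."""
    fingerprint_base = set()
    for _ in range(max_rounds):
        # S101: determine sample data from the database (here: known item IDs).
        samples = {item_id for _, item_id in database}
        # S102: search the mass data for records matching the sample data.
        matched = [rec for rec in mass_data if rec[1] in samples]
        # S103: derive matching rules (here: hosts) and update the fingerprint base.
        fingerprint_base |= {host for host, _ in matched}
        # S104: extract every record that matches a rule in the fingerprint base.
        extracted = {rec for rec in mass_data if rec[0] in fingerprint_base}
        if extracted <= database:
            break  # nothing new was found, so the loop has converged
        database |= extracted
    return database
```

With a database seeded by a single record, each round can reach hosts that share an item ID with data found in the previous round, which is the iterative enrichment the embodiment describes.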
Fig. 2 is a flowchart of a data snooping and extending method based on sample data according to the second embodiment of the present invention. Specifically, in this embodiment, step S201 is performed first: a preset number of pieces of data are selected from the database, and the characteristic information of the preset number of pieces of data is used as the sample data. More specifically, the preset number is determined by the user according to the data-handling capacity of the hardware or software that executes the embodiment of the present invention. Preferably, the characteristic information may be the signature identification code of the preset number of pieces of data. For example, when the data is the URL information of a commodity, the signature identification code may be the identity code (identification, abbreviated ID) of the commodity, and the identity code may be extracted from the URL information corresponding to the commodity.
Step S202 is then performed: the mass data is searched for data having the same characteristic information as the sample data, and the data having the same characteristic information is used as the matched data. Preferably, for mass data that is likewise represented as URLs, the URL of each piece of mass data may be split by structure into three match positions (host, path, and query), and the chosen match position or positions (any one, any two, or all three) are compared with the sample data to find the data in the mass data that has the same characteristic information as the sample data. Preferably, for sample data whose characteristic information is a signature identification code, different matching rules may be used to search the mass data for data having the same characteristic information as the sample data.
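Splitting a URL into the three match positions may be done, for example, with Python's standard `urllib.parse.urlsplit`. The helper names below and the substring comparison are assumptions for illustration; the embodiment does not prescribe a particular comparison.

```python
from urllib.parse import urlsplit


def match_positions(url: str) -> dict:
    """Split a URL into the three match positions: host, path, and query."""
    parts = urlsplit(url)
    return {"host": parts.netloc, "path": parts.path, "query": parts.query}


def matches_sample(url: str, code: str, positions=("host", "path", "query")) -> bool:
    """Check whether `code` occurs in any of the chosen match positions."""
    pos = match_positions(url)
    return any(code in pos[name] for name in positions)
```

Passing a single name, two names, or all three in `positions` corresponds to choosing one, two, or all match positions.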
In a preferred example, the host position in the URL of the mass data may be matched, and a left-containing match may be used to find which data in the mass data has, at the host position of its URL, the same signature identification code as the sample data. Preferably, a left-containing match means that the left side of the character string at the position to be matched (the host position in the preceding example) completely matches the signature identification code of the sample data. For example, if the host position of the URL of a piece of data in the mass data contains the character string item_44123_abcde, the character string can be considered to completely match the sample data represented by the signature identification code item_44123, so that this piece of mass data is determined to have the same characteristic information as the sample data.
Next, step S203 is performed: the matched data is processed to obtain matching rules, and the fingerprint base is updated, the fingerprint base storing the matching rules obtained historically. Specifically, those skilled in the art may refer to step S103 of the embodiment shown in Fig. 1 above, which is not repeated here. Preferably, the matching rules are used to filter and extract the features that multiple pieces of the matched data have in common.
Finally, step S204 is performed: matching extraction is carried out on the mass data based on the updated fingerprint base to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and the database is expanded with the data so obtained. Specifically, those skilled in the art may refer to step S104 of the embodiment shown in Fig. 1 above, which is not repeated here. In a preferred example, all data in the mass data is matched item by item by match position, for example in the order host first, then path, and finally query. Specifically, it is first judged whether the host part of the URL of the data matches the host part of a matching rule in the fingerprint base. If the two host parts do not match, the data is skipped and matching moves on to the other data in the mass data. If the two host parts match, the path part of the data is then matched against the path part of the matching rule. When the two path parts also match, the query part of the data is matched against the query part of the matching rule, finally determining whether the data matches the matching rule in the updated fingerprint base.
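The host, path, query matching order can be sketched with early exits, so that a host mismatch avoids the later comparisons. The rule layout (a dict with host, path, and query entries) and the use of substring containment as the per-part test are assumptions for this sketch only.

```python
from urllib.parse import urlsplit


def matches_rule(url: str, rule: dict) -> bool:
    """Match a URL against one rule in the order host, path, query."""
    parts = urlsplit(url)
    # Host first: a mismatch here skips the record without further work.
    if rule["host"] not in parts.netloc:
        return False
    # Path is compared only when the host part matched.
    if rule["path"] not in parts.path:
        return False
    # Query is compared last and decides the final result.
    return rule["query"] in parts.query


def extract_matches(mass_data: list, fingerprint_base: list) -> list:
    """Return every URL that matches at least one rule in the fingerprint base."""
    return [u for u in mass_data if any(matches_rule(u, r) for r in fingerprint_base)]
```

Because most records typically fail on the host part, checking it first keeps the item-by-item scan cheap.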
Further, if it is determined that the data satisfies the matching condition of the matching rule, the part of the data that matches the matching rule is extracted and updated into the database.
Further, the mass data is matched item by item against the updated fingerprint base to determine the data in the mass data that matches the matching rules in the updated fingerprint base, and the content of the matching parts is extracted and collated into the database, greatly expanding the volume of the database.
Further, when implementing the embodiment of the present invention, dirty data that may be obtained during detection and expansion can be screened by combining manual and/or automatic computer identification, so as to ensure the validity and accuracy of the data finally updated into the database.
In a variation of step S201, the characteristic information may also be a regular expression determined from the preset number of pieces of data. Those skilled in the art will appreciate that the regular expression may be used to match the characteristic information of all the data randomly selected from the database, or to match the characteristic information of all the data that the user wishes to detect and expand from the mass data.
For example, suppose detection and expansion in the mass data is to use equipment signature identification codes as sample data, and the sample data randomly selected from the database includes the equipment signature identification codes of both telecom devices and mobile devices. The equipment signature identification code of a telecom device is represented by an International Mobile Equipment Identity (IMEI), whereas that of a mobile device is represented by a Mobile Equipment Identifier (MEID). What the two kinds of equipment signature identification code have in common is that both are 11-digit numbers beginning with 1, so the regular expression can be determined with reference to this common point.
As another example, if all the data randomly selected from the database are media access control (MAC) addresses, the regular expression may be expressed as "/^([a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{12})$/".
As a further example, if the user wishes to detect and expand from the mass data the data obtained within a specific geographical area, a regular expression may also be used, with the specific geographical area expressed by constraints on longitude and latitude.
Further, searching the mass data under different matching rules for data having the same characteristic information as the sample data may include directly performing regular-expression matching between the part of the data to be matched (the chosen match position) and the regular expression of the sample data. If the part to be matched satisfies the matching condition of the regular expression, the data can be determined to have the same characteristic information as the sample data. For example, the regular expression of the sample data may be shop-(\d+)-. Then, for a piece of data whose part to be matched in the URL is shop-33415-23-test, since that part satisfies the logic of the regular expression, the data can be determined to have the same characteristic information as the sample data.
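The regular-expression match on the shop example can be reproduced with Python's `re` module. The sketch assumes the intended pattern is `shop-(\d+)-`, i.e. the literal prefix, one or more digits captured as a group, and a trailing hyphen; the function name is hypothetical.

```python
import re

# Assumed form of the sample regular expression from the shop example.
SHOP_RULE = re.compile(r"shop-(\d+)-")


def extract_shop_id(part_to_match: str):
    """Return the captured shop ID if the part to be matched satisfies the rule."""
    found = SHOP_RULE.search(part_to_match)
    return found.group(1) if found else None
```

The capture group is what makes the same expression usable for extraction as well as for the yes-or-no match.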
In a variation of step S202, the matching rule further includes right-side inclusion matching: if the character string at the position to be matched of a piece of data in the mass data completely matches, on its right side, the feature identification code of the sample data, it is determined that the data has the same characteristic information as the sample data. For example, if the path position of the URL of a piece of data in the mass data includes the character string car_shanghai_ser33456, and the feature identification code of the sample data is ser33456, it can be determined that the data has the same characteristic information as the sample data.
In another variation of step S202, the matching rule further includes exact-equality matching: if the character string at the position to be matched of a piece of data in the mass data is exactly equal to the feature identification code of the sample data, it is determined that the data has the same characteristic information as the sample data. For example, in the character string shop=33415&category=23&item=test, the value 33415 can be considered exactly equal to the feature identification code 33415.
In yet another variation of step S202, the matching rule further includes inclusion matching: if the character string at the position to be matched of a piece of data in the mass data contains the feature identification code of the sample data, it is determined that the data has the same characteristic information as the sample data. For example, the character string shop-33415-23-test can be considered to contain the feature identification code 33415.
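The three matching modes of these variations can be sketched as simple string predicates. This is an illustrative sketch under stated assumptions: right-side inclusion matching is interpreted here as "the string ends with the code", and the function names as well as the code ser33456 used for the suffix check are hypothetical.

```python
def right_side_match(text: str, code: str) -> bool:
    """Right-side inclusion matching, interpreted as: text ends with code."""
    return text.endswith(code)


def exact_match(text: str, code: str) -> bool:
    """Exact-equality matching: the value at the matching position equals code."""
    return text == code


def contains_match(text: str, code: str) -> bool:
    """Inclusion matching: code occurs anywhere in text."""
    return code in text


# Suffix check; the code "ser33456" is chosen so that it is a suffix of the string.
print(right_side_match("car_shanghai_ser33456", "ser33456"))
# The value of a query parameter compared with the code.
print(exact_match("33415", "33415"))
# The code embedded in a longer path segment.
print(contains_match("shop-33415-23-test", "33415"))
```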
In a variation of step S204, when the characteristic information is a regular expression determined from the predetermined number of data, and a currently scanned piece of data in the mass data has the same characteristic information as the sample data, a regular expression can be extracted directly from the currently scanned data and updated into the fingerprint base.
In a variation of this embodiment, when step S202 searches the mass data based on the sample data, if a preset restriction condition exists, the search is performed in the part of the mass data defined by the preset restriction condition, to obtain the matched data. Preferably, for data and sample data represented by URLs, the preset restriction condition may be the top-level domain name tld in the URL. For example, the user may define the top-level domain name tld for some or all of the sample data selected in step S201; then, when the technical solution of the embodiment of the present invention executes steps S202 to S204, for the sample data limited by the top-level domain name tld, preferably only data on the website where that top-level domain name tld is located is detected and expanded into the database.
Further, the preset restriction condition may be set according to user requirements, or according to the data-processing capacity of the equipment that executes the technical solution of the embodiment of the present invention.
Further, the top-level domain names tld of the sample data may be the same or different. For example, among all the sample data selected from the database, the top-level domain name tld of one half of the sample data and the top-level domain name tld of the other half may be set to different websites, so that data detection and retrieval are carried out on the two websites simultaneously based on the technical solution of the embodiment of the present invention.
In a typical application scenario, when a computer executes the technical solution of the embodiment of the present invention, the sample data is first loaded into the local memory of the computer. When some or all of the sample data has a preset top-level domain name tld, a mapping table can be built in the memory; the mapping table is used to store, by category, the characteristic information or regular expressions of the one or more sample data that have the same top-level domain name tld.
Preferably, for application scenarios in which the characteristic information is the feature identification code, a character match tree can be built from the one or more sample data that have the same top-level domain name tld, to improve matching efficiency during subsequent detection and expansion of data.
Preferably, for application scenarios in which the characteristic information is the regular expression, the respective regular expressions of the one or more sample data that have the same top-level domain name tld can also be stored as a list, for use in the subsequent detection and expansion steps. As a variation, the regular expression can also be determined jointly for the multiple sample data that have the same top-level domain name tld.
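The in-memory mapping table described above can be sketched as a dictionary keyed by tld. This is a minimal sketch, not the patent's implementation: the function name build_mapping_table, the (tld, feature) pair layout and the domain example.org are assumptions made for illustration.

```python
from collections import defaultdict


def build_mapping_table(samples):
    """Group the characteristic information of the samples by their tld.

    samples is an iterable of (tld, feature) pairs, where feature is either
    a feature identification code or a regular expression string.
    """
    table = defaultdict(list)
    for tld, feature in samples:
        table[tld].append(feature)
    return dict(table)


# Hypothetical sample data: two samples on host.com, one on example.org.
samples = [
    ("host.com", "item_1234"),
    ("host.com", "item_1368"),
    ("example.org", r"shop-(\d+)-"),
]
print(build_mapping_table(samples))
```

Grouping by tld first means that, for each scanned record, only the features stored under that record's tld need to be compared.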
Further, when the sample data and the mass data are represented by URLs, step S202 preferably first processes the URL of the sample data to obtain the top-level domain name tld corresponding to the sample data. Then, when scanning the mass data one by one, it judges whether the URL of the currently scanned data contains the top-level domain name tld. If the judgment shows that the URL of the currently scanned data does not contain the top-level domain name tld, the data is skipped directly. Otherwise, if the URL of the currently scanned data contains the top-level domain name tld, step S202 continues: the URL of the data is compared with the characteristic information of the sample data at the chosen matching position, so as to find, in the mass data, the data that has the same characteristic information as the sample data.
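The skip-then-compare scan described above can be sketched as follows. This is a sketch under assumptions: the tld test is implemented as a host-suffix check (stricter than a plain "URL contains tld" test), the comparison is a simple substring check over the whole URL, and the names belongs_to_tld and scan are hypothetical.

```python
from urllib.parse import urlsplit


def belongs_to_tld(url: str, tld: str) -> bool:
    """Return True if the URL's host is tld itself or a subdomain of it."""
    host = urlsplit(url).hostname or ""
    return host == tld or host.endswith("." + tld)


def scan(urls, tld, feature):
    """Scan records one by one, skipping those outside the tld."""
    matched = []
    for url in urls:
        if not belongs_to_tld(url, tld):
            continue  # skipped directly, no further matching work
        if feature in url:
            matched.append(url)
    return matched


urls = [
    "http://a.host.com/item/item_1234/detail.html",
    "http://a.example.org/item/item_1234/detail.html",
    "http://b.host.com/index.html",
]
print(scan(urls, "host.com", "item_1234"))
```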
In summary, with the scheme of the second embodiment, data in the mass data that has the same characteristic information as the sample data can be detected according to the sample data, so that the data finally expanded into the database all share the same characteristic information. This meets the user's practical need to discover and expand data of a specific type in mass data.
Those skilled in the art will understand that steps S201 and S202 of this embodiment, and their variations, can be understood as a specific implementation of steps S101 and S102 of the embodiment shown in Fig. 1. The preset restriction condition reduces the matching workload when matching in the mass data, and at the same time allows the user to carry out data detection and expansion for a specific website. Further, the user can choose according to actual needs whether to set the preset restriction condition. When the user does not set the preset restriction condition, the embodiment of the present invention takes all access records on the Internet as the mass data for data detection and expansion (that is, a whole-network search). When the user sets the preset restriction condition, the embodiment of the present invention takes the access records on the one or more websites defined by the preset restriction condition as the mass data, to obtain the data the user needs (that is, a specific-website search).
As a variation, when the user chooses a whole-network search, the embodiment of the present invention may first execute the technical solution of the embodiment of the present invention on multiple websites to obtain the matching rule from each website. After the respective matching rules of the multiple websites are integrated into a general matcher, the general matcher is used as the characteristic information of the sample data to perform the whole-network search.
Fig. 3 is a flow chart of a data snooping and extending method based on sample data according to the third embodiment of the present invention. Specifically, in this embodiment, step S301 is performed first: a predetermined number of data is selected from the database, and the characteristic information of the predetermined number of data is used as the sample data. More specifically, those skilled in the art may refer to step S201 of the embodiment shown in Fig. 2 above, which is not repeated here.
Then step S302 is performed: data that has the same characteristic information as the sample data is searched for in the mass data, and the data having the same characteristic information is used as the matched data. Specifically, those skilled in the art may refer to step S202 of the embodiment shown in Fig. 2 above, which is not repeated here.
Next, step S303 is performed: the matched data is structured to obtain standard data arranged in a preset format. Specifically, the result of the structuring can be represented in the form of a table, in which all or part of the content of the matched data is recorded by category. More specifically, the standard data may be the result obtained after the content of the table is arranged in the preset format. In a preferred example, the matched data is also represented in URL form, and the categories recorded in the table include the top-level domain name tld, the port (port), the match parameter (querykey), the matching position, the matching content and the matching mode. This step can split the URL of the matched data according to the categories recorded in the table, and then reorder and integrate the split results according to the preset format; the reordered and integrated result is the standard data.
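The structuring step can be sketched as splitting a matched URL into the listed categories. This is a hedged illustration only: the function name structure, the dictionary keys, the labels "equal" and "contain" for the matching mode, and the host, path, query lookup order are assumptions, and the tld is passed in rather than derived from the URL.

```python
from urllib.parse import urlsplit, parse_qsl


def structure(url: str, tld: str, feature: str):
    """Split a matched URL into tld, port, querykey, position, content, mode."""
    parts = urlsplit(url)
    record = {
        "tld": tld,
        "port": parts.port or "",  # empty when the default port is omitted
        "querykey": "",            # empty unless the match is in the query
        "position": "",
        "content": feature,
        "mode": "",
    }
    if feature in (parts.hostname or ""):
        record.update(position="host", mode="contain")
        return record
    if feature in parts.path:
        record.update(position="path", mode="contain")
        return record
    for key, value in parse_qsl(parts.query):
        if value == feature:
            record.update(querykey=key, position="query", mode="equal")
            return record
    return None  # the feature does not occur in this URL


print(structure("http://c.host.com:1234/test?id=item_1234", "host.com", "item_1234"))
print(structure("http://a.host.com:3567/item/item_1234/detail.html", "host.com", "item_1234"))
```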
Then step S304 is performed: the matching rule is generated based on the standard data and deduplicated. In a preferred example, the standard data can first be converted into the matching rule according to the preset format, and then duplicate items in the converted matching rules are removed to obtain the deduplicated matching rules. Those skilled in the art will understand that, after the processing of step S303, the standard data may contain only the key information needed for the subsequent matching work and cannot be applied directly to the subsequent steps. The processing of this step is therefore needed to convert the standard data into the matching rule according to the preset format, for use in the subsequent steps. On the other hand, because the URLs of the same website are generally designed in similar ways, this step can deduplicate all the matching rules after the conversion has produced them, to reject duplicate items among the matching rules obtained by the conversion in this step.
Next, step S305 is performed: the fingerprint base is updated based on the deduplicated matching rules. Specifically, the update includes storing the deduplicated matching rules in the fingerprint base. More specifically, the update also includes rejecting, from the deduplicated matching rules, those that repeat matching rules already existing in the fingerprint base. In a preferred example, the deduplicated matching rules are compared with the matching rules in the fingerprint base to remove duplicate items a second time, and the matching rules remaining after the second removal of duplicate items are then updated into the fingerprint base.
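The two rounds of duplicate removal in steps S304 and S305 can be sketched as below. This is a minimal sketch assuming that a matching rule is represented as a hashable tuple and that the fingerprint base is an in-memory list; the tuple layout and the function names are hypothetical.

```python
def deduplicate(rules):
    """First round (S304): drop repeated rules, keeping first-seen order."""
    seen = set()
    unique = []
    for rule in rules:
        if rule not in seen:
            seen.add(rule)
            unique.append(rule)
    return unique


def update_fingerprint_base(base, new_rules):
    """Second round (S305): add only rules not already in the fingerprint base."""
    existing = set(base)
    added = [rule for rule in deduplicate(new_rules) if rule not in existing]
    base.extend(added)
    return added


# Hypothetical rules of the form (tld, querykey, position).
fingerprint_base = [("host.com", "item_id", "query")]
converted = [
    ("host.com", "item_id", "query"),  # already in the fingerprint base
    ("host.com", "", "path"),
    ("host.com", "", "path"),          # duplicate from the conversion
]
print(update_fingerprint_base(fingerprint_base, converted))
print(fingerprint_base)
```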
Finally, step S306 is performed: matching extraction is carried out in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and the data obtained by matching is expanded into the database. Specifically, those skilled in the art may refer to step S104 of the embodiment shown in Fig. 1 above, which is not repeated here.
Further, the matching rule can be understood as a combination of filtering data and extracting data.
In one preferred application scenario, the top-level domain name tld and the match parameter in the matching rule can be used to filter data. For example, when step S305 is performed, it can first be judged preliminarily, based on the top-level domain name tld and the match parameter, whether the currently scanned data in the mass data is worth further matching work. If the top-level domain name tld of the currently scanned data does not correspond to the top-level domain name tld recorded in the matching rule, the currently scanned data can be rejected directly, which reduces the amount of matching in the embodiment of the present invention and improves matching efficiency.
In another preferred application scenario, the matching mode, the matching position, and the matching content or regular expression in the matching rule can be used to extract data, so as to finally determine whether the currently scanned data has the same characteristic information as the sample data.
Further, the fingerprint base and the database may be stored in the computer that executes the embodiment of the present invention, in another storage device coupled to the computer, or in the cloud.
In summary, with the scheme of the third embodiment, steps S303, S304 and S305 of this embodiment can be understood as a specific implementation of step S103 of the embodiment shown in Fig. 1 above, or of step S203 of the embodiment shown in Fig. 2 above. Through the structuring, multiple matched data obtained by matching in different ways can be given a highly uniform format, which benefits subsequent processing. On the other hand, the deduplication in step S304 and the second deduplication in step S305 ensure that no duplicate items appear among the matching rules in the fingerprint base, so that storage resources are not wasted needlessly.
In a typical application scenario, the data is the commodities sold on a certain website, and the data is represented in URL form. The database stores some of the commodities sold on the website, and the user wishes to obtain information on the other commodities sold on the website. The user can then use the technical solution of the embodiment of the present invention to randomly select a predetermined number of commodities from the multiple commodities already in the database, and use the numbers of the selected commodities on the website as the feature identification codes of the selected commodities. For example, the domain name of the website is host.com (that is, the preset restriction condition is set with the top-level domain name tld), and the user has selected 2 commodities in the database as the sample data, where the number of commodity A on the website is item1234 and the number of commodity B on the website is item1368; the sample data is then item1234 and item1368.
When the technical solution of the embodiment of the present invention searches the mass data based on the sample data, the sample data can first be loaded into the local memory of the computer that executes the embodiment of the present invention, and a dictionary is built. The key (key) of the dictionary is the top-level domain name tld of the sample data (host.com in this application scenario), and the value (value) of the dictionary is the character match tree under that top-level domain name tld. Preferably, the character match tree is built after the character strings of all the sample data are split into single characters. Preferably, in this application scenario, the character match tree shown in Fig. 4 can be built based on the sample data item1234 and item1368.
Then, the mass data is scanned one by one and searched based on the character match tree. Specifically, it is first judged whether the top-level domain name tld in the URL of the currently scanned data in the mass data is equal to host.com. If not, the currently scanned data is skipped; if so, the subsequent matching work is carried out for the currently scanned data.
In this application scenario, for currently scanned data whose top-level domain name tld is equal to host.com, equality matching needs to be carried out on the query part of the URL of the currently scanned data (that is, the matching position is the query and the matching rule is equality matching). Taking as an example the currently scanned data represented by the URL http://a.host.com/path/test.html?qk1=i123&qk2=item_1246&item_id=item_1234, the URL can first be split to obtain the query part of the URL of the currently scanned data. The query part is then further split by the separators "&" and "=", which yields a dictionary in key-value form, {"qk1":"i123","qk2":"item_1246","item_id":"item_1234"}. The dictionary is then traversed, and each value in the dictionary is looked up character by character in the character match tree shown in Fig. 4.
For example, when the value i123 is matched, the character i is matched first, and the match succeeds. The second character of the value i123, the character 1, is matched next. In the character match tree shown in Fig. 4, the child list of the character i contains only the character t and does not contain 1, so the matching of the value i123 is unsuccessful.
As another example, when the value item_1246 is matched, the first character i matches successfully. The second character t is contained in the child list of the character i in the character match tree shown in Fig. 4. The third character e is likewise in the child list of the character t in the character match tree shown in Fig. 4. Similarly, the following characters up to and including the character 1 all match the character match tree shown in Fig. 4. The character 2 is matched next: the character 1 in the character match tree shown in Fig. 4 has two child nodes [2, 3], which include the character 2 to be matched, so matching can continue down to the character 4. When the character 4 is matched, since the matching of the preceding character 2 has determined that the value item_1246 may match the branch of the character 2 among the child nodes [2, 3] below the character 1, matching of the character 4 continues along the branch of the character 2. However, in the character match tree shown in Fig. 4, the child node below the node of the character 2 in that branch is the character 3, which does not include the character 4 to be matched, so the matching of the value item_1246 is also unsuccessful.
As yet another example, when the value item_1234 is matched, following the foregoing steps of matching against the character match tree shown in Fig. 4, it can be determined that the value item_1234 matches the character match tree shown in Fig. 4 completely. It is therefore determined that the URL of the scanned data contains the sample data, and that the match parameter is the commodity ID.
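The character match tree and the character-by-character lookup walked through above can be sketched as a nested-dictionary trie. This is an illustration under assumptions, not the patent's implementation: the samples are written with an underscore (item_1234, item_1368), as they appear in the URL values, the end-of-sample marker "$end" is an implementation detail, and the function names are hypothetical.

```python
def build_match_tree(samples):
    """Build a character match tree: each node maps a character to its children."""
    root = {}
    for sample in samples:
        node = root
        for ch in sample:
            node = node.setdefault(ch, {})
        node["$end"] = True  # marks the end of a complete sample
    return root


def matches(tree, value):
    """Walk the tree one character at a time; fail on the first missing child."""
    node = tree
    for ch in value:
        if ch not in node:
            return False
        node = node[ch]
    return "$end" in node


# Dictionary keyed by tld, with the character match tree as the value.
trees = {"host.com": build_match_tree(["item_1234", "item_1368"])}

# Key-value pairs obtained from the query part in the example above.
query = {"qk1": "i123", "qk2": "item_1246", "item_id": "item_1234"}
matched_keys = [key for key, value in query.items() if matches(trees["host.com"], value)]
print(matched_keys)
```

With this representation, i123 fails at its second character and item_1246 fails at the character 4, which mirrors the walk described above, while item_1234 reaches the end of a stored sample.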
Table 1: List of matched data represented by URLs
http://a.host.com/path/test.html?qk1=i123&qk2=item_1246&item_id=item_1234
http://b.host.com/test?item_id=item_1368&a=c
http://c.host.com:1234/test?id=item_1234
http://item_1368.host.com/detai_info.html
http://a.host.com:3345/category-1234-item_1234-t12
http://a.host.com:3567/item/item_1234/detail.html
By continuing to scan the mass data, further matched data represented by URLs can be obtained. The matched data can include the URLs shown in Table 1 above.
Table 2: List of standard data after the structuring of Table 1
As shown in Table 2, after the one-by-one scan of the mass data based on the sample data is completed, the matched data obtained by this search can be structured to obtain the standard data represented in the preset format. Preferably, the standard data is arranged in the order of top-level domain name tld, port (port), match parameter (querykey), matching position, matching content and matching mode, where default content is left blank. For example, when the port has the default value (that is, 80), it can generally be omitted from the URL, and it is then also left blank in the standard data. As another example, for matched data that the embodiment of the present invention finds in the mass data with the path as the matching position, after the matched data is structured into standard data, the match parameter of the standard data is empty.
Table 3: List of matching rules obtained by converting the standard data of Table 2
The standard data listed in Table 2 is converted into the matching rules according to the preset format, as shown in Table 3. Here, (item_\d+) is a regular expression that represents a character string beginning with item_ and followed by digits.
Further, according to the matching rule and the match parameter, the second row can be removed as a duplicate from the matching rules listed in Table 3. The remaining rules are then compared with the matching rules already existing in the fingerprint base, any matching rule in Table 3 that repeats a matching rule already existing in the fingerprint base is removed, and the matching rules that have been deduplicated twice are finally updated into the fingerprint base.
Further, the updated fingerprint base is applied to the mass data again, and the mass data is rescanned in the order of host, path, query. The newly added matching rules can then match the URLs of more commodities. By updating the database with the URLs of the commodities obtained by matching (or with the parts of the commodity URLs that conform to the matching rules), the expansion of the database is finally achieved.
For example, the newly added matching rule http://*.host.com/*?item_id=* may match a commodity whose URL is http://test1.host.com/a?item_id=test1&b=c, or a commodity whose URL is http://test1.host.com/path/subpath/subpath/a.html?q1=v1&q2=v2&item_id=11111. In these two newly matched commodity URLs, the parts that conform to the matching rule are test1 and 11111.
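Applying such a rule can be sketched as a tld filter followed by extraction of the query value. This is a sketch under assumptions: the rule is reduced to the pair (tld, querykey), the "?" introduces the query part of each URL, and the function name extract is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def extract(url: str, tld: str, querykey: str):
    """Apply a rule of the form http://*.<tld>/*?<querykey>=* to one URL.

    Returns the value of querykey if the URL is on the tld and carries
    that parameter, otherwise None.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not (host == tld or host.endswith("." + tld)):
        return None  # filtered out by the tld
    values = parse_qs(parts.query).get(querykey)
    return values[0] if values else None


urls = [
    "http://test1.host.com/a?item_id=test1&b=c",
    "http://test1.host.com/path/subpath/subpath/a.html?q1=v1&q2=v2&item_id=11111",
]
for url in urls:
    print(extract(url, "host.com", "item_id"))
```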
Those skilled in the art will understand that, through the technical solution of the embodiment of the present invention, the two data test1 and 11111 are finally expanded based on the sample data item_1234 in the database. In practical applications, the technical solution of the embodiment of the present invention can discover potential data in a large number of long-tail URLs, thereby greatly expanding the database and achieving deep mining of mass data.
Fig. 5 is a schematic structural diagram of a data snooping and expanding device based on sample data according to the fourth embodiment of the present invention. Those skilled in the art will understand that the data snooping and expanding device 4 of this embodiment is used to implement the method technical solutions of the embodiments shown in Fig. 1 to Fig. 4 above. Specifically, in this embodiment, the data snooping and expanding device 4 includes: a determining module 41, for determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data obtained by detection from mass data; a searching module 42, for searching the mass data based on the sample data, to obtain the matched data in the mass data that matches the sample data; an update module 43, for processing the matched data to obtain matching rules and updating a fingerprint base, the fingerprint base storing the matching rules obtained historically; and an extraction module 44, for carrying out matching extraction in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and expanding the data obtained by matching into the database.
Further, the determining module 41 includes a selection submodule 411, for selecting a predetermined number of data from the database and using the characteristic information of the predetermined number of data as the sample data. Preferably, the characteristic information includes the feature identification codes of the predetermined number of data, or a regular expression determined from the predetermined number of data.
Further, the searching module 42 includes a first searching submodule 421, for searching the mass data for data that has the same characteristic information as the sample data, and using the data having the same characteristic information as the matched data.
Further, the searching module 42 also includes a second searching submodule 422. The second searching submodule 422 is used, when searching the mass data based on the sample data, to search in the part of the mass data defined by a preset restriction condition if the preset restriction condition exists, to obtain the matched data.
Further, the update module 43 includes: a processing submodule 431, for structuring the matched data to obtain standard data arranged in a preset format; a generation submodule 432, for generating the matching rules based on the standard data and deduplicating them; and an update submodule 433, for updating the fingerprint base based on the deduplicated matching rules.
Further, the generation submodule 432 includes: a converting unit 4321, for converting the standard data into the matching rules according to the preset format; and a deduplication unit 4322, for removing duplicate items from the converted matching rules to obtain the deduplicated matching rules.
Further, the update submodule 433 includes: a comparing unit 4331, for comparing the deduplicated matching rules with the matching rules in the fingerprint base, to remove duplicate items a second time; and an updating unit 4332, for updating the matching rules remaining after the second removal of duplicate items into the fingerprint base.
Preferably, the data is Internet access records.
For more on the operating principle and working mode of the data snooping and expanding device 4, reference may be made to the related descriptions of Fig. 1 to Fig. 4, which are not repeated here.
Those of ordinary skill in the art will understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can include ROM, RAM, a magnetic disk, an optical disc, and the like.
Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art can make various changes or modifications without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall therefore be as defined by the claims.

Claims (18)

1. A data snooping and extending method based on sample data, characterized by comprising the following steps:
determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data obtained by detection from mass data;
searching the mass data based on the sample data, to obtain matched data in the mass data that matches the sample data;
processing the matched data to obtain matching rules, and updating a fingerprint base, the fingerprint base storing the matching rules obtained historically;
carrying out matching extraction in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and expanding the data obtained by matching into the database.
2. The data snooping and extending method based on sample data according to claim 1, characterized in that determining the sample data based on at least one piece of data in the database comprises the following step:
selecting a predetermined number of data from the database, and using the characteristic information of the predetermined number of data as the sample data.
3. The data snooping and extending method based on sample data according to claim 2, characterized in that the characteristic information includes:
the feature identification codes of the predetermined number of data; or
a regular expression determined from the predetermined number of data.
4. The data snooping and extending method based on sample data according to claim 2, characterized in that searching the mass data based on the sample data, to obtain the matched data in the mass data that matches the sample data, comprises the following step:
searching the mass data for data that has the same characteristic information as the sample data, and using the data having the same characteristic information as the matched data.
5. The data snooping and extending method based on sample data according to claim 4, characterized in that, when searching the mass data based on the sample data, if a preset restriction condition exists, the search is performed in the part of the mass data defined by the preset restriction condition, to obtain the matched data.
6. The data snooping and extending method based on sample data according to claim 1, characterized in that processing the matched data to obtain matching rules, and updating the fingerprint base, comprises the following steps:
structuring the matched data to obtain standard data arranged in a preset format;
generating the matching rules based on the standard data and deduplicating them;
updating the fingerprint base based on the deduplicated matching rules.
7. The data snooping and extending method based on sample data according to claim 6, characterized in that generating the matching rules based on the standard data and deduplicating them comprises the following steps:
converting the standard data into the matching rules according to the preset format;
removing duplicate items from the converted matching rules to obtain the deduplicated matching rules.
8. The data snooping and extending method based on sample data according to claim 6, characterized in that updating the fingerprint base based on the deduplicated fingerprint comprises the following steps:
comparing the deduplicated matching rules with the matching rules in the fingerprint base, to remove duplicate items a second time;
updating the matching rules remaining after the second removal of duplicate items into the fingerprint base.
9. The data snooping and extending method based on sample data according to any one of claims 1 to 8, characterized in that the data is Internet access records.
10. A data snooping and expanding device based on sample data, characterized by comprising:
a determining module, for determining the sample data based on at least one piece of data in a database, the database storing multiple pieces of data obtained by detection from mass data;
a searching module, for searching the mass data based on the sample data, to obtain matched data in the mass data that matches the sample data;
an update module, for processing the matched data to obtain matching rules, and updating a fingerprint base, the fingerprint base storing the matching rules obtained historically;
an extraction module, for carrying out matching extraction in the mass data based on the updated fingerprint base, to obtain the data in the mass data that matches the matching rules in the updated fingerprint base, and expanding the data obtained by matching into the database.
11. The data snooping and expanding device based on sample data according to claim 10, characterized in that the determining module comprises:
A selection submodule, configured to select a preset number of pieces of data from the database, and to use feature information of the preset number of pieces of data as the sample data.
12. The data snooping and expanding device based on sample data according to claim 11, characterized in that the feature information comprises:
A signature identification code of the preset number of pieces of data; or
A regular expression determined from the preset number of pieces of data.
13. The data snooping and expanding device based on sample data according to claim 11, characterized in that the searching module comprises:
A first searching submodule, configured to search the mass data for data having the same feature information as the sample data, and to use the data having the same feature information as the matched data.
14. The data snooping and expanding device based on sample data according to claim 13, characterized in that the searching module further comprises a second searching submodule, configured to, when searching in the mass data based on the sample data and if a preset limit condition exists, search within the portion of the mass data defined by the preset limit condition, to obtain the matched data.
15. The data snooping and expanding device based on sample data according to claim 10, characterized in that the update module comprises:
A processing submodule, configured to perform structuring processing on the matched data to obtain standard data arranged in a preset format;
A generation submodule, configured to generate the matched rule based on the standard data and to remove duplicates;
An update submodule, configured to update the fingerprint base based on the matched rule after duplicate removal.
16. The data snooping and expanding device based on sample data according to claim 15, characterized in that the generation submodule comprises:
A converting unit, configured to convert the standard data into the matched rule according to the preset format;
A duplicate removal unit, configured to remove duplicate items from the matched rule obtained by conversion, to obtain the matched rule after duplicate removal.
17. The data snooping and expanding device based on sample data according to claim 16, characterized in that the update submodule comprises:
A comparing unit, configured to compare the matched rule after duplicate removal with the matched rules in the fingerprint base, to remove duplicate items a second time;
An updating unit, configured to update the matched rule remaining after the second duplicate removal into the fingerprint base.
18. The data snooping and expanding device based on sample data according to any one of claims 10 to 17, characterized in that the data are internet access records.
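
The flow recited in the claims (search under an optional limit condition, structuring, rule generation with duplicate removal, a second duplicate removal against the fingerprint base, then matching extraction) can be illustrated with a minimal Python sketch. It is not the patented implementation. The record fields ("feature", "host", "region"), the choice of a lowercase host as the preset format, the exact-host regular expression used as a matched rule, and all function names are assumptions made only for illustration.

```python
import re
from typing import Callable, Iterable, Optional

Record = dict  # hypothetical access record with "feature", "host", "region" keys


def find_matches(mass_data: Iterable[Record], sample_features: set,
                 limit: Optional[Callable[[Record], bool]] = None) -> list:
    """Return records whose feature information equals that of the sample data.

    If a preset limit condition is supplied, only records it admits are searched.
    """
    return [r for r in mass_data
            if (limit is None or limit(r)) and r["feature"] in sample_features]


def structure(record: Record) -> str:
    """Structuring step: reduce a record to an assumed preset format (lowercase host)."""
    return record["host"].strip().lower()


def to_rule(standard: str) -> str:
    """Convert standard data into a matched rule, assumed here to be an exact-host regex."""
    return "^" + re.escape(standard) + "$"


def generate_rules(matched: Iterable[Record]) -> list:
    """Generate rules from matched data and remove duplicates, keeping first-seen order."""
    return list(dict.fromkeys(to_rule(structure(r)) for r in matched))


def update_fingerprint_base(fingerprint_base: set, rules: Iterable[str]) -> list:
    """Second duplicate removal against stored rules, then add only the new ones."""
    new_rules = [rule for rule in rules if rule not in fingerprint_base]
    fingerprint_base.update(new_rules)
    return new_rules


def extract(mass_data: Iterable[Record], fingerprint_base: set) -> list:
    """Return every record in the mass data that matches a rule in the fingerprint base."""
    patterns = [re.compile(rule) for rule in fingerprint_base]
    return [r for r in mass_data
            if any(p.match(structure(r)) for p in patterns)]
```

In this sketch the expansion effect comes from the rules being more general than the sample features: records that do not share the sample's feature information but visit the same host are still returned by extract and could then be added to the database.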
CN201611264829.8A 2016-12-30 2016-12-30 Data detection and expansion method and device based on sample data Active CN106844553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611264829.8A CN106844553B (en) 2016-12-30 2016-12-30 Data detection and expansion method and device based on sample data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611264829.8A CN106844553B (en) 2016-12-30 2016-12-30 Data detection and expansion method and device based on sample data

Publications (2)

Publication Number Publication Date
CN106844553A (en) 2017-06-13
CN106844553B (en) 2020-05-01

Family

ID=59117193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611264829.8A Active CN106844553B (en) 2016-12-30 2016-12-30 Data detection and expansion method and device based on sample data

Country Status (1)

Country Link
CN (1) CN106844553B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815488A (en) * 2018-12-26 2019-05-28 出门问问信息科技有限公司 Natural language understanding training data generation method, device, equipment and storage medium
CN111680286A (en) * 2020-02-27 2020-09-18 中国科学院信息工程研究所 Refinement method of Internet of things equipment fingerprint database
CN111797085A (en) * 2020-06-22 2020-10-20 中国平安财产保险股份有限公司 Request data processing method and device, computer equipment and storage medium
CN114511476A (en) * 2021-12-21 2022-05-17 中科环森智慧科技(苏州)有限公司 Intelligent analysis application system for image data

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647272B (en) * 2018-04-28 2020-12-29 江南大学 Method for predicting concentration of butane at bottom of debutanizer by expanding small samples based on data distribution

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1952929A (en) * 2005-10-20 2007-04-25 关涛 Extraction method and system of structured data of internet based on sample & faced to regime
CN103942282A (en) * 2014-04-02 2014-07-23 新浪网技术(中国)有限公司 Sample data obtaining method, device and system
CN104063474A (en) * 2014-06-30 2014-09-24 五八同城信息技术有限公司 Sample data collection system
CN105095240A (en) * 2014-05-04 2015-11-25 中国银联股份有限公司 Database data sample acquisition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1952929A (en) * 2005-10-20 2007-04-25 关涛 Extraction method and system of structured data of internet based on sample & faced to regime
CN103942282A (en) * 2014-04-02 2014-07-23 新浪网技术(中国)有限公司 Sample data obtaining method, device and system
CN105095240A (en) * 2014-05-04 2015-11-25 中国银联股份有限公司 Database data sample acquisition
CN104063474A (en) * 2014-06-30 2014-09-24 五八同城信息技术有限公司 Sample data collection system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815488A (en) * 2018-12-26 2019-05-28 出门问问信息科技有限公司 Natural language understanding training data generation method, device, equipment and storage medium
CN111680286A (en) * 2020-02-27 2020-09-18 中国科学院信息工程研究所 Refinement method of Internet of things equipment fingerprint database
CN111680286B (en) * 2020-02-27 2022-06-10 中国科学院信息工程研究所 Refinement method of Internet of things equipment fingerprint library
CN111797085A (en) * 2020-06-22 2020-10-20 中国平安财产保险股份有限公司 Request data processing method and device, computer equipment and storage medium
CN114511476A (en) * 2021-12-21 2022-05-17 中科环森智慧科技(苏州)有限公司 Intelligent analysis application system for image data

Also Published As

Publication number Publication date
CN106844553B (en) 2020-05-01

Similar Documents

Publication Publication Date Title
US7818303B2 (en) Web graph compression through scalable pattern mining
CN111782965B (en) Intention recommendation method, device, equipment and storage medium
JP5092165B2 (en) Data construction method and system
CN106844553A (en) Data snooping and extending method and device based on sample data
CN102855309B (en) A kind of information recommendation method based on user behavior association analysis and device
CN112165462A (en) Attack prediction method and device based on portrait, electronic equipment and storage medium
US20080270549A1 (en) Extracting link spam using random walks and spam seeds
JP2015512095A (en) Method, apparatus and computer readable recording medium for image management in an image database
CN102122291A (en) Blog friend recommendation method based on tree log pattern analysis
CN113254630B (en) Domain knowledge map recommendation method for global comprehensive observation results
US20120143844A1 (en) Multi-level coverage for crawling selection
CN110727663A (en) Data cleaning method, device, equipment and medium
CN112231700B (en) Behavior recognition method and apparatus, storage medium, and electronic device
KR102601545B1 (en) Geographic position point ranking method, ranking model training method and corresponding device
CN105389328B (en) A kind of extensive open source software searching order optimization method
CN103226601A (en) Method and device for image search
CN112231481A (en) Website classification method and device, computer equipment and storage medium
CN103761298A (en) Distributed-architecture-based entity matching method
CN109614521A (en) A kind of efficient secret protection subgraph inquiry processing method
CN105357118A (en) Rule based flow classifying method and system
CN110457600B (en) Method, device, storage medium and computer equipment for searching target group
KR101592670B1 (en) Apparatus for searching data using index and method for using the apparatus
JP2017530477A (en) System and method for processing graphs
Shaikh Web Usage Mining Using Apriori and FP Growth Alogrithm
CN110222156B (en) Method and device for discovering entity, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant