CN107463711A - Data tag matching method and device
- Publication number
- CN107463711A (application CN201710723820.7A)
- Authority
- CN
- China
- Prior art keywords
- label
- sample
- sample label
- matching
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
Abstract
The invention provides a data tag matching method and device. The method includes: constructing a sample label table that contains at least one sample label and the hierarchical relationship of the sample labels, where every sample label corresponds to the same label type; extracting, according to the label type of the at least one sample label, a target field corresponding to that label type from pre-acquired data, the target field containing at least one keyword; for each sample label, determining whether a target keyword corresponding to the sample label exists in the target field and, if so, determining the sample label as a reference label; and determining, from the at least one sample label and according to the determined reference labels and the hierarchical relationship of the sample labels, at least one matching label for the data corresponding to the target field. This scheme can improve the accuracy of tag matching.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a data tag matching method and device.
Background
Data analysis helps people judge data accurately and act appropriately, and it plays an important role in practice. Data analysis presupposes that the data has been cleaned, processed, and matched to labels.
When tags are matched to data, the usual approach is to search for a related word that corresponds to the label type and to treat the data found next to that related word as the data matching the label. For example, when the label is "Beijing City", tag matching searches the data collected from the internet for the related word "City"; if it is found, the text in front of the related word is taken as the keyword corresponding to the label, that is, the text in front of "City" is assumed to be the keyword "Beijing", and the label is then determined to be a matching label of the data.
In this process the matching label is determined solely by finding the related word; whether the keyword next to the related word actually corresponds to the label content is never checked. For example, when the characters in front of the related word "City" are garbled text, this approach still matches the data to the label "Beijing City", which results in low tag matching accuracy.
Disclosure of Invention
The embodiment of the invention provides a data tag matching method and device, which can improve the accuracy of tag matching.
In a first aspect, an embodiment of the present invention provides a data tag matching method, including:
constructing a sample label table, wherein the sample label table comprises at least one sample label and the hierarchical relationship of each sample label; each sample label corresponds to the same label type;
extracting a target field corresponding to the label type from pre-acquired data according to the label type of the at least one sample label;
for each of the sample tags, performing:
determining whether a target keyword corresponding to the sample label exists in the target field, and if so, determining the sample label as a reference label;
and determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label.
Preferably,
after the extracting, according to the tag type of the at least one sample tag, a target field corresponding to the tag type from the pre-acquired data, the method further includes:
setting a lexical analyzer corresponding to the data format according to the data format of the at least one sample label;
establishing a full-text index for the target field, and specifying the set lexical analyzer;
splitting the target field into at least one keyword by using the specified lexical analyzer;
then,
the determining whether a target keyword corresponding to the sample tag exists in the target field includes:
and searching whether a target keyword corresponding to the sample label exists in the at least one keyword or not by using the full-text index established by the target field.
Preferably,
prior to the determining whether the target keyword corresponding to the sample tag exists in the target field, further comprising:
respectively setting a cursor corresponding to each level according to the level relation of each sample label;
then,
the determining whether a target keyword corresponding to the sample tag exists in the target field includes:
determining a cursor corresponding to the sample label according to the level corresponding to the sample label;
and searching whether the target keyword exists in the target field or not by using the determined cursor.
Preferably,
before performing, for each of the sample labels, the determining of whether a target keyword corresponding to the sample label exists in the target field, the method further includes:
for the at least one sample label corresponding to each level, performing: determining the character length of each of the at least one sample label, and sorting the at least one sample label by character length;
then,
the performing, for each sample label, of determining whether a target keyword corresponding to the sample label exists in the target field includes:
according to the sorting result of the at least one sample label, sequentially determining, by using the cursor corresponding to the level to which the at least one sample label belongs, whether a target keyword corresponding to each sample label exists in the target field.
Preferably,
determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label, including:
determining whether an upper label of the reference label exists in the at least one sample label according to the hierarchical relationship, and if so, taking the upper label and the reference label as the matching label; otherwise, the reference label is used as the matching label.
In a second aspect, an embodiment of the present invention provides a tag matching apparatus for data, including: the system comprises a construction unit, a field extraction unit and a label matching unit; wherein,
the constructing unit is used for constructing a sample label table, and the sample label table comprises at least one sample label and the hierarchical relationship of each sample label; each sample label corresponds to the same label type;
the field extraction unit is used for extracting a target field corresponding to the label type from pre-acquired data according to the label type of at least one sample label in the sample label table constructed by the construction unit;
the label matching unit is configured to, for each sample label, perform: determining whether a target keyword corresponding to the sample label exists in a target field extracted from the field extraction unit, and if so, determining the sample label as a reference label; and determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label in the sample label table.
Preferably,
the field extraction unit is further used for setting a lexical analyzer corresponding to the data format according to the data format of the at least one sample label; establishing a full-text index for the target field, designating the set lexical analyzer, and splitting the target field into at least one keyword by using the designated lexical analyzer;
and the label matching unit is used for searching whether a target keyword corresponding to the sample label exists in the at least one keyword by using the full-text index established by the target field.
Preferably,
the apparatus further comprises: a setting unit; wherein,
the setting unit is used for respectively setting a cursor corresponding to each level according to the level relation of each sample label;
the label matching unit is used for determining, according to the level to which the sample label belongs, the cursor corresponding to the sample label from the cursors set for each level by the setting unit; and searching whether the target keyword exists in the target field by using the determined cursor.
Preferably,
the setting unit is further configured to, for the at least one sample label corresponding to each level, perform: determining the character length of each of the at least one sample label, and sorting the at least one sample label by character length;
and the label matching unit is used for sequentially determining, according to the sorting result of the sample labels and by using the cursor corresponding to the level to which the sample labels belong, whether a target keyword corresponding to each sample label exists in the target field.
Preferably,
the tag matching unit is used for determining whether an upper-level tag of the reference tag exists in the at least one sample tag according to the hierarchical relationship, and if so, taking the upper-level tag and the reference tag as the matching tags; otherwise, the reference label is used as the matching label.
The embodiments of the invention provide a data tag matching method and device that extract, from pre-acquired data, a target field corresponding to the label type of the sample labels, then determine whether a keyword corresponding to each sample label actually exists in the target field, and, if so, determine the matching labels of the data corresponding to the target field according to the hierarchical relationship of the sample labels. Label matching therefore compares the keywords of the labels directly against the field, instead of matching a label to whatever text happens to be cut out next to a related word, which improves the accuracy of tag matching.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for matching tags of data according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for matching tags of data according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data tag matching apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data tag matching apparatus according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention, and based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative efforts belong to the scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a method for matching a tag of data, where the method may include the following steps:
step 101: constructing a sample label table, wherein the sample label table comprises at least one sample label and the hierarchical relationship of each sample label, and each sample label corresponds to the same label type;
step 102: extracting a target field corresponding to the label type from pre-acquired data according to the label type of the at least one sample label;
step 103: for each of the sample tags, performing: determining whether a target keyword corresponding to the sample label exists in the target field, and if so, determining the sample label as a reference label;
step 104: and determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label.
In the above embodiment, the target field corresponding to the label type of the sample labels is extracted from the pre-acquired data, it is then determined whether a keyword corresponding to each sample label actually exists in the target field, and, if so, the matching labels of the data corresponding to the target field are determined according to the hierarchical relationship of the sample labels. The labels are thus matched against their own keywords rather than against text cut out next to a related word, which improves the accuracy of tag matching.
For example, the sample labels in the sample label table are: the province code, province name, and province abbreviation of each of the 23 provinces and 5 autonomous regions; the city code, city name, and city abbreviation of each of the 4 municipalities directly under the central government and of each city subordinate to a province or autonomous region; and the district/county code, name, and abbreviation of each district or county subordinate to a city. The hierarchical relationship of the sample labels is the administrative level of each division, and the label type of every label is the address class. During tag matching, an address field is located in the large volume of acquired internet data. For example, the address field "Lixia District, Jinan City" can be found in one piece of text data, and it can be determined that keywords corresponding to the sample labels "Jinan City" and "Lixia District" exist in the field. Then, according to the hierarchical relationship in the sample label table, that is, the hierarchy of the administrative divisions, it is determined that Jinan City belongs to Shandong Province, so the sample labels "Shandong Province", "Jinan City", and "Lixia District" are all used as matching labels of the text data corresponding to the address field. Because the keywords corresponding to the sample labels are first determined exactly and further matching then follows the hierarchical relationship among the sample labels, the labels are matched against their own keywords rather than against text cut out next to a related word, which improves the accuracy of tag matching.
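Steps 101 to 104 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the table contents, the names SAMPLE_LABELS, depth, and match_labels, the use of plain substring search in place of a full-text index, and the choice to walk the hierarchy all the way to the top level are hypothetical and are not taken from the embodiment.

```python
# Hypothetical sample label table: label -> upper-level label (None at the top).
SAMPLE_LABELS = {
    "Shandong Province": None,
    "Jinan City": "Shandong Province",
    "Lixia District": "Jinan City",
    "Beijing City": None,
}


def depth(label, table):
    """Number of levels above a label in the hierarchy."""
    levels = 0
    while table[label] is not None:
        label = table[label]
        levels += 1
    return levels


def match_labels(target_field, table=SAMPLE_LABELS):
    """Return the matching labels of one target field, top level first."""
    # Step 103: a sample label whose keyword occurs in the field
    # becomes a reference label (substring search is an assumption).
    reference = [label for label in table if label in target_field]
    # Step 104: add the upper-level labels of every reference label.
    matched = set()
    for label in reference:
        while label is not None:
            matched.add(label)
            label = table[label]
    # Sort by level, then by name, so the result is deterministic.
    return sorted(matched, key=lambda label: (depth(label, table), label))


print(match_labels("Lixia District, Jinan City"))
# ['Shandong Province', 'Jinan City', 'Lixia District']
```

A field whose text in front of "City" is garbled, such as "#@! City", yields no reference label in this sketch, because no complete sample label occurs in it.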
In an embodiment of the present invention, after step 102, the method may further include:
setting a lexical analyzer corresponding to the data format according to the data format of the at least one sample label;
establishing a full-text index for the target field, and specifying the set lexical analyzer;
splitting the target field into at least one keyword by using the specified lexical analyzer;
specific embodiments of step 103 may include: and searching whether a target keyword corresponding to the sample label exists in the at least one keyword or not by using the full-text index established by the target field.
For example, if the pre-acquired data is in a text format, the corresponding lexical analyzer is set to a Chinese lexical analyzer. A full-text index is then established for the target field with this analyzer specified, and the analyzer segments the determined target field into several keywords. For example, if the determined target field is "Lixia District, Jinan City", the Chinese lexical analyzer can split it into the keywords "Jinan City" and "Lixia District". During matching, taking the sample label "Jinan City" as an example, it is determined whether a target keyword corresponding to "Jinan City" exists among the split keywords; if it does, the sample label "Jinan City" is used as a reference label for the target field. Selecting a lexical analyzer that suits the data format lets the acquired data be split more accurately and reduces errors such as wrong or incomplete splits, so sample labels and keywords can be matched precisely and tag matching accuracy improves further.
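The idea of splitting fields into keywords and then looking up sample labels through an index can be sketched as follows. The sketch is an assumption-laden stand-in: tokenize simply splits at commas and is not a Chinese lexical analyzer, and build_index is a hypothetical in-memory inverted index, not the full-text index of any particular database.

```python
from collections import defaultdict


def tokenize(field):
    """Stand-in lexical analyzer: split a field into keywords at commas."""
    return [part.strip() for part in field.split(",") if part.strip()]


def build_index(fields):
    """Map each keyword to the set of record ids whose field contains it."""
    index = defaultdict(set)
    for record_id, field in fields.items():
        for keyword in tokenize(field):
            index[keyword].add(record_id)
    return index


# Hypothetical records: record id -> address field.
fields = {1: "Lixia District, Jinan City", 2: "Jinan City", 3: "#@! City"}
index = build_index(fields)

# A sample label matches only records that contain it as a whole keyword.
print(sorted(index.get("Jinan City", set())))  # [1, 2]
```

Record 3 is not returned for any sample label, since its only keyword is the garbled string "#@! City".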
In an embodiment of the present invention, before step 103, the method may further include:
respectively setting a cursor corresponding to each level according to the level relation of each sample label;
specific embodiments of step 103 may include:
determining a cursor corresponding to the sample label according to the level corresponding to the sample label;
and searching whether the target keyword exists in the target field or not by using the determined cursor.
Here, a province-level cursor, a city-level cursor, and a district/county-level cursor are set according to the administrative levels. The province-level cursor covers the abbreviations of all provinces in the sample label table, the city-level cursor covers the abbreviations of all cities, and the district/county-level cursor covers the full names of all districts and counties. Each cursor is then used to search the target field for target keywords. Taking the district/county-level cursor as an example, the full name of each district or county is used in turn as a keyword to search the target field; if the target field contains a keyword identical to the full name of some district or county, the sample label of that district or county is used as a matching label of the data corresponding to the target field. Province-level and city-level labels are matched in the same way with their cursors. Because each cursor traverses the sample labels of its level automatically, nothing is missed when the target field is checked, which improves both the efficiency and the accuracy of tag matching.
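A per-level cursor can be pictured as an iterator over the sample labels of one level. The following Python sketch illustrates this under assumptions: ROWS, cursor, and match_by_level are hypothetical names, a generator stands in for a database cursor, and substring search stands in for the index lookup.

```python
# Hypothetical rows of a sample label table: (level, label).
ROWS = [
    ("province", "Shandong"),
    ("city", "Jinan"),
    ("city", "Qingdao"),
    ("county", "Lixia District"),
]


def cursor(level):
    """Yield the labels of one level, like a per-level database cursor."""
    for row_level, label in ROWS:
        if row_level == level:
            yield label


def match_by_level(target_field):
    """For each level, collect the labels whose keyword occurs in the field."""
    result = {}
    for level in ("province", "city", "county"):
        result[level] = [label for label in cursor(level) if label in target_field]
    return result


print(match_by_level("Lixia District, Jinan, Shandong"))
# {'province': ['Shandong'], 'city': ['Jinan'], 'county': ['Lixia District']}
```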
In an embodiment of the present invention, before step 103, the method may further include:
for at least one sample label corresponding to each hierarchy, performing: determining the character lengths corresponding to the at least one sample label respectively, and sequencing the at least one sample label according to the character lengths;
specific embodiments of step 103 may include:
and according to the sequencing result of the at least one sample label, sequentially determining whether target keywords corresponding to the sample labels exist in the target field by using the cursors corresponding to the hierarchy to which the at least one sample label belongs.
For example, the full names of the districts and counties are sorted in descending order of character length, and the district/county-level cursor then checks, in that order, whether a keyword corresponding to each full name exists in the target field. This effectively avoids mismatches caused by a sample label matching only part of a name, for example the full district name "East City District" being matched to the shorter label "City District".
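The effect of sorting by character length can be sketched as follows. The function name first_match and the rule of stopping at the first label found are assumptions made for illustration; the embodiment only states that the labels are checked in sorted order.

```python
def first_match(target_field, labels):
    """Return the first label, longest first, that occurs in the field."""
    for label in sorted(labels, key=len, reverse=True):
        if label in target_field:
            return label
    return None


# "City District" is a substring of "East City District".
county_labels = ["City District", "East City District"]

print(first_match("East City District, Beijing", county_labels))
# East City District
```

If the labels were checked in their original order instead, "City District" would be found first in the same field, which is the mismatch the sorting is meant to prevent.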
In an embodiment of the present invention, the detailed implementation of step 104 may include:
determining whether an upper label of the reference label exists in the at least one sample label according to the hierarchical relationship, and if so, taking the upper label and the reference label as the matching label; otherwise, the reference label is used as the matching label.
For example, when "Jinan City" is determined to be a reference label, the administrative hierarchy shows that Jinan City belongs to Shandong Province, that is, "Shandong Province" is the upper-level label of "Jinan City", so both "Shandong Province" and "Jinan City" are used as matching labels of the data. As another example, when the determined reference label is "Shandong Province", no upper-level label exists for it, so only that label is used as the matching label of the data. In this way multi-level label matching of the data is achieved, and tag matching accuracy improves further.
The tag matching method provided by an embodiment of the present invention is described in detail below, taking as an example the matching of province-level, city-level, and district/county-level sample labels with an Oracle database. As shown in FIG. 2, the method may include the following steps:
step 201: constructing a sample label table, wherein the sample label table comprises at least one provincial sample label, at least one city sample label and at least one district-county sample label.
The latest administrative divisions (names and codes) published by the National Bureau of Statistics are obtained from the internet, and the division data of every level is imported into the database and processed to generate an administrative division code table, DM_REGION. The table contains the province code PROVINCE_CODE, province name PROVINCE_NAME, and province abbreviation PROVINCE_JC of each of the 23 provinces and 5 autonomous regions; the city code CITY_CODE, city name CITY_NAME, and city abbreviation CITY_JC of each of the 4 municipalities directly under the central government and of each city subordinate to a province or autonomous region; and the district/county code COUNTY_CODE, name COUNTY_NAME, and abbreviation COUNTY_JC of each district or county subordinate to a city.
Step 202: and extracting an address field from the pre-acquired network text data, and setting a Chinese lexical analyzer.
For example, the database user is DAT_CL, and the pre-acquired network text data is data collected from the internet and stored in the database table DAT_CL. Because the sample labels in the sample label table are all address-class labels, the address field is extracted from the acquired network data. Because the address field is in text format, a Chinese lexical analyzer is set. For example, the Chinese lexical analyzer is set to chinese_lexer, which can be done with a statement such as: ctx_ddl.create_preference('lexer_1', 'chinese_lexer').
Step 203: adding full-text indexes to the address fields, designating a Chinese lexical analyzer, and splitting the address fields into at least one keyword by using the designated Chinese lexical analyzer.
Here, a full-text index named IDX_ADDR is added to the extracted address field. In this example, this can be done with the following statement:
CREATE INDEX IDX_ADDR ON ADDRESS_INFO(ADDRESS)
INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('lexer lexer_1');
step 204: respectively setting a provincial level vernier, a city level vernier and a district level vernier, wherein the provincial level vernier corresponds to at least one provincial level sample label, the city level sample label corresponds to at least one city level sample label, and the district level vernier corresponds to at least one district level sample label.
For example, the established provincial-level cursor is cur _ progress, the city-level cursor is cur _ city, and the district-level cursor is cur _ county.
The province-level cursor can be declared, for example, as follows:
cur_province:
-- Declaration section of a PL/SQL block.
CURSOR CUR_PROVINCE IS
SELECT PROVINCE_JC
FROM (SELECT DISTINCT W.PROVINCE_JC
FROM DM_REGION W
WHERE W.PROVINCE_JC IS NOT NULL)
-- Longest abbreviation first; the sort key is assumed from step 205.
ORDER BY LENGTH(PROVINCE_JC) DESC;
ROW_PROVINCE CUR_PROVINCE%ROWTYPE;
step 205: for each level of cursor, executing: and determining the character length corresponding to each sample label, and sorting each sample label in a descending order according to each character length.
For example, the sample labels are sorted in descending order of city name length, that is, the sample label of a city with a longer name is placed before the sample label of a city with a shorter name.
Step 206: and traversing each cursor, sequentially determining whether the address field has a keyword corresponding to each sample label according to the sequencing result, and if so, determining the corresponding sample label as a reference label.
Taking a province-level cursor as an example, sequentially taking the names of the provinces as keywords according to the descending order of the provinces, indexing an address field, and if the address field has the keywords identical to the name of a certain province, taking a sample label corresponding to the province as a matching label of the data corresponding to the address field.
Taking Shandong as an example, determining whether a keyword corresponding to Shandong exists in the address field can be implemented with the following statement:
INSERT INTO ADDRESS_INFO_LABEL
SELECT W.*, 'Shandong', '', ''
FROM ADDRESS_INFO W
WHERE CONTAINS(ADDRESS, 'Shandong', 1) > 0;
Similarly, the city-level cursor and the district/county-level cursor can be traversed to complete city-level and district/county-level label matching. For example, the district/county-level cursor cur_county can be traversed, using the name of each district or county in turn as the keyword for a full-text search of the address field of the acquired data, to complete district/county-level label matching.
Step 207: and judging whether the sample label table has the upper label of the determined reference label according to the administrative level of each sample label, if so, executing the step 208, and otherwise, executing the step 209.
Step 208: and taking the superior label and the reference label as matching labels of the text data corresponding to the address field, and ending the current flow.
Step 209: the reference label is taken as a matching label.
For example, an address field "kanan city inferior zone" may be extracted from a text data, and it may be determined that there are keywords corresponding to the sample labels "kanan city" and "inferior zone" in the field, and then it is determined that kanan city is affiliated to shandong province according to the hierarchical relationship in the sample label table, i.e., the hierarchical relationship in each level of administrative zones, so that the sample labels "kanan province", "kanan city", and "inferior zone" are all used as matching labels for the text data corresponding to the address field.
In addition, when matching is performed by using the Oracle database, it is necessary to check whether the database has users and roles required by the full-text index, and assign corresponding permissions to the users with matched labels. When the tags match, a tag matching table ADDRESS _ INFO _ LABEL may be pre-constructed to store the matching result. Three fields of ProVINCE, CITY and COUNTY are added to the label matching table. When matching, sampling a PROVINCE in the tag table DM _ REGION, such as Shandong PROVINCE, called Shandong for short, searching the ADDRESS in full text by taking the Shandong as a keyword, successfully matching the Shandong with the PROVINCE tag value as the Shandong, and inserting the data into the ADDRESS _ INFO _ LABEL.
In addition, because the collected address information is not standardized, the sample label table includes the full name, the short name and the code of each address, and each level is generally matched in several ways during label matching, namely by full name, by short name and by code. For example, if an address field is "xx street, Guangxi Zhuang Autonomous Region", it can be matched by the short name "Guangxi" or by the full name "Guangxi Zhuang Autonomous Region". If the address field is "Guangxi Guilin City", it can be matched to the sample label "Guangxi" through the province short name, so the matching label is "Guangxi"; if matching were carried out only through the province full name "Guangxi Zhuang Autonomous Region", the label "Guangxi" could not be matched to this address field. Therefore, using full names, short names and codes together at each level can further improve the accuracy of label matching.
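The benefit of trying several forms of the same label can be shown with a short Python sketch. The `SampleLabel` structure, the `find_label` function and the code value are illustrative assumptions, and a substring test stands in for the full-text search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleLabel:
    full_name: str
    short_name: str
    code: str  # illustrative region code, not taken from the patent


def find_label(address, labels):
    """Return the short name of the first label whose full name, short name or code occurs."""
    for label in labels:
        for form in (label.full_name, label.short_name, label.code):
            if form and form in address:
                return label.short_name
    return None


LABELS = [SampleLabel("Guangxi Zhuang Autonomous Region", "Guangxi", "450000")]

print(find_label("xx street, Guangxi Zhuang Autonomous Region", LABELS))  # Guangxi
print(find_label("Guangxi Guilin City", LABELS))                          # Guangxi
```

With the full name alone, the second address would not match, because "Guangxi Zhuang Autonomous Region" does not occur in "Guangxi Guilin City".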
When labels are matched against the collected text data, four kinds of results arise: data matched at three levels (province, city and county), data matched at two levels (province and city), data matched at one level (province), and data that matches no province, city or county. Data matched to a label at any level can provide dimensional support for data analysis. In addition, the matching results of address data matched at one, two or three levels need to be verified. A possible error is that street or residential-community information coincides with the name of a province, city or county; such data is small in volume and can be adjusted manually.
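The four result categories can be tallied with a small Python sketch. The record layout and the function name `match_level` are assumptions; the sketch also assumes that superior labels have already been filled in, so a record with a city label always has a province label.

```python
from collections import Counter


def match_level(province, city, county):
    """Classify a record by how many administrative levels were matched."""
    if province and city and county:
        return "three-level"
    if province and city:
        return "two-level"
    if province:
        return "one-level"
    return "unmatched"


# Hypothetical rows of the label matching table.
records = [
    {"PROVINCE": "Shandong", "CITY": "Jinan", "COUNTY": "Lixia"},
    {"PROVINCE": "Guangxi", "CITY": "Guilin", "COUNTY": None},
    {"PROVINCE": "Shandong", "CITY": None, "COUNTY": None},
    {"PROVINCE": None, "CITY": None, "COUNTY": None},
]

summary = Counter(match_level(r["PROVINCE"], r["CITY"], r["COUNTY"]) for r in records)
print(dict(summary))
```

Records in the first three categories would then go through the verification mentioned above.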
In an embodiment of the invention, the data label matching method provided by the invention was tested on a data set of 168,608 records with text addresses, matching labels at the province, city and county levels: 83.33% of the records were matched at all three levels (province, city and county), 11% at two levels (province and city), and 0.6% at the province level only, while 5.07% could not be matched to any administrative division. The matching took less than 1 minute. The invention therefore achieves automatic batch label matching for massive text information collected from the Internet in a short time, adding labels to the collected data and providing more analysis dimensions for data analysis.
As shown in fig. 3, an embodiment of the present invention provides a data tag matching apparatus, including: a construction unit 301, a field extraction unit 302 and a tag matching unit 303; wherein,
the constructing unit 301 is configured to construct a sample label table, where the sample label table includes at least one sample label and a hierarchical relationship between the sample labels, and each sample label corresponds to a same label type;
the field extracting unit 302 is configured to extract, according to the label type of at least one sample label in the sample label table constructed by the constructing unit 301, a target field corresponding to the label type from pre-acquired data;
the label matching unit 303 is configured to, for each sample label, perform: determining whether a target keyword corresponding to the sample label exists in the target field extracted in the field extraction unit 302, and if so, determining the sample label as a reference label; and determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label in the sample label table.
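The cooperation of the three units can be sketched in Python as three small functions. The function names, the dictionary-based table and the sample record are assumptions made for illustration, and a substring test stands in for the keyword search.

```python
# Construction unit: sample labels of one label type, each with its superior label.
def build_sample_table():
    return {
        "label_type": "address",
        "labels": {"Shandong": None, "Jinan": "Shandong"},  # label -> superior label
    }


# Field extraction unit: pick the field of a record that corresponds to the label type.
def extract_target_field(record, label_type):
    return record.get(label_type, "")


# Label matching unit: find reference labels, then add their superior labels.
def match_labels(target_field, table):
    labels = table["labels"]
    matched = []
    for label in labels:
        if label in target_field:  # keyword present, so this is a reference label
            chain = [label]
            while labels[chain[-1]] is not None:
                chain.append(labels[chain[-1]])
            for name in reversed(chain):  # superior labels first
                if name not in matched:
                    matched.append(name)
    return matched


table = build_sample_table()
record = {"title": "sample text", "address": "Jinan, xx street"}
field = extract_target_field(record, table["label_type"])
print(match_labels(field, table))  # ['Shandong', 'Jinan']
```

Here the address mentions only the city, and the province label is added through the hierarchical relationship stored in the table.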
In an embodiment of the present invention, the field extracting unit 302 may be further configured to set a lexical analyzer corresponding to the data format according to the data format of the at least one sample tag; establishing a full-text index for the target field, designating the set lexical analyzer, and splitting the target field into at least one keyword by using the designated lexical analyzer;
the label matching unit 303 is configured to search, by using the full-text index established for the target field, whether a target keyword corresponding to the sample label exists in the at least one keyword.
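The idea of splitting the target field into keywords and searching them through an index can be sketched in Python as a small inverted index. The whitespace-and-punctuation splitter `simple_lexer` is a hypothetical stand-in for the lexical analyzer; a lexer suited to the actual data format, for example Chinese text, would segment the field differently.

```python
import re
from collections import defaultdict


def simple_lexer(text):
    """Hypothetical lexical analyzer: split on whitespace and basic punctuation."""
    return [token for token in re.split(r"[\s,;.]+", text) if token]


def build_index(fields):
    """Map each keyword to the set of record ids whose target field contains it."""
    index = defaultdict(set)
    for record_id, text in fields.items():
        for keyword in simple_lexer(text):
            index[keyword.lower()].add(record_id)
    return index


fields = {1: "Shandong Jinan xx street", 2: "Guangxi Guilin City"}
index = build_index(fields)

# Look up the target keyword of a sample label in the index.
print(sorted(index.get("shandong", set())))  # [1]
```

Looking a keyword up in the index avoids rescanning every target field for every sample label.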
As shown in fig. 4, in an embodiment of the present invention, the apparatus may further include: a setting unit 401; wherein,
the setting unit 401 is configured to set a cursor corresponding to each level according to the hierarchical relationship of each sample label;
the label matching unit 303 is configured to determine a cursor corresponding to the sample label from the cursors corresponding to each of the levels set by the setting unit according to the level to which the sample label belongs; and searching whether the target keyword exists in the target field or not by using the determined cursor.
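The per-level cursors can be approximated in Python with one iterator per level. This is a sketch under assumptions: the label lists and function names are hypothetical, a Python iterator stands in for a database cursor, and a substring test stands in for the full-text search.

```python
# Hypothetical sample labels grouped by hierarchy level.
LABELS_BY_LEVEL = {
    "province": ["Shandong", "Guangxi"],
    "city": ["Jinan", "Guilin"],
    "county": ["Lixia"],
}


def make_cursor(level):
    """Return an iterator over the labels of one level, standing in for a database cursor."""
    return iter(LABELS_BY_LEVEL[level])


def search_with_cursors(target_field):
    """Traverse the cursor of each level and record the first label found at that level."""
    found = {}
    for level in LABELS_BY_LEVEL:
        for label in make_cursor(level):
            if label in target_field:
                found[level] = label
                break
    return found


print(search_with_cursors("Shandong Jinan Lixia xx street"))
# {'province': 'Shandong', 'city': 'Jinan', 'county': 'Lixia'}
```

Keeping one cursor per level means each level is traversed completely and independently of the others.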
In an embodiment of the present invention, the setting unit 401 is further configured to, for at least one sample label corresponding to each hierarchy, perform: determining the character lengths corresponding to the at least one sample label respectively, and sequencing the at least one sample label according to the character lengths;
the tag matching unit 303 is configured to sequentially determine whether a target keyword corresponding to each sample tag exists in the target field according to the sorting result of the at least one sample tag and by using a cursor corresponding to the hierarchy to which the at least one sample tag belongs.
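Why the labels of a level are sorted by character length can be seen in a short Python sketch. Sorting longest first is an assumption, since the description only states that the labels are sorted by character length, and the label names "Chang" and "Changqing" are hypothetical.

```python
def sort_by_length(labels):
    # Longest first is an assumption; the text only says labels are sorted by character length.
    return sorted(labels, key=len, reverse=True)


def first_match(target_field, labels):
    """Return the first label, in sorted order, that occurs in the target field."""
    for label in sort_by_length(labels):
        if label in target_field:
            return label
    return None


# Hypothetical labels of one level, where one label is a prefix of another.
county_labels = ["Chang", "Changqing"]
print(first_match("Jinan Changqing xx road", county_labels))  # Changqing
```

If the shorter label were tried first, "Chang" would match the field before "Changqing" was considered, which is the kind of mismatch the sorting is meant to avoid.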
In an embodiment of the present invention, the tag matching unit 303 is configured to determine whether an upper-level tag of the reference tag exists in the at least one sample tag according to the hierarchical relationship, and if so, take the upper-level tag and the reference tag as the matching tag; otherwise, the reference label is used as the matching label.
Because the information interaction, execution process, and other contents between the units in the device are based on the same concept as the method embodiment of the present invention, specific contents may refer to the description in the method embodiment of the present invention, and are not described herein again.
The invention also provides a readable medium comprising executable instructions which, when executed by a processor of a storage controller, cause the storage controller to perform a method as provided by any of the above-described embodiments of the invention.
In addition, the present invention also provides a storage controller comprising: a processor, a memory, and a bus; the memory is used for storing execution instructions, the processor is connected with the memory through the bus, and when the storage controller runs, the processor executes the execution instructions stored in the memory, so that the storage controller executes the method provided by any one of the above embodiments of the invention.
In summary, the embodiments of the present invention have at least the following advantages:
1. In the embodiment of the invention, the target field corresponding to the label type of the sample labels is extracted from the pre-collected data, whether the keyword corresponding to a sample label exists in the target field is then determined, and if so, the matching labels of the data corresponding to the target field are determined according to the hierarchical relationship of the sample labels. Label matching therefore operates directly on the keywords corresponding to the labels, rather than on data truncated around related words, which improves the accuracy of label matching.
2. In the embodiment of the invention, a lexical analyzer corresponding to the data format of the sample label is determined, and at least one keyword is split from the target field by using the determined lexical analyzer. The lexical analyzer corresponding to the data format is selected, so that the acquired data can be more accurately disassembled, the occurrence rate of errors such as wrong disassembly and incomplete disassembly is reduced, the accurate matching of the sample label and the keyword is facilitated, and the accuracy of label matching is further improved.
3. In the embodiment of the invention, a cursor is set for each level according to the hierarchical relationship of the sample labels, and the set cursors are used to search whether the target keywords corresponding to the sample labels exist in the target field. Each sample label can thus be traversed automatically against the target field without any target field being skipped, which improves both the efficiency and the accuracy of label matching.
4. In the embodiment of the invention, the sample labels of each level are sorted according to the corresponding character length respectively. And according to the sequencing result, sequentially determining whether target keywords corresponding to the sample labels exist in the target field by utilizing the cursors corresponding to the levels respectively. Therefore, label mismatching caused by incomplete correspondence of the sample labels can be effectively avoided, and the accuracy of label matching is further improved.
5. In the embodiment of the invention, after the reference label is determined, whether the superior label of the reference label exists is determined according to the hierarchical relationship among the sample labels, and if the superior label of the reference label exists, the reference label and the superior label are determined as the matching label, so that the multi-level label matching of data can be realized, and the accuracy of label matching is further improved.
6. In the embodiment of the invention, when the address class label matching is carried out on the acquired data, the sample label table comprises the full name, the short name and the area code which are respectively corresponding to each address. When the labels are matched, multiple matching modes of full names, short names and codes are adopted in each level, so that the accuracy of label matching is improved.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other similar elements in a process, method, article, or apparatus that comprises the element.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it is to be noted that: the above description is only a preferred embodiment of the present invention, and is only used to illustrate the technical solutions of the present invention, and not to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for matching tags of data, comprising:
constructing a sample label table, wherein the sample label table comprises at least one sample label and the hierarchical relationship of each sample label; each sample label corresponds to the same label type;
extracting a target field corresponding to the label type from pre-acquired data according to the label type of the at least one sample label;
for each of the sample tags, performing:
determining whether a target keyword corresponding to the sample label exists in the target field, and if so, determining the sample label as a reference label;
and determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label.
2. The method of claim 1,
after the extracting, according to the tag type of the at least one sample tag, a target field corresponding to the tag type from the pre-acquired data, the method further includes:
setting a lexical analyzer corresponding to the data format according to the data format of the at least one sample label;
establishing a full-text index for the target field, and specifying the set lexical analyzer;
splitting the target field into at least one keyword by using the specified lexical analyzer;
accordingly,
the determining whether a target keyword corresponding to the sample tag exists in the target field includes:
and searching whether a target keyword corresponding to the sample label exists in the at least one keyword or not by using the full-text index established by the target field.
3. The method of claim 1,
prior to the determining whether the target keyword corresponding to the sample tag exists in the target field, further comprising:
respectively setting a cursor corresponding to each level according to the level relation of each sample label;
accordingly,
the determining whether a target keyword corresponding to the sample tag exists in the target field includes:
determining a cursor corresponding to the sample label according to the level corresponding to the sample label;
and searching whether the target keyword exists in the target field or not by using the determined cursor.
4. The method of claim 3,
before performing, for each of the sample tags, the determining whether a target keyword corresponding to the sample tag exists in the target field, the method further comprises:
for at least one sample label corresponding to each hierarchy, performing: determining the character lengths corresponding to the at least one sample label respectively, and sequencing the at least one sample label according to the character lengths;
accordingly,
the performing, for each sample label, of determining whether a target keyword corresponding to the sample tag exists in the target field includes:
and according to the sequencing result of the at least one sample label, sequentially determining whether target keywords corresponding to the sample labels exist in the target field by using the cursors corresponding to the hierarchy to which the at least one sample label belongs.
5. The method of claim 1,
determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label, including:
determining whether an upper label of the reference label exists in the at least one sample label according to the hierarchical relationship, and if so, taking the upper label and the reference label as the matching label; otherwise, the reference label is used as the matching label.
6. An apparatus for matching tags of data, comprising: a construction unit, a field extraction unit and a label matching unit; wherein,
the constructing unit is used for constructing a sample label table, and the sample label table comprises at least one sample label and the hierarchical relationship of each sample label; each sample label corresponds to the same label type;
the field extraction unit is used for extracting a target field corresponding to the label type from pre-acquired data according to the label type of at least one sample label in the sample label table constructed by the construction unit;
the label matching unit is configured to, for each sample label, perform: determining whether a target keyword corresponding to the sample label exists in the target field extracted by the field extraction unit, and if so, determining the sample label as a reference label; and determining at least one matching label corresponding to the data corresponding to the target field from the at least one sample label according to the determined reference label and the hierarchical relationship of each sample label in the sample label table.
7. The apparatus of claim 6,
the field extraction unit is further used for setting a lexical analyzer corresponding to the data format according to the data format of the at least one sample label; establishing a full-text index for the target field, designating the set lexical analyzer, and splitting the target field into at least one keyword by using the designated lexical analyzer;
and the label matching unit is used for searching whether a target keyword corresponding to the sample label exists in the at least one keyword by using the full-text index established by the target field.
8. The apparatus of claim 6, further comprising: a setting unit; wherein,
the setting unit is used for respectively setting a cursor corresponding to each level according to the level relation of each sample label;
the label matching unit is used for determining the cursor corresponding to the sample label from the cursors corresponding to each level set by the setting unit according to the level to which the sample label belongs; and searching whether the target keyword exists in the target field or not by using the determined cursor.
9. The apparatus of claim 8,
the setting unit is further configured to, for at least one sample label corresponding to each hierarchy, perform: determining the character lengths corresponding to the at least one sample label respectively, and sequencing the at least one sample label according to the character lengths;
and the label matching unit is used for sequentially determining whether target keywords corresponding to the sample labels exist in the target field by using cursors corresponding to the hierarchy to which the sample labels belong according to the sequencing result of the sample labels.
10. The apparatus of claim 6,
the tag matching unit is used for determining whether an upper-level tag of the reference tag exists in the at least one sample tag according to the hierarchical relationship, and if so, taking the upper-level tag and the reference tag as the matching tags; otherwise, the reference label is used as the matching label.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710723820.7A CN107463711B (en) | 2017-08-22 | 2017-08-22 | Data tag matching method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107463711A (en) | 2017-12-12 |
CN107463711B (en) | 2020-07-28 |
Family
ID=60549314
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710723820.7A Active CN107463711B (en) | 2017-08-22 | 2017-08-22 | Data tag matching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107463711B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101283353A (en) * | 2005-08-03 | 2008-10-08 | 温克科技公司 | Systems for and methods of finding relevant documents by analyzing tags |
CN101350012A (en) * | 2007-07-18 | 2009-01-21 | 北京灵图软件技术有限公司 | Method and system for matching address |
CN101686146A (en) * | 2008-09-28 | 2010-03-31 | 华为技术有限公司 | Method and equipment for fuzzy query, query result processing and filtering condition processing |
US20130232153A1 (en) * | 2012-03-02 | 2013-09-05 | Cleversafe, Inc. | Modifying an index node of a hierarchical dispersed storage index |
CN104123366A (en) * | 2014-07-23 | 2014-10-29 | 谢建平 | Search method and server |
CN104375992A (en) * | 2013-08-12 | 2015-02-25 | 中国移动通信集团浙江有限公司 | Address matching method and device |
CN104834736A (en) * | 2015-05-19 | 2015-08-12 | 深圳证券信息有限公司 | Method and device for establishing index database and retrieval method, device and system |
CN104967565A (en) * | 2015-05-28 | 2015-10-07 | 烽火通信科技股份有限公司 | Method and system for hybrid processing of upstream label and downstream label |
US20160246799A1 (en) * | 2015-02-20 | 2016-08-25 | International Business Machines Corporation | Policy-based, multi-scheme data reduction for computer memory |
CN106503276A (en) * | 2017-01-06 | 2017-03-15 | 山东浪潮云服务信息科技有限公司 | A kind of method and apparatus of the time series databases for real-time monitoring system |
CN106682147A (en) * | 2016-12-22 | 2017-05-17 | 北京锐安科技有限公司 | Mass data based query method and device |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108959244A (en) * | 2018-06-07 | 2018-12-07 | 北京京东尚科信息技术有限公司 | The method and apparatus of address participle |
CN108959244B (en) * | 2018-06-07 | 2022-08-09 | 北京京东尚科信息技术有限公司 | Address word segmentation method and device |
WO2020177073A1 (en) * | 2019-03-05 | 2020-09-10 | 深圳市天软科技开发有限公司 | Data set acquisition method, terminal device, and computer readable storage medium |
CN110097407A (en) * | 2019-05-10 | 2019-08-06 | 宁波奥克斯电气股份有限公司 | A kind of generation method and system of user tag |
CN111198887A (en) * | 2019-12-31 | 2020-05-26 | 北京左医健康技术有限公司 | Medicine indexing method, medicine retrieval method and system |
CN111626808A (en) * | 2020-02-26 | 2020-09-04 | 京东数字科技控股有限公司 | Data processing method and apparatus, storage medium, and electronic apparatus |
CN112528100A (en) * | 2020-12-18 | 2021-03-19 | 厦门市美亚柏科信息股份有限公司 | Label strategy recommending and marking method, terminal equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107463711B (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107463711B (en) | Data tag matching method and device | |
CN103186524B (en) | A kind of place name identification method and apparatus | |
CN109739997B (en) | Address comparison method, device and system | |
CN109033086A (en) | A kind of address resolution, matched method and device | |
CN106598965B (en) | Account mapping method and device based on address information | |
CN112528174A (en) | Address finishing and complementing method based on knowledge graph and multiple matching and application | |
CN110990520A (en) | Address coding method and device, electronic equipment and storage medium | |
CN109933803B (en) | Idiom information display method, idiom information display device, electronic equipment and storage medium | |
CN108733810A (en) | A kind of address date matching process and device | |
CN103678371B (en) | Word library updating device, data integration device and method and electronic equipment | |
CN111414357A (en) | Address data processing method, device, system and storage medium | |
Schmidt et al. | Extraction of address data from unstructured text using free knowledge resources | |
CN112069824B (en) | Region identification method, device and medium based on context probability and citation | |
CN108073678B (en) | Document analysis processing method, system and device applied to big data analysis | |
CN114201480A (en) | Multi-source POI fusion method and device based on NLP technology and readable storage medium | |
CN110232160B (en) | Method and device for detecting interest point transition event and storage medium | |
CN109902148B (en) | Automatic enterprise name completion method for address book contacts | |
CN117494711A (en) | Semantic-based electricity utilization address similarity matching method | |
CN115600601B (en) | Method, device, equipment and medium for constructing tax law knowledge base | |
CN114513550B (en) | Geographic position information processing method and device and electronic equipment | |
CN115577694A (en) | Intelligent recommendation method for standard writing | |
CN115062108A (en) | Method for obtaining standardized house address | |
CN114792091A (en) | Chinese address element analysis method and equipment based on vocabulary enhancement and storage medium | |
CN116414808A (en) | Method, device, computer equipment and storage medium for normalizing detailed address | |
CN114036414A (en) | Method and device for processing interest points, electronic equipment, medium and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||