CN111460069A - Address correction method based on weighted directed acyclic graph - Google Patents
Address correction method based on weighted directed acyclic graph Download PDFInfo
- Publication number
- CN111460069A CN111460069A CN202010241121.0A CN202010241121A CN111460069A CN 111460069 A CN111460069 A CN 111460069A CN 202010241121 A CN202010241121 A CN 202010241121A CN 111460069 A CN111460069 A CN 111460069A
- Authority
- CN
- China
- Prior art keywords
- address
- acyclic graph
- directed acyclic
- weighted
- normalization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 22
- 238000010606 normalization Methods 0.000 claims abstract description 25
- 230000011218 segmentation Effects 0.000 claims abstract description 8
- 238000004458 analytical method Methods 0.000 claims abstract description 7
- 238000000605 extraction Methods 0.000 claims abstract description 6
- 238000004364 calculation method Methods 0.000 claims abstract 2
- 238000007781 pre-processing Methods 0.000 claims description 2
- 238000013523 data management Methods 0.000 abstract description 3
- 230000007547 defect Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Remote Sensing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an address correction method based on a weighted directed acyclic graph, which relates to the field of data management and specifically comprises the following steps: performing tail-free processing on the address data subjected to pre-normalization, and performing TF-IDF feature word extraction; calculating an address similarity threshold based on the extracted feature words and the weights; calculating similarity between addresses in the tailless address base and the pre-normalized addresses: performing word segmentation analysis on the address of the preprocessed address, and then combining every two in sequence to form an edge set of the directed graph; and generating a directed acyclic graph based on the calculation of the edge weight data, outputting a weighted maximum path, and completing address normalization. The address completion correction algorithm provided by the invention can effectively solve the problems and can solve the address normalization problem on the basis of no standard address level division data.
Description
Technical Field
The invention belongs to the field of data management, and particularly relates to an address rectification method based on a weighted directed acyclic graph.
Background
In the field of data management, address normalization analysis is often involved, the most common normalized address data is logistics address data, early logistics address information is input by self, and the problem that address information is not standard due to the fact that people miss part of address information or fill part of address information by mistake is solved, so that an address normalization method is provided.
At present, most address normalization methods are fast normalization methods based on a dictionary tree, wherein each layer of the dictionary tree represents an address level, and the final normalized address is obtained by traversing from a root node to a leaf node. The main drawback of this method is the dependence on standard and complete address-level partitioning data for constructing the dictionary tree, and secondly the space consumption of the algorithm is excessive.
Disclosure of Invention
The invention aims to solve the technical problem of providing an address rectification method based on a weighted directed acyclic graph aiming at the defects of the background technology, wherein when no standard and complete address grade divides data, the weighted directed acyclic graph is established through the redundancy relation among mass data, and the weighted maximum path of the directed acyclic graph is the address after normalization.
The invention adopts the following technical scheme for solving the technical problems:
an address rectification method based on a weighted directed acyclic graph specifically comprises the following steps:
step 1, performing tailless processing on address data subjected to pre-normalization, and performing TF-IDF feature word extraction;
step 2, calculating an address similarity threshold value based on the extracted feature words and the weight;
and 5, generating a directed acyclic graph based on the calculated edge weight data, outputting a weighted maximum path, and completing address normalization.
As a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, the step 1 specifically includes:
carrying out tailless processing on the address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF characteristic word extraction { (F)1,W1),(F2,W2),...,(Fn,Wn) In which FnRepresenting a characteristic word, WnThe TF-IDF value representing the feature word.
As a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, the step 2 is specifically as follows: calculating an address similarity threshold θ based on the extracted TF-IDF feature words and a weight FW, wherein FW { (F)1,W1),(F2,W2),...,(Fn,Wn) And, calculating specifically as follows:
as a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, in step 3, the similarity W between the address (Sadd) in the tailless address base and the pre-normalized address is calculated, which is specifically as follows:
extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S1,s2,...,s10}。
As a further preferable scheme of the address rectification method based on the weighted directed acyclic graph, in step 4, word segmentation analysis is carried out on 11 addresses S and preprocessed addresses which meet the requirement that the similarity W is more than or equal to theta, and the two addresses are combined in sequence to form an edge set of the directed graph:
all the above-mentioned edge data are formed into edge weight data E, E { < w { [1,w2,c1>,<w2,w3,c2>,...,<wk-1,wk,c3> -, where w1,w2Respectively representing the start node of an edge, c1Representing the weight of an edge, i.e. w1,w2The number of occurrences.
As a further preferable scheme of the address rectification method based on the weighted directed acyclic graph of the present invention, in step 5, a directed acyclic graph G is generated based on the calculated edge weight data E, a weighted maximum path P is output, and address normalization is completed:
wherein the node information on the P path is the normalized result of the address Add, i.e. w1w2...wp。
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. aiming at the defects that address normalization excessively depends on a standard and complete address grade data division, the invention provides a weighted directed acyclic graph established based on the redundancy relation among massive address data, wherein the weighted maximum path of the directed acyclic graph is a normalized labeled standard address;
2. the address normalization method is a method for optimizing address by segments, words forming the address are sequentially combined and divided into address sections, and the optimal address section relation forms the final normalized address, so that the error rate of address normalization can be reduced, and the better the address normalization effect is improved; when no standard and complete address grade division data exists, the weighted directed acyclic graph is established through the redundancy relation among the mass data, and the weighted maximum path of the directed acyclic graph is the normalized address.
Drawings
FIG. 1 is a flow chart of an address completion correction algorithm for a weighted directed acyclic graph of the present invention;
FIG. 2 is a weighted directed acyclic graph constructed based on an address in accordance with the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the defects that address normalization excessively depends on a standard and complete address grade data division, the invention provides a weighted directed acyclic graph established based on the redundancy relation among massive address data, wherein the weighted maximum path of the directed acyclic graph is a normalized labeled standard address.
FIG. 1 is a flow chart of an address completion correction algorithm for a weighted directed acyclic graph, including a precedence relationship between data processing modules; fig. 2 is a weighted directed acyclic graph constructed based on an address, in which part of the private information has been replaced with a #.
An address rectification method based on a weighted directed acyclic graph is shown in fig. 1, and specifically includes the following steps:
step 1, carrying out tailless processing on address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF feature word extraction { (F)1,W1),(F2,W2),...,(Fn,Wn) In which FiRepresenting a characteristic word, WiA TF-IDF value representing the feature word;
step 2, calculating an address similarity threshold value theta based on the extracted TF-IDF characteristic words and a weight FW, wherein FW { (F)1,W1),(F2,W2),...,(Fn,Wn) And, calculating specifically as follows:
extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S1,s2,...,s10};
all the above-mentioned edge data are formed into edge weight data E, E { < w { [1,w2,c1>,<w2,w3,c2>,...,<wk-1,wk,c3> -, where w1,w2Respectively representing the start node of an edge, c1Representing the weight of an edge, i.e. w1,w2The number of occurrences;
step 5, generating a directed acyclic graph G based on the calculated edge weight data E, and outputting a weighted maximum path P based on a weighted directed acyclic graph constructed on a certain address, as shown in fig. 2, to complete address normalization:
the normalization result w of whether the nodes on the weighted maximum path are sequentially connected or the address Add is1w2...wp
The details of the technical solution of the invention are illustrated with reference to the accompanying drawings:
fig. 1 shows the overall process of the address completion algorithm, for example, the address to be completed is "oasis home yard in large watt kiln" unit cell "; the result after deleting the tail information is: "Luzhou home garden on the way in a big tile kiln"; the result of the word segmentation TF-IDF without the tail address is: 2.987, 'homeland', 2.178, 'oasis', 2.108, 'middle road', 2.064 }; the similarity threshold is: 5.051, respectively; the top 10 tailless addresses that meet the threshold are shown in table 1.
TABLE 1
Serial number | Tailless address |
1 | Middle road of large tile kiln in Toyotai district of Beijing City |
2 | Middle road of large tile kiln in Toyotai district of Beijing City |
3 | Large tile kiln middle road of Fengtai district of Haizui district of Beijing |
4 | Zhonglu Fengze home garden with large tile kiln in Fengtai district of Beijing City |
5 | Great watt kiln green land home garden between five rings and six rings of Fengtai district in Beijing City |
6 | Luzhou home garden of Lugou bridge tile kiln Zhongyao of Fengtai district of Beijing |
7 | Large tile kiln middle road at west road junction of Toyotai district of Beijing City |
8 | Great tile kiln oasis home garden in Fengtai district of Beijing City |
9 | Middle road of large tile kiln in east city of Beijing City |
10 | Middle road of Rugou bridge large tile kiln in Fengtu platform area of Beijing City |
The set of directed edge weights obtained by word segmentation is shown in table 2:
TABLE 2
Directed graph edges | Directed graph edges | Directed graph edges | Directed graph edges |
<Beginning, Beijing, 10> | <Dongchuan area, large brickkiln, 1> | <To, six rings, 1> | <Feng ze, Home 1> |
<Start, large brickkiln, 1> | <Fengcai, Lugou bridge, 3> | <Six rings, between, 1> | <Oasis, home, 4> |
<Beijing City, Fengtai, 1> | <Rugou bridge, large brickkiln, 3> | <Chinese, Large Tile kiln 1> | <Zhongbaodao, oasis, 1> |
<Beijing City, Tokyo, 1> | <Fengcai, West, 1> | <Fengtai district, Large Tile kiln, 3> | <Home, date, 1> |
<Beijing City, Fengtai district, 7> | <West, crossing, 1> | <Large tile kiln, middle road, 8> | <Period, cell, 1> |
<Beijing, Haihe district, 1> | <Level crossing, large brickkiln, 1> | <Large Tile kiln, Medium court, 1> | <Cell, 1> |
<Fengtai, Fengtai district, 1> | <Fengcai region, five rings, 1> | <Zhongluo, oasis, 1> | <Chamber, end, 1> |
<Haihai region, large brickkiln, 1> | <Pentacyclic ring to, 1> | <Zhonglu, Fengze, 1> |
The formed weighted directed acyclic graph is shown in fig. 2 in the attached drawing, and finally the most weighted path node value is output as the result of address normalization, that is: "the Luzhou home garden of Lugou bridge large tile kiln of Fengti district of Beijing unit.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention. While the embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (6)
1. A method for correcting addresses based on a weighted directed acyclic graph is characterized in that: the method specifically comprises the following steps:
step 1, performing tailless processing on address data subjected to pre-normalization, and performing TF-IDF feature word extraction;
step 2, calculating an address similarity threshold value based on the extracted feature words and the weight;
step 3, calculating the similarity between the address in the non-tail address base and the pre-normalized address:
step 4, performing word segmentation analysis on the address of the preprocessed address, and then combining every two in sequence to form an edge set of the directed graph;
and 5, generating a directed acyclic graph based on the calculated edge weight data, outputting a weighted maximum path, and completing address normalization.
2. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: the step 1 is specifically as follows:
carrying out tailless processing on the address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF characteristic word extraction { (F)1,W1),(F2,W2),...,(Fn,Wn) In which FnRepresenting a characteristic word, WnThe TF-IDF value representing the feature word.
3. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: the step 2 is specifically as follows: calculating an address similarity threshold θ based on the extracted TF-IDF feature words and a weight FW, wherein FW { (F)1,W1),(F2,W2),...,(Fn,Wn) And, calculating specifically as follows:
4. the address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 3, the similarity W between the address (Sadd) in the tailless address base and the pre-normalized address is calculated as follows:
extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S1,s2,...,s10}。
5. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 4, performing word segmentation analysis on 11 addresses including the first 10 addresses S and the preprocessing address satisfying the similarity W being more than or equal to theta, and combining every two addresses in sequence to form an edge set of the directed graph:
......
all the above-mentioned edge data are formed into edge weight data E, E { < w { [1,w2,c1>,<w2,w3,c2>,...,<wk-1,wk,c3> -, where w1,w2Respectively representing the start node of an edge, c1Representing the weight of an edge, i.e. w1,w2The number of occurrences.
6. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 5, generating a directed acyclic graph G based on the calculation edge weight data E, outputting a weighted maximum path P, and completing address normalization:
wherein the node information on the P path is the normalized result of the address Add, i.e. w1w2...wp。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010241121.0A CN111460069A (en) | 2020-03-31 | 2020-03-31 | Address correction method based on weighted directed acyclic graph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010241121.0A CN111460069A (en) | 2020-03-31 | 2020-03-31 | Address correction method based on weighted directed acyclic graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111460069A true CN111460069A (en) | 2020-07-28 |
Family
ID=71685100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010241121.0A Pending CN111460069A (en) | 2020-03-31 | 2020-03-31 | Address correction method based on weighted directed acyclic graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111460069A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114970518A (en) * | 2022-02-15 | 2022-08-30 | 北京青萌数海科技有限公司 | Method and device for correcting address data |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804398A (en) * | 2017-05-03 | 2018-11-13 | 阿里巴巴集团控股有限公司 | The similarity calculating method and device of address text |
CN110175216A (en) * | 2019-05-15 | 2019-08-27 | 腾讯科技(深圳)有限公司 | Coordinate error correction method, device and computer equipment |
-
2020
- 2020-03-31 CN CN202010241121.0A patent/CN111460069A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804398A (en) * | 2017-05-03 | 2018-11-13 | 阿里巴巴集团控股有限公司 | The similarity calculating method and device of address text |
CN110175216A (en) * | 2019-05-15 | 2019-08-27 | 腾讯科技(深圳)有限公司 | Coordinate error correction method, device and computer equipment |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114970518A (en) * | 2022-02-15 | 2022-08-30 | 北京青萌数海科技有限公司 | Method and device for correcting address data |
CN114970518B (en) * | 2022-02-15 | 2022-12-16 | 北京青萌数海科技有限公司 | Method and device for correcting address data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020119619A1 (en) | Network optimization structure employing 3d target classification and scene semantic segmentation | |
CN102298585B (en) | A kind of address cutting and rank mask method and address cutting and rank annotation equipment | |
CN108614997B (en) | Remote sensing image identification method based on improved AlexNet | |
CN110955780A (en) | Entity alignment method for knowledge graph | |
CN103377237B (en) | The neighbor search method of high dimensional data and fast approximate image searching method | |
CN103593400A (en) | Lightning activity data statistics method based on modified Apriori algorithm | |
CN113505261B (en) | Data labeling method and device and data labeling model training method and device | |
CN113159451B (en) | Long-term prediction method for drainage basin drought and flood events based on event knowledge graph construction | |
WO2022262320A1 (en) | Information completion method for knowledge graph-based power distribution network cim model, and system | |
CN104317908B (en) | Outlier detection method based on three decisions and distance | |
CN111460069A (en) | Address correction method based on weighted directed acyclic graph | |
CN113761221A (en) | Knowledge graph entity alignment method based on graph neural network | |
CN104346442B (en) | A kind of Rules extraction method of Process-Oriented object data | |
Zhang et al. | Development of a supervised software tool for automated determination of optimal segmentation parameters for ecognition | |
CN112165401A (en) | Edge community discovery algorithm based on network pruning and local community expansion | |
CN110941645B (en) | Method, device, storage medium and processor for automatically judging string case | |
CN116089627A (en) | Knowledge graph construction method based on big data technology | |
CN111192154A (en) | Social network user node matching method based on style migration | |
CN110399455A (en) | A kind of deep learning data digging method based on CNN and LSTM | |
CN103164487A (en) | Clustering algorithm based on density and geometrical information | |
CN114399687A (en) | Semi-supervised self-training hyperspectral remote sensing image classification method based on spatial correction | |
CN113159044A (en) | Deep learning-based road material identification method for convolutional neural network | |
CN107578136A (en) | The overlapping community discovery method extended based on random walk with seed | |
CN112116709A (en) | Terrain feature line processing method for improving terrain expression precision | |
CN112183001A (en) | Hypergraph-based multistage clustering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200728 |