Christen, 2007 - Google Patents
Improving data linkage and deduplication quality through nearest-neighbour based blockingChristen, 2007
View PDF- Document ID
- 6052687604465235831
- Author
- Christen P
- Publication year
- Publication venue
- Department of Computer Science, The Australian National University. ACM
External Links
Snippet
Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. A main challenge when linking large databases is the …
- 230000000903 blocking 0 title abstract description 189
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30386—Retrieval requests
- G06F17/30424—Query processing
- G06F17/30477—Query execution
- G06F17/30483—Query execution of query operations
- G06F17/30486—Unary operations; data partitioning operations
- G06F17/30489—Aggregation and duplicate elimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30289—Database design, administration or maintenance
- G06F17/30303—Improving data quality; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30705—Clustering or classification
- G06F17/3071—Clustering or classification including class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30386—Retrieval requests
- G06F17/30424—Query processing
- G06F17/30533—Other types of queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30613—Indexing
- G06F17/30619—Indexing indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30587—Details of specialised database models
- G06F17/30595—Relational databases
- G06F17/30598—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30634—Querying
- G06F17/30657—Query processing
- G06F17/30675—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30312—Storage and indexing structures; Management thereof
- G06F17/30321—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30067—File systems; File servers
- G06F17/30129—Details of further file system functionalities
- G06F17/3015—Redundancy elimination performed by the file system
- G06F17/30156—De-duplication implemented within the file system, e.g. based on file segments
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99936—Pattern matching access
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | A survey on blocking technology of entity resolution | |
Christen | Automatic record linkage using seeded nearest neighbour and support vector machine classification | |
Christen | A survey of indexing techniques for scalable record linkage and deduplication | |
Singla et al. | Entity resolution with markov logic | |
Brizan et al. | A. survey of entity resolution and record linkage methodologies | |
CA2723167C (en) | Database systems and methods for linking records corresponding to the same individual | |
Wang et al. | Semantic-aware blocking for entity resolution | |
Khabsa et al. | Online person name disambiguation with constraints | |
Christen | A Two-Step Classification Approach to Unsupervised Record Linkage. | |
Christen | Towards parameter-free blocking for scalable record linkage | |
Welch et al. | Fast and accurate incremental entity resolution relative to an entity knowledge base | |
Christen et al. | Towards scalable real-time entity resolution using a similarity-aware inverted index approach | |
De Vries et al. | Robust record linkage blocking using suffix arrays and bloom filters | |
Christen et al. | Towards automated data linkage and deduplication | |
Costa et al. | A blocking scheme for entity resolution in the semantic web | |
KR20080049239A (en) | Method for disambiguation of same name authors by information extraction from orignal text | |
Al-Sarkhi et al. | Estimating the parameters for linking unstandardized references with the matrix comparator | |
Lee et al. | A hierarchical document clustering approach with frequent itemsets | |
O’hare et al. | High-value token-blocking: efficient blocking method for record linkage | |
Beneventano et al. | BLAST2: An efficient technique for loose schema information extraction from heterogeneous big data sources | |
Van Dam et al. | Duplicate detection in web shops using LSH to reduce the number of computations | |
Christen | Improving data linkage and deduplication quality through nearest-neighbour based blocking | |
Christen et al. | Assessing deduplication and data linkage quality: what to measure? | |
Goel et al. | Efficient indexing techniques for record matching and deduplication | |
Christen | Performance and scalability of fast blocking techniques for deduplication and data linkage |