
BLAST: a loosely schema-aware meta-blocking approach for entity resolution

Published: 01 August 2016 Publication History

Abstract

Identifying records that refer to the same entity is a fundamental step for data integration. Since it is prohibitively expensive to compare every pair of records, blocking techniques are typically employed to reduce the complexity of this task. These techniques partition records into blocks and limit the comparison to records co-occurring in a block. Generally, to deal with highly heterogeneous and noisy data (e.g., semi-structured Web data), these techniques rely on redundancy to reduce the chance of missing matches.
Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features.
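To make the two ideas above concrete, here is a minimal sketch, not the paper's method. It assumes token blocking as the redundancy-based technique (every token in any attribute value defines a block) and a simple schema-agnostic meta-blocking step that weights each candidate pair by the number of blocks it shares and prunes pairs below the average weight. The function names and the toy records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Map each token to the set of record ids whose values contain it."""
    blocks = defaultdict(set)
    for rid, attributes in records.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # A block holding a single record yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Count, for every pair of records, how many blocks they share."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune(weights):
    """Keep only the pairs whose weight is at least the average weight."""
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}


# Hypothetical records with heterogeneous attribute names.
records = {
    "r1": {"name": "John Smith", "city": "Modena"},
    "r2": {"full_name": "Smith John", "location": "Modena Italy"},
    "r3": {"name": "Mary Jones", "city": "Modena"},
    "r4": {"title": "Jones Mary", "place": "Trento Italy"},
}

blocks = token_blocking(records)
weights = candidate_pairs(blocks)
kept = prune(weights)
print(kept)
```

In this toy input, token blocking proposes five candidate pairs, and pruning keeps only the two pairs that share more than one block. The attribute names are ignored entirely, which is what "schema-agnostic" means here.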
In this paper, we demonstrate how "loose" schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting an LSH-based step for efficiently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised one.
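The abstract does not say how the LSH-based step is built. As one illustration of the general idea, the sketch below uses MinHash, a standard LSH family for Jaccard similarity, to compare the token sets of two attributes' values without comparing the sets directly. It is an assumption that attributes are compared this way; the function names, the salted `hashlib.blake2b` hash family, and the toy attribute values are all hypothetical.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """Return one minimum hash value per salted hash function."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # each seed gives a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        tok.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for tok in tokens
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical token sets drawn from the values of three attributes.
name_values = {"john", "smith", "mary", "jones", "anna", "rossi"}
full_name_values = {"john", "smith", "mary", "jones", "luca", "bianchi"}
city_values = {"modena", "trento", "bologna"}

sig_name = minhash_signature(name_values)
sig_full_name = minhash_signature(full_name_values)
sig_city = minhash_signature(city_values)

similar = estimate_jaccard(sig_name, sig_full_name)
dissimilar = estimate_jaccard(sig_name, sig_city)
print(similar, dissimilar)
```

Signatures have a fixed length regardless of how many values an attribute has, which is what makes this kind of step attractive for large datasets. A full LSH scheme would additionally split the signatures into bands and bucket them, so that only attributes colliding in some band are ever compared.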


Published In

Proceedings of the VLDB Endowment, Volume 9, Issue 12
August 2016
345 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

• Research-article
