DOI: 10.1145/3035918.3035960
Research article, public access

Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services

Published: 09 May 2017

Abstract

Many works have applied crowdsourcing to entity matching (EM). While promising, these approaches are limited in that they often require a developer to be in the loop. As such, it is difficult for an organization to deploy multiple crowdsourced EM solutions, because there are simply not enough developers. To address this problem, a recent work has proposed Corleone, a solution that crowdsources the entire EM workflow, requiring no developers. While promising, Corleone is severely limited in that it does not scale to large tables. We propose Falcon, a solution that scales up the hands-off crowdsourced EM approach of Corleone, using RDBMS-style query execution and optimization over a Hadoop cluster. Specifically, we define a set of operators and develop efficient implementations. We translate a hands-off crowdsourced EM workflow into a plan consisting of these operators, optimize, then execute the plan. These plans involve both machine and crowd activities, giving rise to novel optimization techniques such as using crowd time to mask machine time. Extensive experiments show that Falcon can scale up to tables of millions of tuples, thus providing a practical solution for hands-off crowdsourced EM, to build cloud-based EM services.
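
The idea of using crowd time to mask machine time can be illustrated with a small back-of-the-envelope model. The sketch below is not Falcon's optimizer; it only assumes a linear plan in which the machine work of one step does not depend on the crowd answers of the previous step, so the two can overlap. The `Step` class, the function names, and all timings are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One step of a hypothetical plan (names and timings are illustrative)."""
    name: str
    machine_secs: float  # time the cluster spends computing this step
    crowd_secs: float    # time spent waiting for crowd answers afterwards


def sequential_time(steps: List[Step]) -> float:
    """Total time when machine and crowd activities never overlap."""
    return sum(s.machine_secs + s.crowd_secs for s in steps)


def masked_time(steps: List[Step]) -> float:
    """Total time when the machine work of step i+1 runs during the crowd
    wait of step i. Assumes that machine work does not need those answers."""
    if not steps:
        return 0.0
    total = steps[0].machine_secs
    for cur, nxt in zip(steps, steps[1:]):
        # The crowd wait and the next machine task run side by side,
        # so only the longer of the two contributes to elapsed time.
        total += max(cur.crowd_secs, nxt.machine_secs)
    total += steps[-1].crowd_secs
    return total


plan = [
    Step("sample_and_label", machine_secs=10, crowd_secs=60),
    Step("apply_blocking_rules", machine_secs=40, crowd_secs=30),
    Step("apply_matcher", machine_secs=90, crowd_secs=0),
]

print(sequential_time(plan))  # 230
print(masked_time(plan))      # 160
```

In this toy plan the 40 seconds of blocking work disappear entirely behind the first 60-second crowd wait, while the 90-second matcher only partly hides behind the 30-second wait that precedes it. Masking helps most when machine tasks are scheduled next to crowd waits of similar or longer duration.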

Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN: 9781450341974
DOI: 10.1145/3035918

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. cloud services
        2. crowdsourcing
        3. entity matching
        4. hands-off

        Conference

        SIGMOD/PODS'17

        Acceptance Rates

        Overall Acceptance Rate 785 of 4,003 submissions, 20%
