Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services

Authors:

Paul Suganthan G.C.,

Jeffrey F. Naughton,

Ganesh Krishnan,

Esteban Arcaute,

Vijay Raghavendra,

Youngchoon Park

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 1431 - 1446

https://doi.org/10.1145/3035918.3035960

Published: 09 May 2017

Abstract

Many works have applied crowdsourcing to entity matching (EM). While promising, these approaches are limited in that they often require a developer to be in the loop. As a result, it is difficult for an organization to deploy many crowdsourced EM solutions, because there are simply not enough developers. To address this problem, recent work proposed Corleone, a solution that crowdsources the entire EM workflow and requires no developers. Corleone, however, is severely limited in that it does not scale to large tables. We propose Falcon, a solution that scales up the hands-off crowdsourced EM approach of Corleone, using RDBMS-style query execution and optimization over a Hadoop cluster. Specifically, we define a set of operators and develop efficient implementations of them. We translate a hands-off crowdsourced EM workflow into a plan consisting of these operators, optimize the plan, and then execute it. These plans involve both machine and crowd activities, giving rise to novel optimization techniques such as using crowd time to mask machine time. Extensive experiments show that Falcon can scale up to tables of millions of tuples, making hands-off crowdsourced EM practical as a basis for cloud-based EM services.
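The idea of using crowd time to mask machine time can be illustrated with a small sketch. This is not Falcon's implementation: the function names (`crowd_label`, `machine_features`, `run_masked`), the simulated crowd latency, and the toy feature are all assumptions made for illustration. The sketch only shows the general pattern of scheduling machine work for the next batch while a crowd task for the current batch is still pending.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def crowd_label(pairs, delay=0.05):
    """Hypothetical stand-in for a crowd task: wait, then label each pair.

    A pair counts as a match here if the two strings are equal ignoring case.
    """
    time.sleep(delay)  # simulated crowd latency (assumption)
    return [a.lower() == b.lower() for a, b in pairs]


def machine_features(pairs):
    """Hypothetical stand-in for machine work: a toy length-difference feature."""
    return [abs(len(a) - len(b)) for a, b in pairs]


def run_masked(batches):
    """Label each batch with the 'crowd' while computing features for the next one."""
    labels, features = [], []
    if not batches:
        return labels, features
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_feats = machine_features(batches[0])
        for i, batch in enumerate(batches):
            future = pool.submit(crowd_label, batch)  # crowd task starts
            features.append(next_feats)
            if i + 1 < len(batches):
                # Machine work for the next batch overlaps with the crowd wait.
                next_feats = machine_features(batches[i + 1])
            labels.append(future.result())  # block until the crowd answers
    return labels, features


batches = [[("Apple", "apple"), ("IBM", "HP")], [("Dell", "DELL")]]
labels, features = run_masked(batches)
```

In this sketch the machine step is cheap, so the saving is negligible; the pattern pays off when machine work per batch is substantial and crowd latency would otherwise leave the cluster idle.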

Index Terms

Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services
1. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning
      2. Entity resolution
  2. World Wide Web
    1. Web applications
      1. Crowdsourcing

Information & Contributors

Information

Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

May 2017

1810 pages

ISBN:9781450341974

DOI:10.1145/3035918

General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)

Program Chair: Dan Suciu (University of Washington, USA)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NIH BD2K
NSF

Conference

SIGMOD/PODS'17

Sponsor:

SIGMOD

SIGMOD/PODS'17: International Conference on Management of Data

May 14 - 19, 2017

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

Nie T, Mao H, Liu X, Yu S (2024). Fine-Grained Tasks for Crowdsourced Entity Resolution. Applied Sciences 15:1 (4). Online publication date: 24-Dec-2024.
https://doi.org/10.3390/app15010004
Fan W, Pang K, Lu P, Tian C (2024). Making It Tractable to Detect and Correct Errors in Graphs. ACM Transactions on Database Systems 49:4 (1-75). Online publication date: 2-Nov-2024.
https://dl.acm.org/doi/10.1145/3702315
Dou W, Shen D, Zhou X, Bai H, Kou Y, Nie T, Cui H, Yu G, Serra E, Spezzano F (2024). Enhancing Deep Entity Resolution with Integrated Blocker-Matcher Training: Balancing Consensus and Discrepancy. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (508-518). Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679843
Li Z, Sun W, Zhan D, Kang Y, Chen L, Bozzon A, Hai R (2024). Amalur: The Convergence of Data Integration and Machine Learning. IEEE Transactions on Knowledge and Data Engineering 36:12 (7353-7367). Online publication date: Dec-2024.
https://doi.org/10.1109/TKDE.2024.3357389
Shahbazi N, Wang J, Miao Z, Bhutani N (2024). Fairness-Aware Data Preparation for Entity Matching. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3476-3489). Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00268
Paulsen D, Govind Y, Doan A (2023). Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching. Proceedings of the VLDB Endowment 16:6 (1507-1519). Online publication date: 20-Apr-2023.
https://dl.acm.org/doi/10.14778/3583140.3583163
Fürst J, Argerich M, Cheng B (2023). VersaMatch: Ontology Matching with Weak Supervision. Proceedings of the VLDB Endowment 16:6 (1305-1318). Online publication date: 20-Apr-2023.
https://dl.acm.org/doi/10.14778/3583140.3583148
Tu J, Fan J, Tang N, Wang P, Li G, Du X, Jia X, Gao S (2023). Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration. Proceedings of the ACM on Management of Data 1:1 (1-26). Online publication date: 30-May-2023.
https://dl.acm.org/doi/10.1145/3588938
Buono F, Faggioli G, Paganelli M, Baraldi A, Guerra F, Ferro N (2023). A Framework to Evaluate the Quality of Integrated Datasets. ACM SIGAPP Applied Computing Review 22:4 (5-23). Online publication date: 10-Feb-2023.
https://dl.acm.org/doi/10.1145/3584014.3584015
Wang R, Li Y, Wang J (2023). Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (1502-1515). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00391
