Abstract
Crowdsourced entity resolution has recently attracted significant attentions because it can harness the wisdom of crowd to improve the quality of entity resolution. However, existing techniques either cannot achieve high quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution framework, which significantly reduces the monetary cost while keeping high quality. We first define a partial order on the pairs of records. Then, we select a pair as a question and ask the crowd to check whether the records in the pair refer to the same entity. After getting the answer of this pair, we infer the answers of other pairs based on the partial order. Next, we iteratively select pairs without answers to ask until we get the answers of all pairs. We devise effective algorithms to judiciously select the pairs to ask in order to minimize the number of asked pairs. To further reduce the cost, we propose a grouping technique to group the pairs and we only ask one pair instead of all pairs in each group. We develop error-tolerant techniques to tolerate the errors introduced by the partial order and the crowd. We also study the budget-aware entity resolution, which, given a budget, finds the maximum number of matching pairs within the budget, and propose effective optimization techniques. Experimental results show that our method reduces the cost to 1.25% of existing approaches (or existing approaches take \(80\times \) monetary cost of our method) while not sacrificing the quality.
Similar content being viewed by others
Notes
Note to avoid duplicately comparing two pairs in \(\mathcal {U} (p_{ij})\), we can only select pivots from \(\mathcal {C} (p_{ij})\) and \(\mathcal {P} (p_{ij})\).
References
Dawid, A.P., Skene, A.M.: Maximum likelihood estimation of observer error-rates using em algorithm. Appl. Stat. 28(1), 20–28 (1979)
Aydin, B.I., Yilmaz, Y.S., Li, Y., Li, Q., Gao, J., Demirbas, M.: Crowdsourcing for multiple-choice question answering. In: AAAI, pp. 2946–2953 (2014)
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 Aug 2003. pp. 39–48 (2003)
Cao, C.C., She, J., Tong, Y., Chen, L.: Whom to ask? Jury selection for decision making tasks on micro-blog services. PVLDB 5(11), 1495–1506 (2012)
Chai, C., Li, G., Li, J., Deng, D., Feng. J.: Cost-effective crowdsourced entity resolution: a partial-order approach. In: SIGMOD, pp. 969–984 (2016)
Chai, C., Li, G., Li, J., Deng, D., Feng, J.: A partial-order framework for cost-effective crowdsourced entity resolution. In: Technical Report. http://dbgroup.cs.tsinghua.edu.cn/ligl/papers/vldbj2018-partialorder.pdf. Accessed 11 June 2018
Chen, X., Bennett, P.N., Collins-Thompson, K., Horvitz, E.: Pairwise ranking aggregation in a crowdsourced setting. In: WSDM, pp. 193–202 (2013)
Davidson, S.B., Khanna, S., Milo, T., Roy, S.: Using the crowd for top-k and group-by queries. In: ICDT, pp. 225–236 (2013)
Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85–96 (2005)
Eriksson, B.: Learning to top-k search using pairwise comparisons. In: AISTATS, pp. 265–273 (2013)
Fan, J., Li, G., Ooi, B. C., Tan, K., Feng, J.: icrowd: an adaptive crowdsourcing framework. In: SIGMOD, pp. 1015–1030 (2015)
Felsner, S., Raghavan, V., Spinrad, J.P.: Recognition algorithms for orders of small width and graphs of small dilworth number. Order 20(4), 351–364 (2003)
Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R.: Crowddb: answering queries with crowdsourcing. In: SIGMOD, pp. 61–72 (2011)
Fulkerson, D.R.: Note on Dilworth’s decomposition theorem for partially ordered sets. Am. Math. Soc. 7(4), 701–702 (1956)
Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J.W., Zhu, X.: Corleone: hands-off crowdsourcing for entity matching. In: SIGMOD, pp. 601–612 (2014)
Guo, S., Parameswaran, A.G., Garcia-Molina, H.: So who won? Dynamic max discovery with the crowd. In: SIGMOD, pp. 385–396 (2012)
Ho, C.-J., Jabbari, S., Vaughan, J.W.: Adaptive task assignment for crowdsourced classification. In: ICML, pp. 534–542 (2013)
Ipeirotis, P.G., Provost, F., Wang, J.: Quality management on amazon mechanical turk. In: SIGKDD workshop on human computation, pp. 64–67. ACM (2010)
Joglekar, M., Garcia-Molina, H., Parameswaran, A.G.: Evaluating the crowd with confidence. In: SIGKDD, pp. 686–694 (2013)
Karger, D.R., Oh, S., Shah, D.: Iterative learning for reliable crowdsourcing systems. In: NIPS, pp. 1953–1961 (2011)
Kazemi, L., Shahabi, C., Chen, L.: Geotrucrowd: trustworthy query answering with spatial crowdsourcing. In: SIGSPATIAL, pp. 304–313 (2013)
Khan, A.R., Garcia-Molina, H.: Hybrid strategies for finding the max with the crowd. In: Technical Report (2014)
Konda, P., Das, S., Suganthan, G.C.P., Doan, A., Ardalan, A., Ballard, J.R., Li, H., Panahi, F., Zhang, H., Naughton, J.F., Prasad, S., Krishnan, G., Deep, R., Raghavendra, V.: Magellan: toward building entity matching management systems. PVLDB 9(12), 1197–1208 (2016)
Kreveld, M.,Overmars, M., Schwarzkopf,O., Berg, M.d., Schwart- skopf, O.: Computational geometry: algorithms and applications, pp. I-XII, 1–365. Springer (1997). ISBN 354061270X
Li, G.: Human-in-the-loop data integration. PVLDB 10(12), 2006–2017 (2017)
Li, G., Chai, C., Fan, J., Weng, X., Li, J., Zheng, Y., Li, Y., Yu, X., Zhang, X., Yuan, H.: CDB: optimizing queries with crowd-based selections and joins. In: SIGMOD, pp. 1463–1478 (2017)
Li, G., Wang, J., Zheng, Y., Franklin, M.J.: Crowdsourced data management: a survey. IEEE Trans. Knowl. Data Eng. 28(9), 2296–2319 (2016)
Liu, X., Lu, M., Ooi, B.C., Shen, Y., Wu, S., Zhang, M.: CDAS: a crowdsourcing data analytics system. PVLDB 5(10), 1040–1051 (2012)
Marcus, A., Wu, E., Karger, D.R., Madden, S., Miller, R.C.: Human-powered sorts and joins. PVLDB 5(1), 13–24 (2011)
Marcus, A., Wu, E., Madden, S., Miller, R.C.: Crowdsourced databases: query processing with people. In: CIDR, pp. 211–214 (2011)
Megiddo, N., Supowit, K.J.: On the complexity of some common geometric location problems. SIAM J. Comput. 13(1), 182–196 (1984)
Ouyang, W.R., Kaplan, L.M. et al.: Debiasing crowdsourced quantitative characteristics in local businesses and services. In: IPSN, pp. 190–201 (2015)
Parameswaran, A.G., Boyd, S.P., Garcia-Molina, H., Gupta, A., Polyzotis, N., Widom, J.: Optimal crowd-powered rating and filtering algorithms. PVLDB 7(9), 685–696 (2014)
Parameswaran, A.G., Garcia-Molina, H., Park, H., Polyzotis, N., Ramesh, A., Widom, J.: Crowdscreen: algorithms for filtering data with humans. In: SIGMOD, pp. 361–372 (2012)
Parameswaran, A.G., Park, H., Garcia-Molina, H., Polyzotis, N., Widom, J.: Deco: declarative crowdsourcing. In: CIKM, pp. 1203–1212 (2012)
Park, H., Widom, J.: Query optimization over crowdsourced data. PVLDB 6(10), 781–792 (2013)
Park, H., Widom, J.: Crowdfill: collecting structured data from the crowd. In: SIGMOD, pp. 577–588 (2014)
Pfeiffer, T., Gao, X.A., Chen, Y., Mao, A., Rand, D.G.: Adaptive polling for information aggregation. In: AAAI (2012)
Raykar, V.C., Yu, S., Zhao, L.H., Jerebko, A.K., Florin, C., Valadez, G.H., Bogoni, L., Moy, L.: Supervised learning from multiple experts: whom to trust when everyone lies a bit. In: ICML, pp. 889–896 (2009)
Sarma, A.D., Parameswaran, A.G., Garcia-Molina, H., Halevy, A.Y.: Crowd-powered find algorithms. In: ICDE, pp. 964–975 (2014)
Trushkowsky, B., Kraska, T., Franklin, M.J., Sarkar, P.: Crowdsourced enumeration queries. In: ICDE, pp. 673–684 (2013)
Venanzi, M., Guiver, J., Kazai, G., Kohli, P., Shokouhi, M.: Community-based bayesian aggregation models for crowdsourcing. In: WWW, pp. 155–164 (2014)
Venetis, P., Garcia-Molina, H., Huang, K., Polyzotis, N.: Max algorithms in crowdsourcing environments. In: WWW, pp. 989–998 (2012)
Verroios, V., Garcia-Molina, H.: Entity resolution with crowd errors. In: ICDE, pp. 219–230 (2015)
Vesdapunt, N., Bellare, K., Dalvi, N.N.: Crowdsourcing algorithms for entity resolution. In: PVLDB (2014)
Wang, J., Kraska, T., Franklin, M.J., Feng, J.: Crowder: Crowdsourcing entity resolution. In: PVLDB (2012)
Wang, J., Li, G., Kraska, T., Franklin, M.J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: SIGMOD, pp. 229–240 (2013)
Wang, S., Xiao, X., Lee, C.: Crowd-based deduplication: An adaptive approach. In: SIGMOD, pp. 1263–1277 (2015)
Whang, S.E., Lofgren, P., Garcia-Molina, H.: Question selection for crowd entity resolution. In: PVLDB (2013)
Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., Movellan, J.R.: Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In: NIPS, pp. 2035–2043 (2009)
Ye, P., EDU, U., Doermann, D.: Combining preference and absolute judgements in a crowd-sourced setting. In: ICML Workshop (2013)
Zhang, C.J., Chen, L., Jagadish, H.V., Cao, C.C.: Reducing uncertainty of schema matching via crowdsourcing. PVLDB 6(9), 757–768 (2013)
Zhao, Z., Wei, F., Zhou, M., Chen, W., Ng, W.: Crowd-selection query processing in crowdsourcing databases: a task-driven approach. In: EDBT, pp. 397–408 (2015)
Zhao, Z., Yan, D., Ng, W., Gao, S.: A transfer learning based framework of crowd-selection on twitter. In: SIGKDD, pp. 1514–1517 (2013)
Zheng, Y., Cheng, R., Maniu, S., Mo, L.: On optimality of jury selection in crowdsourcing. In: EDBT, pp. 193–204 (2015)
Zheng, Y., Li, G., Cheng, R.: DOCS: domain-aware crowdsourcing system. PVLDB 10(4), 361–372 (2016)
Zheng, Y., Li, G., Li, Y., Shan, C., Cheng, R.: Truth inference in crowdsourcing: is the problem solved? PVLDB 10(5), 541–552 (2016)
Zheng, Y., Wang, J., Li, G., Cheng, R., Feng, J.: QASCA: A quality-aware task assignment system for crowdsourcing applications. In: SIGMOD, pp. 1031–1046 (2015)
Acknowledgements
This work was supported by the 973 Program of China (2015CB358700), NSF of China (61632016, 61472198, 61521002, 61661166012), and TAL education.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Chai, C., Li, G., Li, J. et al. A partial-order-based framework for cost-effective crowdsourced entity resolution. The VLDB Journal 27, 745–770 (2018). https://doi.org/10.1007/s00778-018-0509-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-018-0509-6