DOI: 10.1145/1557019.1557053

Efficiently learning the accuracy of labeling sources for selective sampling

Published: 28 June 2009

Abstract

Many scalable data mining tasks rely on active learning to provide the most useful, accurately labeled instances. However, what if there are multiple labeling sources ('oracles' or 'experts') with different but unknown reliabilities? With the recent advent of inexpensive and scalable online annotation tools, such as Amazon's Mechanical Turk, the labeling process has become more vulnerable to noise, and it often proceeds without prior knowledge of the accuracy of each individual labeler. This paper addresses exactly that challenge: how to jointly learn the accuracy of the labeling sources and obtain the most informative labels for the active learning task at hand while minimizing total labeling effort. More specifically, we present IEThresh (Interval Estimate Threshold), a strategy that intelligently selects the expert(s) with the highest estimated labeling accuracy. IEThresh estimates a confidence interval on the reliability of each expert and filters out those whose estimated upper confidence bound falls below a threshold; this criterion jointly optimizes expected accuracy (the mean) and the need to better estimate an expert's accuracy (the variance). Our framework is flexible enough to work with a wide range of noise levels, and it outperforms baselines such as asking all available experts and random expert selection. In particular, IEThresh achieves a given level of accuracy with fewer than half the queries issued by all-experts labeling and fewer than a third of the queries required by random expert selection on datasets such as UCI mushroom. The results show that our method naturally balances exploration and exploitation: as it gains knowledge of which experts to rely upon, it selects them with increasing frequency.
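
The interval-estimation selection rule sketched in the abstract can be illustrated in a few lines. The following is a minimal Python sketch, not the paper's exact formulation: each labeler's observed 0/1 agreement rewards yield an upper confidence bound on its mean reward, and only labelers whose bound is within a fraction of the best bound are queried. The critical value 1.96, the threshold fraction 0.9, and the majority-vote reward signal are assumptions of this sketch.

```python
import math
import random
from statistics import mean, stdev

def upper_interval(rewards, z=1.96):
    """Upper end of a confidence interval on a labeler's mean reward.

    `rewards` holds 0/1 agreement scores observed so far. A labeler with
    fewer than two observations gets an infinite bound, so unexplored
    labelers always look worth querying (exploration).
    """
    n = len(rewards)
    if n < 2:
        return float("inf")
    return mean(rewards) + z * stdev(rewards) / math.sqrt(n)

def select_labelers(history, epsilon=0.9):
    """Query every labeler whose upper bound is within a fraction
    `epsilon` of the best labeler's upper bound (exploitation)."""
    ui = {name: upper_interval(r) for name, r in history.items()}
    cutoff = epsilon * max(ui.values())
    return [name for name, u in ui.items() if u >= cutoff]

# Toy simulation: three labelers with hidden accuracies; the reward for
# each queried labeler is its agreement with the round's majority vote.
random.seed(0)
true_acc = {"A": 0.9, "B": 0.7, "C": 0.5}
history = {name: [] for name in true_acc}
for _ in range(200):
    chosen = select_labelers(history)
    votes = {name: random.random() < true_acc[name] for name in chosen}
    majority = 2 * sum(votes.values()) >= len(votes)  # ties count as True
    for name, vote in votes.items():
        history[name].append(1.0 if vote == majority else 0.0)
```

Under these assumptions the loop first queries every labeler (all bounds are infinite), then concentrates queries on labelers whose estimated accuracy stays competitive, mirroring the exploration/exploitation balance the abstract describes.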

Supplementary Material

JPG File (p259-donmez.jpg)
MP4 File (p259-donmez.mp4)




Published In

KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
June 2009
1426 pages
ISBN:9781605584959
DOI:10.1145/1557019
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. active learning
  2. estimation
  3. labeler selection
  4. noisy labelers

Qualifiers

  • Research-article

Conference

KDD09

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%



Article Metrics

  • Downloads (Last 12 months)36
  • Downloads (Last 6 weeks)3
Reflects downloads up to 02 Mar 2025


Cited By

  • (2024) CLA-RA: Collaborative Active Learning Amidst Relabeling Ambiguity. 2024 IEEE International Conference on Software Services Engineering (SSE), pp. 18-24. DOI: 10.1109/SSE62657.2024.00016. Online publication date: 7-Jul-2024.
  • (2024) Active Learning and Bayesian Optimization: A Unified Perspective to Learn with a Goal. Archives of Computational Methods in Engineering. DOI: 10.1007/s11831-024-10064-z. Online publication date: 23-Apr-2024.
  • (2024) Trustworthy human computation: a survey. Artificial Intelligence Review, 57:12. DOI: 10.1007/s10462-024-10974-1. Online publication date: 12-Oct-2024.
  • (2023) ASSBert: Active and semi-supervised bert for smart contract vulnerability detection. Journal of Information Security and Applications, 73 (103423). DOI: 10.1016/j.jisa.2023.103423. Online publication date: Mar-2023.
  • (2023) On the application of active learning for efficient and effective IoT botnet detection. Future Generation Computer Systems, 141, pp. 40-53. DOI: 10.1016/j.future.2022.10.024. Online publication date: Apr-2023.
  • (2022) Vexation-Aware Active Learning for On-Menu Restaurant Dish Availability. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3116-3126. DOI: 10.1145/3534678.3539152. Online publication date: 14-Aug-2022.
  • (2022) Debiased Label Aggregation for Subjective Crowdsourcing Tasks. CHI Conference on Human Factors in Computing Systems Extended Abstracts, pp. 1-8. DOI: 10.1145/3491101.3519614. Online publication date: 27-Apr-2022.
  • (2022) Enabling Efficient and Strong Privacy-Preserving Truth Discovery in Mobile Crowdsensing. IEEE Transactions on Information Forensics and Security, 17, pp. 3569-3581. DOI: 10.1109/TIFS.2022.3207905. Online publication date: 2022.
  • (2022) Knowledge Learning With Crowdsourcing: A Brief Review and Systematic Perspective. IEEE/CAA Journal of Automatica Sinica, 9:5, pp. 749-762. DOI: 10.1109/JAS.2022.105434. Online publication date: May-2022.
  • (2022) Augmented visualization cues on primary flight display facilitating pilot's monitoring performance. International Journal of Human-Computer Studies, 135:C. DOI: 10.1016/j.ijhcs.2019.102377. Online publication date: 21-Apr-2022.
