DOI: 10.1145/1557019.1557053

Efficiently learning the accuracy of labeling sources for selective sampling

Published: 28 June 2009

Abstract

Many scalable data mining tasks rely on active learning to provide the most useful, accurately labeled instances. However, what if there are multiple labeling sources ('oracles' or 'experts') with different but unknown reliabilities? With the recent advent of inexpensive and scalable online annotation tools, such as Amazon's Mechanical Turk, the labeling process has become more vulnerable to noise, and it often proceeds without prior knowledge of the accuracy of each individual labeler. This paper addresses exactly that challenge: how to jointly learn the accuracy of the labeling sources and obtain the most informative labels for the active learning task at hand while minimizing total labeling effort. More specifically, we present IEThresh (Interval Estimate Threshold), a strategy that intelligently selects the expert(s) with the highest estimated labeling accuracy. IEThresh estimates a confidence interval on the reliability of each expert and filters out those whose estimated upper confidence bound falls below a threshold; this criterion jointly optimizes expected accuracy (the mean) and the need to better estimate an expert's accuracy (the variance). Our framework is flexible enough to work with a wide range of noise levels, and it outperforms baselines such as asking all available experts and random expert selection. In particular, IEThresh achieves a given level of accuracy with fewer than half the queries issued by all-experts labeling and fewer than a third of the queries required by random expert selection on datasets such as UCI mushroom. The results show that our method naturally balances exploration and exploitation: as it gains knowledge of which experts to rely upon, it selects them with increasing frequency.
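
The interval-estimation selection rule sketched in the abstract can be illustrated in a few lines. The following is a minimal Python sketch, not the paper's exact formulation: each labeler's observed 0/1 agreement rewards yield an upper confidence bound on its mean reward, and only labelers whose bound is within a fraction of the best bound are queried. The critical value 1.96, the threshold fraction 0.9, and the majority-vote reward signal are assumptions of this sketch.

```python
import math
import random
from statistics import mean, stdev

def upper_interval(rewards, z=1.96):
    """Upper end of a confidence interval on a labeler's mean reward.

    `rewards` holds 0/1 agreement scores observed so far. A labeler with
    fewer than two observations gets an infinite bound, so unexplored
    labelers always look worth querying (exploration).
    """
    n = len(rewards)
    if n < 2:
        return float("inf")
    return mean(rewards) + z * stdev(rewards) / math.sqrt(n)

def select_labelers(history, epsilon=0.9):
    """Query every labeler whose upper bound is within a fraction
    `epsilon` of the best labeler's upper bound (exploitation)."""
    ui = {name: upper_interval(r) for name, r in history.items()}
    cutoff = epsilon * max(ui.values())
    return [name for name, u in ui.items() if u >= cutoff]

# Toy simulation: three labelers with hidden accuracies; the reward for
# each queried labeler is its agreement with the round's majority vote.
random.seed(0)
true_acc = {"A": 0.9, "B": 0.7, "C": 0.5}
history = {name: [] for name in true_acc}
for _ in range(200):
    chosen = select_labelers(history)
    votes = {name: random.random() < true_acc[name] for name in chosen}
    majority = 2 * sum(votes.values()) >= len(votes)  # ties count as True
    for name, vote in votes.items():
        history[name].append(1.0 if vote == majority else 0.0)
```

Under these assumptions the loop first queries every labeler (all bounds are infinite), then concentrates queries on labelers whose estimated accuracy stays competitive, mirroring the exploration/exploitation balance the abstract describes.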

Supplementary Material

JPG File (p259-donmez.jpg)
MP4 File (p259-donmez.mp4)




Published In

KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
June 2009
1426 pages
ISBN:9781605584959
DOI:10.1145/1557019
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. active learning
  2. estimation
  3. labeler selection
  4. noisy labelers

Qualifiers

  • Research-article

Conference

KDD09

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%



Article Metrics

  • Downloads (Last 12 months)36
  • Downloads (Last 6 weeks)3
Reflects downloads up to 02 Mar 2025


Cited By

  • (2024) CLA-RA: Collaborative Active Learning Amidst Relabeling Ambiguity. 2024 IEEE International Conference on Software Services Engineering (SSE), pp. 18-24. DOI: 10.1109/SSE62657.2024.00016. Online publication date: 7-Jul-2024.
  • (2024) Active Learning and Bayesian Optimization: A Unified Perspective to Learn with a Goal. Archives of Computational Methods in Engineering. DOI: 10.1007/s11831-024-10064-z. Online publication date: 23-Apr-2024.
  • (2024) Trustworthy human computation: a survey. Artificial Intelligence Review, 57:12. DOI: 10.1007/s10462-024-10974-1. Online publication date: 12-Oct-2024.
  • (2023) ASSBert: Active and semi-supervised bert for smart contract vulnerability detection. Journal of Information Security and Applications, 73 (103423). DOI: 10.1016/j.jisa.2023.103423. Online publication date: Mar-2023.
  • (2023) On the application of active learning for efficient and effective IoT botnet detection. Future Generation Computer Systems, 141, pp. 40-53. DOI: 10.1016/j.future.2022.10.024. Online publication date: Apr-2023.
  • (2022) Vexation-Aware Active Learning for On-Menu Restaurant Dish Availability. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3116-3126. DOI: 10.1145/3534678.3539152. Online publication date: 14-Aug-2022.
  • (2022) Debiased Label Aggregation for Subjective Crowdsourcing Tasks. CHI Conference on Human Factors in Computing Systems Extended Abstracts, pp. 1-8. DOI: 10.1145/3491101.3519614. Online publication date: 27-Apr-2022.
  • (2022) Enabling Efficient and Strong Privacy-Preserving Truth Discovery in Mobile Crowdsensing. IEEE Transactions on Information Forensics and Security, 17, pp. 3569-3581. DOI: 10.1109/TIFS.2022.3207905. Online publication date: 2022.
  • (2022) Knowledge Learning With Crowdsourcing: A Brief Review and Systematic Perspective. IEEE/CAA Journal of Automatica Sinica, 9:5, pp. 749-762. DOI: 10.1109/JAS.2022.105434. Online publication date: May-2022.
  • (2022) Augmented visualization cues on primary flight display facilitating pilot's monitoring performance. International Journal of Human-Computer Studies, 135:C. DOI: 10.1016/j.ijhcs.2019.102377. Online publication date: 21-Apr-2022.
