
Learning to Quantify: Estimating Class Prevalence via Supervised Learning

Published: 18 July 2019

Abstract

Quantification (also known as "supervised prevalence estimation" [2], or "class prior estimation" [7]) is the task of estimating, given a set σ of unlabelled items and a set of classes C = {c1, …, c|C|}, the relative frequency (or "prevalence") p(ci) of each class ci ∈ C, i.e., the fraction of items in σ that belong to ci. When each item belongs to exactly one class, since 0 ≤ p(ci) ≤ 1 and Σ_{ci ∈ C} p(ci) = 1, p is a distribution of the items in σ across the classes in C (the true distribution), and quantification thus amounts to estimating p (i.e., to computing a predicted distribution p̂).
Quantification is important in many disciplines (e.g., market research, political science, the social sciences, and epidemiology) that usually deal with aggregate (as opposed to individual) data. In these contexts, classifying individual unlabelled instances is usually not a primary goal; estimating the prevalence of the classes of interest in the data is. For instance, when classifying the tweets about a certain entity (e.g., a political candidate) as displaying either a Positive or a Negative stance towards the entity, we are usually not interested in the class of any specific tweet: instead, we want to know the fraction of these tweets that belong to each class [14].
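As a toy illustration of the task definition above (not code from the tutorial itself): the simplest quantification method, the "classify and count" baseline discussed in the quantification literature (e.g., [13, 20]), produces p̂ by classifying every item in σ and reporting the resulting label fractions. All function and variable names below are illustrative.

```python
from collections import Counter

def true_prevalence(labels, classes):
    """Fraction p(c) of items in `labels` belonging to each class c."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts.get(c, 0) / n for c in classes}

def classify_and_count(classifier, items, classes):
    """Naive quantifier: classify every item, then report label fractions.
    `classifier` is any function mapping an item to a class label."""
    predicted = [classifier(x) for x in items]
    return true_prevalence(predicted, classes)

# Toy stance example: estimate the fraction of Positive tweets.
tweets = ["love this candidate", "awful speech", "love the plan", "love it"]
stance = lambda t: "Positive" if "love" in t else "Negative"
print(classify_and_count(stance, tweets, ["Positive", "Negative"]))
# → {'Positive': 0.75, 'Negative': 0.25}
```

Note that classify-and-count is biased whenever the classifier is imperfect, which is precisely why dedicated quantification methods exist.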

References

[1] Barranquero, J., Díez, J. and del Coz, J. J. (2015), 'Quantification-oriented learning based on reliable classifiers', Pattern Recognition 48(2), 591--604.
[2] Barranquero, J., González, P., Díez, J. and del Coz, J. J. (2013), 'On the study of nearest neighbor algorithms for prevalence estimation in binary problems', Pattern Recognition 46(2), 472--482.
[3] Beijbom, O., Hoffman, J., Yao, E., Darrell, T., Rodriguez-Ramirez, A., Gonzalez-Rivero, M. and Hoegh-Guldberg, O. (2015), Quantification in-the-wild: Data-sets and baselines. CoRR abs/1510.04811 (2015). Presented at the NIPS 2015 Workshop on Transfer and Multi-Task Learning, Montreal, CA.
[4] Bella, A., Ferri, C., Hernández-Orallo, J. and Ramírez-Quintana, M. J. (2010), Quantification via probability estimators, in 'Proceedings of the 11th IEEE International Conference on Data Mining (ICDM 2010)', Sydney, AU, pp. 737--742.
[5] Ceron, A., Curini, L. and Iacus, S. M. (2016), 'iSA: A fast, scalable and accurate algorithm for sentiment analysis of social media content', Information Sciences 367/368, 105--124.
[6] Da San Martino, G., Gao, W. and Sebastiani, F. (2016), Ordinal text quantification, in 'Proceedings of the 39th ACM Conference on Research and Development in Information Retrieval (SIGIR 2016)', Pisa, IT, pp. 937--940.
[7] du Plessis, M. C., Niu, G. and Sugiyama, M. (2017), 'Class-prior estimation for learning from positive and unlabeled data', Machine Learning 106(4), 463--492.
[8] Esuli, A. (2016), ISTI-CNR at SemEval-2016 Task 4: Quantification on an ordinal scale, in 'Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016)', San Diego, US.
[9] Esuli, A., Moreo, A. and Sebastiani, F. (2019), 'Cross-lingual sentiment quantification', arXiv:1904.07965.
[10] Esuli, A., Moreo, A., Sebastiani, F. and Trevisan, D. (2019), Evaluation protocols for quantification. Submitted for publication.
[11] Esuli, A. and Sebastiani, F. (2010), 'Sentiment quantification', IEEE Intelligent Systems 25(4), 72--75.
[12] Esuli, A. and Sebastiani, F. (2015), 'Optimizing text quantifiers for multivariate loss functions', ACM Transactions on Knowledge Discovery from Data 9(4), Article 27.
[13] Forman, G. (2008), 'Quantifying counts and costs via classification', Data Mining and Knowledge Discovery 17(2), 164--206.
[14] Gao, W. and Sebastiani, F. (2016), 'From classification to quantification in tweet sentiment analysis', Social Network Analysis and Mining 6(19), 1--22.
[15] González, P., Castaño, A., Chawla, N. V. and del Coz, J. J. (2017), 'A review on quantification learning', ACM Computing Surveys 50(5), 74:1--74:40.
[16] González-Castro, V., Alaiz-Rodríguez, R. and Alegre, E. (2013), 'Class distribution estimation based on the Hellinger distance', Information Sciences 218, 146--164.
[17] Hopkins, D. J. and King, G. (2010), 'A method of automated nonparametric content analysis for social science', American Journal of Political Science 54(1), 229--247.
[18] Kar, P., Li, S., Narasimhan, H., Chawla, S. and Sebastiani, F. (2016), Online optimization methods for the quantification problem, in 'Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016)', San Francisco, US, pp. 1625--1634.
[19] King, G. and Lu, Y. (2008), 'Verbal autopsy methods with multiple causes of death', Statistical Science 23(1), 78--91.
[20] Levin, R. and Roitman, H. (2017), Enhanced probabilistic classify and count methods for multi-label text quantification, in 'Proceedings of the 7th ACM International Conference on the Theory of Information Retrieval (ICTIR 2017)', Amsterdam, NL, pp. 229--232.
[21] Maletzke, A. G., Moreira dos Reis, D. and Batista, G. E. (2018), 'Combining instance selection and self-training to improve data stream quantification', Journal of the Brazilian Computer Society 24(12), 43--48.
[22] Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D. and Sebastiani, F. (2013), Quantification trees, in 'Proceedings of the 13th IEEE International Conference on Data Mining (ICDM 2013)', Dallas, US, pp. 528--536.
[23] Milli, L., Monreale, A., Rossetti, G., Pedreschi, D., Giannotti, F. and Sebastiani, F. (2015), Quantification in social networks, in 'Proceedings of the 2nd IEEE International Conference on Data Science and Advanced Analytics (DSAA 2015)', Paris, FR.
[24] Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V. and Herrera, F. (2012), 'A unifying view on dataset shift in classification', Pattern Recognition 45(1), 521--530.
[25] Nakov, P., Farra, N. and Rosenthal, S. (2017), SemEval-2017 Task 4: Sentiment analysis in Twitter, in 'Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017)', Vancouver, CA.
[26] Nakov, P., Ritter, A., Rosenthal, S., Sebastiani, F. and Stoyanov, V. (2016), SemEval-2016 Task 4: Sentiment analysis in Twitter, in 'Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016)', San Diego, US, pp. 1--18.
[27] Pérez-Gállego, P., Quevedo, J. R. and del Coz, J. J. (2017), 'Using ensembles for problems with characterizable changes in data distribution: A case study on quantification', Information Fusion 34, 87--100.
[28] Saerens, M., Latinne, P. and Decaestecker, C. (2002), 'Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure', Neural Computation 14(1), 21--41.
[29] Sanya, A., Kumar, P., Kar, P., Chawla, S. and Sebastiani, F. (2018), 'Optimizing non-decomposable measures with deep networks', Machine Learning 107(8--10), 1597--1620.
[30] Sebastiani, F. (2018), 'Evaluation measures for quantification: An axiomatic approach', arXiv:1809.01991.
[31] Storkey, A. (2009), When training and test sets are different: Characterizing learning transfer, in J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer and N. D. Lawrence, eds, 'Dataset shift in machine learning', The MIT Press, Cambridge, US, pp. 3--28.
[32] Tang, L., Gao, H. and Liu, H. (2010), Network quantification despite biased labels, in 'Proceedings of the 8th Workshop on Mining and Learning with Graphs (MLG 2010)', Washington, US, pp. 147--154.
[33] Vapnik, V. (1998), Statistical Learning Theory, Wiley, New York, US.

Cited By

  • (2021) Stopping Criteria for Technology Assisted Reviews based on Counting Processes. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/3404835.3463013, pp. 2293--2297. Online publication date: 11-Jul-2021.


Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN:9781450361729
DOI:10.1145/3331184
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. class prior estimation
  2. quantification
  3. supervised learning
  4. supervised prevalence estimation

Qualifiers

  • Tutorial

Conference

SIGIR '19

Acceptance Rates

SIGIR'19 Paper Acceptance Rate 84 of 426 submissions, 20%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%


