Abstract
Unknown Unknowns (UUs) are predictions that a model makes with high confidence but that are nevertheless wrong. Identifying UUs is important for understanding the limitations of predictive models, and several proposed solutions identify them effectively. However, all of these methods assume a perfect Oracle that returns the correct label for every queried instance. This assumption is impractical: no perfect Oracle exists in the real world, and even human experts make mistakes when labelling UUs. Such labelling errors have serious consequences, because fake UUs mislead the existing algorithms and degrade their performance. In this paper, we characterise the impact of a noisy Oracle and propose a UU identification algorithm adapted to the noisy-Oracle setting. Experimental results demonstrate the effectiveness of the proposed method.
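To make the setting concrete, the following is a minimal sketch, not the authors' implementation: the binary task, the noise rate, and all function names are illustrative assumptions. It shows that a candidate UU is a high-confidence prediction, and that an Oracle which occasionally returns wrong labels can both create fake UUs and hide genuine ones.

```python
import numpy as np

rng = np.random.default_rng(0)
TAU = 0.65           # "high confidence" threshold (illustrative value; see Notes)
ORACLE_NOISE = 0.2   # assumed probability that the Oracle returns a wrong label

def noisy_oracle(true_label: int) -> int:
    """Return the true label, flipped with probability ORACLE_NOISE (binary task)."""
    return 1 - true_label if rng.random() < ORACLE_NOISE else true_label

def candidate_uus(confidences: np.ndarray) -> np.ndarray:
    """Indices of high-confidence predictions: the search space for UUs."""
    return np.where(confidences >= TAU)[0]

def reported_uus(confidences, preds, true_labels):
    """A genuine UU is a confident prediction that disagrees with the true label;
    with a noisy Oracle, the reported set can contain fake UUs and miss real ones."""
    idx = candidate_uus(confidences)
    oracle_labels = np.array([noisy_oracle(int(y)) for y in true_labels[idx]])
    return idx[preds[idx] != oracle_labels]
```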
Notes
- 4.
Actually, \(\tau \) is a parameter worth discussing, since different thresholds construct different search spaces. We tried several candidate values, such as 0.70 and 0.75, in our experiments, and the results were essentially consistent, so we adopt the value used in previous works [3, 14] without further discussion; a small illustration follows these notes.
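As a rough illustration of this remark (the confidence values below are synthetic, not the paper's data), the snippet shows how different choices of \(\tau \) admit different sets of points into the candidate search space:

```python
import numpy as np

rng = np.random.default_rng(1)
confidences = rng.uniform(0.5, 1.0, size=1000)  # synthetic top-class confidences

# Each threshold defines its own search space of high-confidence points;
# larger values of tau keep only the most confident (and typically fewer) candidates.
for tau in (0.65, 0.70, 0.75):
    search_space = np.where(confidences >= tau)[0]
    print(f"tau={tau:.2f}: {search_space.size} candidate points")
```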
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)
Attenberg, J., Ipeirotis, P., Provost, F.: Beat the machine: challenging humans to find a predictive model’s unknown unknowns. J. Data Inf. Qual. (JDIQ) 6(1), 1 (2015)
Bansal, G., Weld, D.S.: A coverage-based utility model for identifying unknown unknowns. In: AAAI (2018)
Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: ACL, pp. 187–205 (2007)
Chandola, V., Banerjee, A., Kumar, V.: Outlier detection: a survey. ACM Comput. Surv. (2007)
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. (CSUR) 41(3), 15 (2009)
Elkan, C.: The foundations of cost-sensitive learning. In: IJCAI, vol. 17, pp. 973–978. Lawrence Erlbaum Associates Ltd. (2001)
Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: ICML, pp. 513–520 (2011)
Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Burlington (2011)
Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
Hsueh, P.Y., Melville, P., Sindhwani, V.: Data quality from crowdsourcing: a study of annotation selection criteria. In: NAACL HLT Workshop on Active Learning and NLP, pp. 27–35. Association for Computational Linguistics (2009)
Hu, R., Delany, S., MacNamee, B.: Sampling with confidence: using K-NN confidence measures in active learning. In: ICCBR, p. 50 (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
Lakkaraju, H., Kamar, E., Caruana, R., Horvitz, E.: Identifying unknown unknowns in the open world: representations and policies for guided exploration. In: AAAI, pp. 2124–2132 (2017)
Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In: Croft, B.W., van Rijsbergen, C.J. (eds.) SIGIR, pp. 3–12. Springer, New York (1994). https://doi.org/10.1007/978-1-4471-2099-5_1
McAuley, J., Pandey, R., Leskovec, J.: Inferring networks of substitutable and complementary products. In: KDD, pp. 785–794. ACM (2015)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Pan, S.J., Yang, Q.: A survey on transfer learning. TKDE 22(10), 1345–1359 (2010)
Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: ACL, p. 271. Association for Computational Linguistics (2004)
Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: ACL, pp. 115–124. Association for Computational Linguistics (2005)
Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you? Explaining the predictions of any classifier. In: KDD, pp. 1135–1144. ACM (2016)
Sani, S., Wiratunga, N., Massie, S., Cooper, K.: kNN sampling for personalised human activity recognition. In: Aha, D.W., Lieber, J. (eds.) ICCBR 2017. LNCS (LNAI), vol. 10339, pp. 330–344. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61030-6_23
Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2010)
Settles, B., Craven, M., Ray, S.: Multiple-instance active learning. In: NIPS, pp. 1289–1296 (2008)
Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: COLT, pp. 287–294. ACM (1992)
Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: KDD, pp. 614–622. ACM (2008)
Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 90(2), 227–244 (2000)
Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484 (2016)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 652–663 (2017)
Zhang, T., Oles, F.: The value of unlabeled data for classification problems. In: ICML, pp. 1191–1198. Citeseer (2000)
Zhu, X., Lafferty, J., Ghahramani, Z.: Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In: ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, vol. 3 (2003)
Acknowledgments
We thank all reviewers for their thoughtful and constructive comments on this paper. This research is funded by the National Key R&D Program of China (No. 2017YFC0803700), the National Natural Science Foundation of China (No. 61773167), the Shanghai Municipal Commission of Economy and Informatization (No. 170513), and the Open Research Fund of the Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University. The computation was performed at the Supercomputer Center of ECNU.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zheng, B., Lin, X., Xiao, Y., Yang, J., He, L. (2018). An Effective Method for Identifying Unknown Unknowns with Noisy Oracle. In: Cox, M., Funk, P., Begum, S. (eds) Case-Based Reasoning Research and Development. ICCBR 2018. Lecture Notes in Computer Science, vol. 11156. Springer, Cham. https://doi.org/10.1007/978-3-030-01081-2_32
DOI: https://doi.org/10.1007/978-3-030-01081-2_32
Print ISBN: 978-3-030-01080-5
Online ISBN: 978-3-030-01081-2