[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Comprehensible counterfactual explanation on Kolmogorov-Smirnov test

Published: 01 May 2021 Publication History

Abstract

The Kolmogorov-Smirnov (KS) test is popularly used in many applications, such as anomaly detection, astronomy, database security and AI systems. One challenge remained untouched is how we can obtain an explanation on why a test set fails the KS test. In this paper, we tackle the problem of producing counterfactual explanations for test data failing the KS test. Concept-wise, we propose the notion of most comprehensible counterfactual explanations, which accommodates both the KS test data and the user domain knowledge in producing explanations. Computation-wise, we develop an efficient algorithm MOCHE (for <u>MO</u>st <u>C</u>ompre<u>H</u>ensible <u>E</u>xplanation) that avoids enumerating and checking an exponential number of subsets of the test set failing the KS test. MOCHE not only guarantees to produce the most comprehensible counterfactual explanations, but also is orders of magnitudes faster than the baselines. Experiment-wise, we present a systematic empirical study on a series of benchmark real datasets to verify the effectiveness, efficiency and scalability of most comprehensible counterfactual explanations and MOCHE.

References

[1]
2021. Alibi Detect. https://github.com/SeldonIO/alibi-detect. Accessed: 2021-05-11.
[2]
2021. BC COVID-19 Data. http://www.bccdc.ca/health-info/diseases-conditions/covid-19/data. Accessed: 2021-05-11.
[3]
2021. Health Authority Boundaries. https://catalogue.data.gov.bc.ca/dataset/health-authority-boundaries. Accessed: 2021-05-11.
[4]
2021. Matrix Profile. https://matrixprofile.org. Accessed: 2021-05-11.
[5]
2021. Series2Graph. http://helios.mi.parisdescartes.fr/~themisp/series2graph/. Accessed: 2021-05-11.
[6]
2021. Statistics Canada. https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/index.cfm. Accessed: 2021-05-11.
[7]
Charu C. Aggarwal. 2013. Outlier Analysis. Springer.
[8]
Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu. 2004. Order-Preserving Encryption for Numeric Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13--18, 2004, Gerhard Weikum, Arnd Christian König, and Stefan Deßloch (Eds.). ACM, 563--574.
[9]
Arjun R. Akula, Shuai Wang, and Song-Chun Zhu. 2020. CoCoX: Generating Conceptual and Counterfactual Explanations via Fault-Lines. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7--12, 2020. AAAI Press, 2594--2601. https://aaai.org/ojs/index.php/AAAI/article/view/5643
[10]
Fabrizio Angiulli and Clara Pizzuti. 2002. Fast Outlier Detection in High Dimensional Spaces. In Principles of Data Mining and Knowledge Discovery, 6th European Conference, PKDD 2002, Helsinki, Finland, August 19--23, 2002, Proceedings (Lecture Notes in Computer Science), Tapio Elomaa, Heikki Mannila, and Hannu Toivonen (Eds.), Vol. 2431. Springer, 15--26.
[11]
André Artelt and Barbara Hammer. 2020. Efficient computation of counterfactual explanations of LVQ models. In 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2020, Bruges, Belgium, October 2--4, 2020. 19--24. https://www.esann.org/sites/default/files/proceedings/2020/ES2020-55.pdf
[12]
Idir Benouaret, Sihem Amer-Yahia, and Senjuti Basu Roy. 2019. An Efficient Greedy Algorithm for Sequence Recommendation. In Database and Expert Systems Applications - 30th International Conference, DEXA 2019, Linz, Austria, August 26--29, 2019, Proceedings, Part I (Lecture Notes in Computer Science), Sven Hartmann, Josef Küng, Sharma Chakravarthy, Gabriele Anderst-Kotsis, A Min Tjoa, and Ismail Khalil (Eds.), Vol. 11706. Springer, 314--326.
[13]
Paul Boniol and Themis Palpanas. 2020. Series2graph: Graph-based subsequence anomaly detection for time series. Proceedings of the VLDB Endowment 13, 12 (2020), 1821--1834.
[14]
Wieland Brendel, Jonas Rauber, and Matthias Bethge. 2018. Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 -- May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SyZI0GWCZ
[15]
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16--18, 2000, Dallas, Texas, USA, Weidong Chen, Jeffrey F. Naughton, and Philip A. Bernstein (Eds.). ACM, 93--104.
[16]
Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian K. Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner, Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, Jade Leung, Andrew Trask, Emma Bluemke, Jonathan Lebensbold, Cullen O'Keefe, Mark Koren, Theo Ryffel, J. B. Rubinovitz, Tamay Besiroglu, Federica Carugati, Jack Clark, Peter Eckersley, Sarah de Haas, Maritza Johnson, Ben Laurie, Alex Ingerman, Igor Krawczuk, Amanda Askell, Rosario Cammarota, Andrew Lohn, David Krueger, Charlotte Stix, Peter Henderson, Logan Graham, Carina Prunkl, Bianca Martin, Elizabeth Seger, Noa Zilberman, Seán Ó hÉigeartaigh, Frens Kroeger, Girish Sastry, Rebecca Kagan, Adrian Weller, Brian Tse, Elizabeth Barnes, Allan Dafoe, Paul Scharre, Ariel Herbert-Voss, Martijn Rasser, Shagun Sodhani, Carrick Flynn, Thomas Krendl Gilbert, Lisa Dyer, Saif Khan, Yoshua Bengio, and Markus Anderljung. 2020. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. CoRR abs/2004.07213 (2020). arXiv:2004.07213 https://arxiv.org/abs/2004.07213
[17]
Diogo V Carvalho, Eduardo M Pereira, and Jaime S Cardoso. 2019. Machine learning interpretability: A survey on methods and metrics. Electronics 8, 8 (2019), 832.
[18]
Hongsong Chen, Caixia Meng, Zhiguang Shan, Zhongchuan Fu, and Bharat K Bhargava. 2019. A novel Low-rate Denial of Service attack detection approach in ZigBee wireless sensor network by combining Hilbert-Huang Transformation and Trust Evaluation. IEEE Access 7 (2019), 32853--32866.
[19]
Xuefeng Chen, Yifeng Zeng, Gao Cong, Shengchao Qin, Yanping Xiang, and Yuanshun Dai. 2015. On Information Coverage for Location Category Based Point-of-Interest Recommendation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25--30, 2015, Austin, Texas, USA, Blai Bonet and Sven Koenig (Eds.). AAAI Press, 37--43. http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9703
[20]
Minhao Cheng, Thong Le, Pin-Yu Chen, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. 2019. Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. OpenReview.net. https://openreview.net/forum?id=rJlk6iRqKX
[21]
Francesco Croce and Matthias Hein. 2019. Sparse and Imperceivable Adversarial Attacks. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 -- November 2, 2019. IEEE, 4723--4731.
[22]
Zhiguo Ding and Minrui Fei. 2013. An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window. IFAC Proceedings Volumes 46, 20 (2013), 12--17.
[23]
Denis Moreira dos Reis, Peter A. Flach, Stan Matwin, and Gustavo E. A. P. A. Batista. 2016. Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13--17, 2016, Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi (Eds.). ACM, 1545--1554.
[24]
G. Fasano and A. Franceschini. 1987. A multidimensional version of the Kolmogorov-Smirnov test. Monthly Notices of the Royal Astronomical Society 225, 1 (03 1987), 155--170. arXiv:https://academic.oup.com/mnras/article-pdf/225/1/155/18522274/mnras225-0155.pdf
[25]
Ruth C. Fong and Andrea Vedaldi. 2017. Interpretable Explanations of Black Boxes by Meaningful Perturbation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017. IEEE Computer Society, 3449--3457.
[26]
Markus Goldstein and Andreas Dengel. 2012. Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm. KI-2012: Poster and Demo Track (2012), 59--63.
[27]
Xiaoyi Gu, Leman Akoglu, and Alessandro Rinaldo. 2019. Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8--14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 10921--10931. https://proceedings.neurips.cc/paper/2019/hash/805163a0f0f128e473726ccda5f91bac-Abstract.html
[28]
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis. 2008. Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment 1, 1 (2008), 102--114.
[29]
Fabian Keller, Emmanuel Müller, and Klemens Böhm. 2012. HiCS: High Contrast Subspaces for Density-Based Outlier Ranking. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1--5 April, 2012, Anastasios Kementsietsidis and Marcos Antonio Vaz Salles (Eds.). IEEE Computer Society, 1037--1048.
[30]
Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In VLDB, Vol. 4. Toronto, Canada, 180--191.
[31]
Jerome Klotz. 1967. Asymptotic efficiency of the two sample Kolmogorov-Smirnov test. J. Amer. Statist. Assoc. 62, 319 (1967), 932--938.
[32]
Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in neural information processing systems. 4066--4076.
[33]
Ashwin Lall. 2015. Data streaming algorithms for the Kolmogorov-Smirnov test. In 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 -- November 1, 2015. IEEE Computer Society, 95--104.
[34]
Alexander Lavin and Subutai Ahmad. 2015. Evaluating Real-Time Anomaly Detection Algorithms-The Numenta Anomaly Benchmark. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). IEEE, 38--44.
[35]
Aleksandar Lazarevic and Vipin Kumar. 2005. Feature bagging for outlier detection. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. 157--166.
[36]
Thai Le, Suhang Wang, and Dongwon Lee. 2020. GRACE: Generating Concise and Informative Contrastive Sample to Explain Neural Network Model's Prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD '20). Association for Computing Machinery, New York, NY, USA, 238--248.
[37]
Arnaud Van Looveren and Janis Klaise. 2019. Interpretable Counterfactual Explanations Guided by Prototypes. CoRR abs/1907.02584 (2019). arXiv:1907.02584 http://arxiv.org/abs/1907.02584
[38]
Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1--38.
[39]
Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. 2019. Sparsefool: a few pixels make a big difference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9087--9096.
[40]
Christoph Molnar. 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.
[41]
Raha Moraffah, Mansooreh Karami, Ruocheng Guo, Adrienne Raglin, and Huan Liu. 2020. Causal Interpretability for Machine Learning-Problems, Methods and Evaluation. ACM SIGKDD Explorations Newsletter 22, 1 (2020), 18--33.
[42]
Ramaravind K Mothilal, Amit Sharma, and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 607--617.
[43]
Sigurd Kirkevold Næss. 2012. Application of the Kolmogorov-Smirnov test to CMB data: Is the universe really weakly random? Astronomy & Astrophysics 538 (2012), A17.
[44]
Raymond S Nickerson. 1998. Confirmation bias: A ubiquitous phenomenon in many guises. Review of general psychology 2, 2 (1998), 175--220.
[45]
Mila Nikolova. 2013. Description of the Minimizers of Least Squares Regularized with \ell_0-norm. Uniqueness of the Global Minimizer. SIAM Journal on Imaging Sciences 6, 2 (2013), 904--937.
[46]
Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B Gibbons, and Christos Faloutsos. 2003. Loci: Fast outlier detection using the local correlation integral. In Proceedings 19th international conference on data engineering (Cat. No. 03CH37405). IEEE, 315--326.
[47]
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security. 506--519.
[48]
Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. 2016. Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. CoRR abs/1605.07277 (2016). arXiv:1605.07277 http://arxiv.org/abs/1605.07277
[49]
Fábio Pinto, Marco O. P. Sampaio, and Pedro Bizarro. 2019. Automatic Model Monitoring for Data Streams. CoRR abs/1908.04240 (2019). arXiv:1908.04240 http://arxiv.org/abs/1908.04240
[50]
Neoklis Polyzotis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. 2019. Data validation for machine learning. Proceedings of Machine Learning and Systems 1 (2019), 334--347.
[51]
Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. 2019. Failing loudly: An empirical study of methods for detecting dataset shift. In Advances in Neural Information Processing Systems. 1396--1408.
[52]
Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 427--438.
[53]
Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and Qi Zhang. 2019. Time-Series Anomaly Detection Service at Microsoft. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3009--3017.
[54]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1135--1144.
[55]
Ron Rymon. 1992. Search through Systematic Set Enumeration. In Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning (Cambridge, MA) (KR'92). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 539--550.
[56]
Ricardo Jorge Santos, Jorge Bernardino, and Marco Vieira. 2014. Approaches and challenges in database intrusion detection. ACM Sigmod Record 43, 3 (2014), 36--47.
[57]
Sebastian Schelter, Tammo Rukat, and Felix Biessmann. 2020. Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1289--1299.
[58]
Kacper Sokol and Peter A. Flach. 2019. Counterfactual Explanations of Machine Learning Predictions: Opportunities and Challenges for AI Safety. In Workshop on Artificial Intelligence Safety 2019 co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019 (CEUR Workshop Proceedings), Huáscar Espinoza, Seán Ó hÉigeartaigh, Xiaowei Huang, José Hernández-Orallo, and Mauricio Castillo-Effen (Eds.), Vol. 2301. CEUR-WS.org. http://ceur-ws.org/Vol-2301/paper_20.pdf
[59]
S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. 2006. Online Outlier Detection in Sensor Data Using Non-Parametric Models. In Proceedings of the 32nd International Conference on Very Large Data Bases (Seoul, Korea) (VLDB '06). VLDB Endowment, 187--198.
[60]
Sebastian Tschiatschek, Adish Singla, and Andreas Krause. 2017. Selecting Sequences of Items via Submodular Maximization. In AAAI. 2667--2673.
[61]
Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31 (2017), 841.
[62]
Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y Lim. 2019. Designing theory-driven user-centric explainable AI. In Proceedings of the 2019 CHI conference on human factors in computing systems. 1--15.
[63]
David F Williamson, Robert A Parker, and Juliette S Kendrick. 1989. The box plot: a simple visual method to interpret data. Annals of internal medicine 110, 11 (1989), 916--921.
[64]
Min Xie, Laks VS Lakshmanan, and Peter T Wood. 2010. Breaking out of the box of recommendations: from items to packages. In Proceedings of the fourth ACM conference on Recommender systems. 151--158.
[65]
Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, and Eamonn Keogh. 2016. Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In 2016 IEEE 16th international conference on data mining (ICDM). Ieee, 1317--1322.
[66]
Shujian Yu, Xiaoyang Wang, and José C. Príncipe. 2018. Request-and-Reverify: Hierarchical Hypothesis Testing for Concept Drift Detection with Expensive Labels. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13--19, 2018, Stockholm, Sweden, Jérôme Lang (Ed.). ijcai.org, 3033--3039.

Cited By

View all
  • (2025)A survey of multimodal event detection based on data fusionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00878-534:1Online publication date: 1-Jan-2025
  • (2024)Counterfactual Explanation of Shapley Value in Data CoalitionsProceedings of the VLDB Endowment10.14778/3681954.368200417:11(3332-3345)Online publication date: 1-Jul-2024
  • (2024)Counterfactual Explanations and Algorithmic Recourses for Machine Learning: A ReviewACM Computing Surveys10.1145/367711956:12(1-42)Online publication date: 9-Jul-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 14, Issue 9
May 2021
249 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 May 2021
Published in PVLDB Volume 14, Issue 9

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)16
  • Downloads (Last 6 weeks)1
Reflects downloads up to 26 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2025)A survey of multimodal event detection based on data fusionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00878-534:1Online publication date: 1-Jan-2025
  • (2024)Counterfactual Explanation of Shapley Value in Data CoalitionsProceedings of the VLDB Endowment10.14778/3681954.368200417:11(3332-3345)Online publication date: 1-Jul-2024
  • (2024)Counterfactual Explanations and Algorithmic Recourses for Machine Learning: A ReviewACM Computing Surveys10.1145/367711956:12(1-42)Online publication date: 9-Jul-2024

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media