Black-box error diagnosis in Deep Neural Networks for computer vision: a survey of tools

Piero Fraternali,
Federico Milani (ORCID: orcid.org/0000-0003-2700-2228),
Rocio Nahime Torres &
Niccolò Zangrando

Abstract

The application of Deep Neural Networks (DNNs) to a broad variety of tasks demands methods for coping with the complex and opaque nature of these architectures. When a gold standard is available, performance assessment treats the DNN as a black box and computes standard metrics based on the comparison of the predictions with the ground truth. A deeper understanding of performance requires going beyond such evaluation metrics to diagnose the model behavior and the prediction errors. This goal can be pursued in two complementary ways. On the one hand, model interpretation techniques “open the box” and assess the relationship between the input, the inner layers and the output, so as to identify the architecture modules most likely to cause the performance loss. On the other hand, black-box error diagnosis techniques study the correlation between the model response and some properties of the input not used for training, so as to identify the features of the inputs that make the model fail. Both approaches give hints on how to improve the architecture and/or the training process. This paper focuses on the application of DNNs to computer vision (CV) tasks and presents a survey of the tools that support the black-box performance diagnosis paradigm. It illustrates the features and gaps of the current proposals, discusses the relevant research directions and provides a brief overview of the diagnosis tools in sectors other than CV.
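
The black-box diagnosis idea described in the abstract can be illustrated with a minimal sketch: a standard metric is first computed over the whole test set, then broken down by an input property that was not used for training. This is not taken from any of the surveyed tools; the function name `accuracy_by_property`, the `lighting` annotation and the toy labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_property(y_true, y_pred, meta):
    """Compute accuracy separately for each value of an input property.

    `meta` holds one annotation per sample (hypothetical example: lighting
    condition). The model is treated as a black box: only its predictions
    and the ground truth are used.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, value in zip(y_true, y_pred, meta):
        totals[value] += 1
        hits[value] += int(truth == pred)
    return {value: hits[value] / totals[value] for value in totals}


# Toy data, invented for this sketch.
y_true = ["cat", "dog", "cat", "dog", "cat", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "dog", "cat"]
lighting = ["day", "day", "night", "day", "night", "night"]

# Standard metric over the whole test set.
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Same metric, split by the property that was not used for training.
per_group = accuracy_by_property(y_true, y_pred, lighting)

print(overall)
print(per_group)
```

In this toy case the overall accuracy hides the fact that every error falls on the "night" samples, which is the kind of input feature a black-box diagnosis aims to surface.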

Data availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

The link to the code repository is navigable in the online version of the paper.
https://blackboxnlp.github.io/.
https://www.eclipse.org.
https://www.jetbrains.com/.

Abbreviations

AD: Action detection
AI: Artificial intelligence
AP: Average precision
AUC: Area under the curve
CAM: Class Activation Map
CL: Classification
CV: Computer vision
DNN: Deep Neural Network
ET: Error type
FN: False negative
FP: False positive
GT: Ground truth
IoU: Intersection over union
IS: Instance segmentation
MAE: Mean absolute error
mAP: Mean average precision
ME: Mean error
ML: Machine learning
MSE: Mean squared error
NAB: Numenta anomaly benchmark
NLP: Natural language processing
OD: Object detection
OT: Object tracking
PE: Pose estimation
PR: Precision–recall
RMSE: Root mean squared error
ROC: Receiver operating characteristic
RS: Recommender systems
SS: Semantic segmentation
TN: True negative
TP: True positive
TS: Time series
VRD: Video relation detection

Acknowledgements

This work is partially supported by the project “PRECEPT - A novel decentralized edge-enabled PREsCriptivE and ProacTive framework for increased energy efficiency and well-being in residential buildings” funded by the EU H2020 Programme, Grant Agreement No. 958284.

Author information

Piero Fraternali, Federico Milani, Rocio Nahime Torres and Niccolò Zangrando contributed equally to this work.

Authors and Affiliations

Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, Via Giuseppe Ponzio, 34, 20133, Milan, MI, Italy
Piero Fraternali, Federico Milani, Rocio Nahime Torres & Niccolò Zangrando

Corresponding author

Correspondence to Federico Milani.

Ethics declarations

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fraternali, P., Milani, F., Torres, R.N. et al. Black-box error diagnosis in Deep Neural Networks for computer vision: a survey of tools. Neural Comput & Applic 35, 3041–3062 (2023). https://doi.org/10.1007/s00521-022-08100-9

Received: 12 June 2022
Accepted: 22 November 2022
Published: 10 December 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s00521-022-08100-9
