DOI: 10.1145/3126908.3126964

Understanding error propagation in deep learning neural network (DNN) accelerators and applications

Published: 12 November 2017

Abstract

Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high performance and energy efficiency. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and in safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and the DNN accelerator's architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. In this paper, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems.
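The soft errors the abstract refers to are single-event upsets that flip individual bits in an accelerator's datapath or memory. As an illustrative sketch (not code from the paper), the snippet below simulates such a fault by flipping one bit of a float32 value, showing why the position of the flipped bit — low-order mantissa versus exponent — determines how badly a DNN activation is perturbed; the `flip_bit` helper is hypothetical.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 representation and return the corrupted value.

    Models a transient single-event upset in an accelerator datapath;
    bit 0 is the mantissa LSB, bits 23-30 are the exponent, bit 31 is the sign.
    """
    # Reinterpret the float32 as a 32-bit integer, XOR the chosen bit,
    # then reinterpret the result back as a float32.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return corrupted

activation = 0.5
# A flip in a low-order mantissa bit barely changes the value,
# while a flip in a high exponent bit changes it by orders of magnitude.
print(flip_bit(activation, 0))   # still very close to 0.5
print(flip_bit(activation, 30))  # enormous value, likely to corrupt the output
```

This magnitude asymmetry is one reason the paper finds resilience depends on data types and values: narrower fixed-point formats have no exponent bits to flip, bounding the worst-case perturbation a single upset can cause.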





Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning
  2. reliability
  3. silent data corruption
  4. soft error

Qualifiers

  • Research-article

Conference

SC '17

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Article Metrics

  • Downloads (Last 12 months)547
  • Downloads (Last 6 weeks)51
Reflects downloads up to 09 Jan 2025


Cited By

View all
  • (2025) Academic course planning recommendation and students' performance prediction multi-modal based on educational data mining techniques. Journal of Computing in Higher Education. DOI: 10.1007/s12528-024-09426-0. Online publication date: 8-Jan-2025.
  • (2025) Exploiting neural networks bit-level redundancy to mitigate the impact of faults at inference. The Journal of Supercomputing 81:1. DOI: 10.1007/s11227-024-06693-7. Online publication date: 1-Jan-2025.
  • (2024) Security Enhancement for Deep Reinforcement Learning-Based Strategy in Energy-Efficient Wireless Sensor Networks. Sensors 24:6 (1993). DOI: 10.3390/s24061993. Online publication date: 21-Mar-2024.
  • (2024) ALPRI-FI: A Framework for Early Assessment of Hardware Fault Resiliency of DNN Accelerators. Electronics 13:16 (3243). DOI: 10.3390/electronics13163243. Online publication date: 15-Aug-2024.
  • (2024) ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection. ACM Transactions on Architecture and Code Optimization 21:3 (1-26). DOI: 10.1145/3674909. Online publication date: 28-Jun-2024.
  • (2024) FKeras: A Sensitivity Analysis Tool for Edge Neural Networks. ACM Journal on Autonomous Transportation Systems 1:3 (1-27). DOI: 10.1145/3665334. Online publication date: 18-May-2024.
  • (2024) Soft Error Resilience Analysis of LSTM Networks. Proceedings of the Great Lakes Symposium on VLSI 2024 (328-332). DOI: 10.1145/3649476.3658776. Online publication date: 12-Jun-2024.
  • (2024) Maintaining Sanity: Algorithm-based Comprehensive Fault Tolerance for CNNs. Proceedings of the 61st ACM/IEEE Design Automation Conference (1-6). DOI: 10.1145/3649329.3657355. Online publication date: 23-Jun-2024.
  • (2024) HTAG-eNN: Hardening Technique with AND Gates for Embedded Neural Networks. Proceedings of the 61st ACM/IEEE Design Automation Conference (1-6). DOI: 10.1145/3649329.3657329. Online publication date: 23-Jun-2024.
  • (2024) Artificial Intelligence for Safety-Critical Systems in Industrial and Transportation Domains: A Survey. ACM Computing Surveys 56:7 (1-40). DOI: 10.1145/3626314. Online publication date: 9-Apr-2024.
