DOI: 10.1145/3126908.3126964

Understanding error propagation in deep learning neural network (DNN) accelerators and applications

Published: 12 November 2017

Abstract

Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high performance and energy efficiency. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and in safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and the DNN accelerator's architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. In this paper, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems.
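The soft errors the abstract refers to are single-event upsets that flip individual bits in an accelerator's datapath or memory. As an illustrative sketch (not code from the paper), the snippet below simulates such a fault by flipping one bit of a float32 value, showing why the position of the flipped bit — low-order mantissa versus exponent — determines how badly a DNN activation is perturbed; the `flip_bit` helper is hypothetical.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 representation and return the corrupted value.

    Models a transient single-event upset in an accelerator datapath;
    bit 0 is the mantissa LSB, bits 23-30 are the exponent, bit 31 is the sign.
    """
    # Reinterpret the float32 as a 32-bit integer, XOR the chosen bit,
    # then reinterpret the result back as a float32.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return corrupted

activation = 0.5
# A flip in a low-order mantissa bit barely changes the value,
# while a flip in a high exponent bit changes it by orders of magnitude.
print(flip_bit(activation, 0))   # still very close to 0.5
print(flip_bit(activation, 30))  # enormous value, likely to corrupt the output
```

This magnitude asymmetry is one reason the paper finds resilience depends on data types and values: narrower fixed-point formats have no exponent bits to flip, bounding the worst-case perturbation a single upset can cause.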





Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning
  2. reliability
  3. silent data corruption
  4. soft error

Qualifiers

  • Research-article

Conference

SC '17

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Article Metrics

  • Downloads (Last 12 months)547
  • Downloads (Last 6 weeks)51
Reflects downloads up to 09 Jan 2025


Cited By

View all
  • (2025) Academic course planning recommendation and students' performance prediction multi-modal based on educational data mining techniques. Journal of Computing in Higher Education. DOI: 10.1007/s12528-024-09426-0. Online publication date: 8-Jan-2025.
  • (2025) Exploiting neural networks bit-level redundancy to mitigate the impact of faults at inference. The Journal of Supercomputing 81:1. DOI: 10.1007/s11227-024-06693-7. Online publication date: 1-Jan-2025.
  • (2024) Security Enhancement for Deep Reinforcement Learning-Based Strategy in Energy-Efficient Wireless Sensor Networks. Sensors 24:6 (1993). DOI: 10.3390/s24061993. Online publication date: 21-Mar-2024.
  • (2024) ALPRI-FI: A Framework for Early Assessment of Hardware Fault Resiliency of DNN Accelerators. Electronics 13:16 (3243). DOI: 10.3390/electronics13163243. Online publication date: 15-Aug-2024.
  • (2024) ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection. ACM Transactions on Architecture and Code Optimization 21:3 (1-26). DOI: 10.1145/3674909. Online publication date: 28-Jun-2024.
  • (2024) FKeras: A Sensitivity Analysis Tool for Edge Neural Networks. ACM Journal on Autonomous Transportation Systems 1:3 (1-27). DOI: 10.1145/3665334. Online publication date: 18-May-2024.
  • (2024) Soft Error Resilience Analysis of LSTM Networks. Proceedings of the Great Lakes Symposium on VLSI 2024 (328-332). DOI: 10.1145/3649476.3658776. Online publication date: 12-Jun-2024.
  • (2024) Maintaining Sanity: Algorithm-based Comprehensive Fault Tolerance for CNNs. Proceedings of the 61st ACM/IEEE Design Automation Conference (1-6). DOI: 10.1145/3649329.3657355. Online publication date: 23-Jun-2024.
  • (2024) HTAG-eNN: Hardening Technique with AND Gates for Embedded Neural Networks. Proceedings of the 61st ACM/IEEE Design Automation Conference (1-6). DOI: 10.1145/3649329.3657329. Online publication date: 23-Jun-2024.
  • (2024) Artificial Intelligence for Safety-Critical Systems in Industrial and Transportation Domains: A Survey. ACM Computing Surveys 56:7 (1-40). DOI: 10.1145/3626314. Online publication date: 9-Apr-2024.
