research-article

OCEAN: An Optimized HW/SW Reliability Mitigation Approach for Scratchpad Memories in Real-Time SoCs

Published: 01 April 2014

Abstract

Recent advances in process technology give rise to reliability issues that degrade the Quality-of-Service (QoS) required of embedded Systems-on-Chip (SoCs). To maintain the required QoS with acceptable overheads, we propose OCEAN, a novel cross-layer error mitigation approach. OCEAN enforces the reliability of on-chip SRAMs with a fault-tolerant buffer. We use this buffer to protect a portion of the processed data, which is then used to recover from runtime errors. We optimally select the buffer size to minimize the energy overhead, subject to timing and area constraints. OCEAN achieves full error mitigation with a 10.1% average energy overhead compared to baseline operation without any error correction capability, and 65% energy savings compared to a cross-layer error mitigation mechanism.
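
For illustration only, the buffer-sizing step described in the abstract can be pictured as a small constrained search: pick the buffer size with the lowest energy overhead among those that meet a timing bound and an area bound. The sketch below is not OCEAN's actual formulation. The cost model, every coefficient in it, and the names `cost` and `select_buffer_size` are hypothetical placeholders chosen only to make the trade-off visible (fewer checkpoints with a larger buffer, but more area and more work to redo after an error).

```python
import math
from typing import Iterable, NamedTuple, Optional


class Cost(NamedTuple):
    size: int      # buffer size in words
    energy: float  # energy overhead (arbitrary units)
    time: float    # time overhead (cycles)
    area: float    # buffer area (arbitrary units)


def cost(size: int, total_words: int = 4096, errors: float = 2.0) -> Cost:
    """Toy overhead model; every coefficient here is a made-up placeholder."""
    chunks = math.ceil(total_words / size)   # one checkpoint per chunk
    redo = errors * size / 2                 # expected words recomputed after errors
    energy = chunks * 50.0 + total_words * (0.1 + 0.001 * size) + redo * 1.0
    time = chunks * 20.0 + redo * 1.0
    area = size * 1.5
    return Cost(size, energy, time, area)


def select_buffer_size(sizes: Iterable[int],
                       max_time: float,
                       max_area: float) -> Optional[Cost]:
    """Return the feasible candidate with minimum energy, or None if none fits."""
    feasible = [c for c in map(cost, sizes)
                if c.time <= max_time and c.area <= max_area]
    return min(feasible, key=lambda c: c.energy, default=None)


if __name__ == "__main__":
    candidates = [2 ** k for k in range(3, 11)]  # 8 .. 1024 words
    print(select_buffer_size(candidates, max_time=2000.0, max_area=600.0))
```

An exhaustive search is used here because the candidate set is tiny. A continuous formulation would call for a nonlinear programming solver instead, and real energy, timing, and area figures would have to come from measurements or memory models rather than the placeholder constants above.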

      Information & Contributors

      Information

      Published In

      ACM Transactions on Embedded Computing Systems, Volume 13, Issue 4s
      Special Issue on Real-Time and Embedded Technology and Applications, Domain-Specific Multicore Computing, Cross-Layer Dependable Embedded Systems, and Application of Concurrency to System Design (ACSD'13)
      July 2014
      571 pages
      ISSN:1539-9087
      EISSN:1558-3465
      DOI:10.1145/2601432
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 April 2014
      Accepted: 01 November 2013
      Revised: 01 May 2013
      Received: 01 August 2012
      Published in TECS Volume 13, Issue 4s

      Author Tags

      1. Error correction
      2. embedded systems
      3. hybrid mitigation

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • joint research grant for ESL-EPFL by IMEC
      • Seventh Framework Programme
      • Nano-Tera.ch with Swiss Confederation financing
      • BodyPoweredSenSE RTD project (no. 20NA21_143069) evaluated by the Swiss NSF

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 2
      • Downloads (Last 6 weeks): 0
      Reflects downloads up to 06 Jan 2025

      Citations

      Cited By

      • (2023) Learning-Oriented Reliability Improvement of Computing Systems From Transistor to Application Level. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-10. DOI: 10.23919/DATE56975.2023.10137182. Online publication date: Apr-2023
      • (2020) Vulnerability-aware Dynamic Reconfiguration of Partially Protected Caches. 2020 21st International Symposium on Quality Electronic Design (ISQED), pp. 255-260. DOI: 10.1109/ISQED48828.2020.9137050. Online publication date: Mar-2020
      • (2018) Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems. Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, pp. 187-194. DOI: 10.1145/3229631.3239368. Online publication date: 15-Jul-2018
      • (2018) Cost-aware optimal data allocations for multiple dimensional heterogeneous memories using dynamic programming in big data. Journal of Computational Science 26, pp. 402-408. DOI: 10.1016/j.jocs.2016.06.002. Online publication date: May-2018
      • (2017) Low-Cost Memory Fault Tolerance for IoT Devices. ACM Transactions on Embedded Computing Systems 16:5s, pp. 1-25. DOI: 10.1145/3126534. Online publication date: 27-Sep-2017
      • (2017) Fine-Grained Checkpoint Recovery for Application-Specific Instruction-Set Processors. IEEE Transactions on Computers 66:4, pp. 647-660. DOI: 10.1109/TC.2016.2606378. Online publication date: 1-Apr-2017
      • (2017) Will Chips of the Future Learn How to Feel Pain and Cure Themselves? IEEE Design & Test 34:5, pp. 80-87. DOI: 10.1109/MDAT.2017.2730841. Online publication date: Oct-2017
      • (2016) Memories for NTC. Near Threshold Computing, pp. 75-100. DOI: 10.1007/978-3-319-23389-5_5. Online publication date: 2016
      • (2015) In-Scratchpad Memory Replication. ACM Transactions on Design Automation of Electronic Systems 20:4, pp. 1-28. DOI: 10.1145/2770874. Online publication date: 28-Sep-2015
      • (2015) Heterogeneous Error-Resilient Scheme for Spectral Analysis in Ultra-Low Power Wearable Electrocardiogram Devices. 2015 IEEE Computer Society Annual Symposium on VLSI, pp. 268-273. DOI: 10.1109/ISVLSI.2015.46. Online publication date: Jul-2015
