DOI: 10.1145/3132402.3132443

Odd-ECC: on-demand DRAM error correcting codes

Published: 02 October 2017

Abstract

An application may be more sensitive to faults in some subsets of its data than in others, so some data regions are more critical than others. Capitalizing on this observation, Odd-ECC provides a mechanism to select, dynamically and on demand, the memory fault tolerance of each allocated page of a program according to the criticality of the data it holds. Odd-ECC error correcting codes (ECCs) are stored in separate physical pages, which the OS hides from the user as unavailable pages. These ECCs are nevertheless physically aligned with the data they protect, so the memory controller can access them efficiently. As a result, the capacity, performance, and energy overheads of memory fault tolerance are proportional to the criticality of the data stored. Odd-ECC is applied to memory systems that use conventional 2D DRAM DIMMs as well as to 3D-stacked DRAMs, and is evaluated using various applications. Compared to flat memory protection schemes, Odd-ECC substantially reduces ECC capacity overheads while achieving the same Mean Time to Failure (MTTF), and it also slightly improves performance and energy costs. Under the same capacity constraints, Odd-ECC achieves substantially higher MTTF than flat memory protection. This comes at a performance and energy cost, which is, however, still a fraction of the cost introduced by an equally strong flat scheme.
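The capacity argument in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not the paper's implementation: the page size, the protection level names (`none`, `secded`, `strong`), and the overhead ratios are hypothetical and chosen for illustration. The 1/8 ratio corresponds to a standard (72,64) SECDED code; the 1/4 ratio for the stronger level is invented.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes per page (assumed)

# Hypothetical protection levels: ECC bytes stored per data byte.
# 1/8 matches a standard (72,64) SECDED code; 1/4 is illustrative only.
ECC_OVERHEAD = {"none": 0.0, "secded": 1 / 8, "strong": 1 / 4}


@dataclass
class Page:
    number: int
    protection: str = "none"  # selected per page, on demand


def ecc_bytes(pages):
    """ECC storage needed when each page carries its own protection level."""
    return sum(int(PAGE_SIZE * ECC_OVERHEAD[p.protection]) for p in pages)


def flat_ecc_bytes(pages, protection):
    """ECC storage needed when every page uses the same protection level."""
    return len(pages) * int(PAGE_SIZE * ECC_OVERHEAD[protection])


# Eight allocated pages: two hold critical data, one holds moderately
# critical data, and the remaining five are left unprotected.
pages = [Page(n) for n in range(8)]
for p in pages[:2]:
    p.protection = "strong"
pages[2].protection = "secded"

on_demand = ecc_bytes(pages)            # 2 * 1024 + 512 = 2560 bytes
flat = flat_ecc_bytes(pages, "strong")  # 8 * 1024 = 8192 bytes
print(on_demand, flat)
```

In this toy setting the per-page scheme spends ECC capacity only where data is marked critical, which is the sense in which the overhead is proportional to criticality. How pages are classified, and how ECC pages are laid out relative to data pages, is left out of the sketch.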


Published In

MEMSYS '17: Proceedings of the International Symposium on Memory Systems
October 2017
409 pages
ISBN: 9781450353359
DOI: 10.1145/3132402
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 3D-stacked memory
  2. DRAM
  3. applications reliability analysis
  4. error correcting codes
  5. main memory reliability

Qualifiers

  • Research-article

Conference

MEMSYS 2017

Cited By

  • (2023) RefineHD: Accurate and Efficient Single-Pass Adaptive Learning Using Hyperdimensional Computing. 2023 IEEE International Conference on Rebooting Computing (ICRC), 1-8. DOI: 10.1109/ICRC60800.2023.10386671. Online publication date: 5-Dec-2023
  • (2023) ESD: An ECC-assisted and Selective Deduplication for Encrypted Non-Volatile Main Memory. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 977-990. DOI: 10.1109/HPCA56546.2023.10071011. Online publication date: Feb-2023
  • (2022) L2C: Combining Lossy and Lossless Compression on Memory and I/O. ACM Transactions on Embedded Computing Systems 21:1, 1-27. DOI: 10.1145/3481641. Online publication date: 14-Jan-2022
  • (2021) OnlineHD: Robust, Efficient, and Single-Pass Online Learning Using Hyperdimensional System. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 56-61. DOI: 10.23919/DATE51398.2021.9474107. Online publication date: 1-Feb-2021
  • (2021) CARE: Coordinated Augmentation for Elastic Resilience on DRAM Errors in Data Centers. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 533-544. DOI: 10.1109/HPCA51647.2021.00052. Online publication date: Feb-2021
  • (2020) Runtime-guided ECC protection using online estimation of memory vulnerability. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.5555/3433701.3433802. Online publication date: 9-Nov-2020
  • (2020) MemSZ. ACM Transactions on Architecture and Code Optimization 17:4, 1-25. DOI: 10.1145/3424668. Online publication date: 10-Nov-2020
  • (2020) Runtime-Guided ECC Protection using Online Estimation of Memory Vulnerability. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1109/SC41405.2020.00080. Online publication date: Nov-2020
  • (2020) A Solution for High Availability Memory Access. Algorithms and Architectures for Parallel Processing, 122-137. DOI: 10.1007/978-3-030-38991-8_9. Online publication date: 22-Jan-2020
  • (2018) Driving into the memory wall. Proceedings of the International Symposium on Memory Systems, 377-386. DOI: 10.1145/3240302.3240322. Online publication date: 1-Oct-2018
