DOI: 10.5555/3358807.3358891
Article

Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy

Published: 10 July 2019

Abstract

Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper, we present an approach for addressing the flash lifetime problem by allowing devices to operate at much higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems.
We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store in production at Facebook that is backed by and supports transactions on top of RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 100× and recovery time by more than 10,000×. DIRECT also allows HDFS to tolerate a 10,000-100,000× higher bit error rate without experiencing application-visible errors. By significantly increasing the availability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.
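As a minimal sketch of the recovery path this end-to-end approach implies (our illustration, not the authors' code): when a local read fails an application-level checksum, the block is re-fetched from peer replicas and verified end to end, instead of failing the whole device or replica. The names local_store, replica.fetch, and stored_checksum are hypothetical.

import zlib

def read_block(local_store, replicas, block_id, stored_checksum):
    """Read a block, recovering from peer replicas on a bit error."""
    data = local_store.read(block_id)
    if zlib.crc32(data) == stored_checksum:
        return data  # common case: the local copy is intact

    # The local copy failed verification. Rather than declaring the whole
    # replica lost, try each peer and accept the first copy whose checksum
    # verifies end to end.
    for replica in replicas:
        candidate = replica.fetch(block_id)
        if candidate is not None and zlib.crc32(candidate) == stored_checksum:
            local_store.write(block_id, candidate)  # repair the local copy
            return candidate

    # No copy verified: surface the error to the application.
    raise IOError(f"block {block_id}: no replica passed verification")

This shows only the shape of the technique; the paper's mechanisms are tailored to RocksDB's block-based tables and HDFS block replicas.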

Cited By

  • Improving the Endurance of Next Generation SSD’s using WOM-v Codes. ACM Transactions on Storage 18(4):1-32, Dec. 2022. DOI: 10.1145/3565027
  • RocksDB: Evolution of Development Priorities in a Key-value Store Serving Large-scale Applications. ACM Transactions on Storage 17(4):1-32, Oct. 2021. DOI: 10.1145/3483840
  • MyRocks. Proceedings of the VLDB Endowment 13(12):3217-3230, Sep. 2020. DOI: 10.14778/3415478.3415546

Published In

USENIX ATC '19: Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference
July 2019
1076 pages
ISBN: 9781939133038

Sponsors

  • VMware
  • Nutanix
  • NSF
  • Facebook
  • Oracle

Publisher

USENIX Association, United States
