SSD Failures in Datacenters: What? When? and Why?

Published: 06 June 2016

Abstract

Despite the growing popularity of Solid State Disks (SSDs) in the datacenter, little is known about their reliability characteristics in the field. What little is known is mainly vendor-supplied, and such information does not explain how SSD failures manifest and affect the operation of production systems, which is needed in order to take appropriate remedial measures. Besides actual failure data and the symptoms exhibited by SSDs before failing, a detailed characterization effort requires a wide set of data about the factors influencing SSD failures, from provisioning factors to operational ones. This paper presents an extensive SSD failure characterization by analyzing a wide spectrum of data from over half a million SSDs that span multiple generations and are spread across several datacenters hosting a wide spectrum of workloads, over nearly 3 years. By studying the effect of a diverse set of design, provisioning, and operational factors on failures, along with the symptoms of those failures, our work provides the first comprehensive analysis of the what, when, and why characteristics of SSD failures in production datacenters.

      Published In

      SYSTOR '16: Proceedings of the 9th ACM International on Systems and Storage Conference
      June 2016
      191 pages
      ISBN:9781450343817
      DOI:10.1145/2928275
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Badges

      • Best Student Paper

      Author Tags

      1. characterization
      2. reliability
      3. solid state drives

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Acceptance Rates

      SYSTOR '16 Paper Acceptance Rate 16 of 49 submissions, 33%;
      Overall Acceptance Rate 108 of 323 submissions, 33%
