DOI: 10.1145/3342195.3387546

RAIDP: replication with intra-disk parity

Published: 17 April 2020

Abstract

Distributed storage systems often triplicate data to reduce the risk of permanent data loss, thereby tolerating at least two simultaneous disk failures at the price of 2/3 of the capacity. To reduce this price, some systems utilize erasure coding. But this optimization is usually only applied to cold data, because erasure coding might hinder performance for warm data.
We propose RAIDP---a new point in the distributed storage design space between replication and erasure coding. RAIDP maintains only two replicas, rather than three or more. It increases durability by utilizing small disk "add-ons" for storing intra-disk erasure codes that are local to the server but fail independently from the disk. By carefully laying out the data, the add-ons allow RAIDP to recover from simultaneous disk failures (add-ons can be stacked to withstand an arbitrary number of failures). RAIDP retains many of the benefits of replication, trading off some performance and availability for substantially reduced storage requirements, networking overheads, and their related costs. We implement RAIDP in HDFS, which triplicates by default. We show that baseline RAIDP achieves performance close to that of HDFS with only two replicas, and that an update-oriented variant performs within 21% of the default triplicating HDFS, while halving the storage and networking overheads and providing similar durability.
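To make the idea concrete, here is a minimal sketch in Python (a toy illustration under assumed parameters---chunk size, group size, and the xor_parity helper are inventions for brevity, not the paper's actual layout or its HDFS implementation) of how a single XOR parity chunk kept on a per-disk add-on can rebuild a chunk whose two replicas were lost to simultaneous disk failures:

    from functools import reduce

    def xor_parity(chunks):
        # XOR equal-sized chunks byte-wise into a single parity chunk.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

    # Assumption for illustration: one disk holds three chunks, and its
    # add-on stores the XOR parity of those chunks.
    disk = [b"AAAA", b"BBBB", b"CCCC"]
    addon = xor_parity(disk)

    # Two simultaneous failures: this disk dies, and so does the disk
    # holding the second (remote) replica of chunk 1. Chunks 0 and 2 are
    # fetched from their surviving remote replicas; chunk 1 is rebuilt
    # from the add-on parity, which fails independently of its disk.
    from_replicas = [disk[0], disk[2]]
    rebuilt = xor_parity([addon] + from_replicas)
    assert rebuilt == b"BBBB"

Note that a single XOR parity per group can recover at most one chunk that lost both replicas; the "carefully laying out the data" in the abstract is what must guarantee that any pair of failed disks doubly loses at most one chunk per parity group.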


Cited By

  • (2024) Dynamic Adjustment of Disk Redundancy and Scrubbing Strategy With Reinforcement Learning. IEEE Transactions on Industrial Informatics 20(8), 10680-10690. DOI: 10.1109/TII.2024.3397398
  • (2020) Fractional-Overlap Declustered Parity: Evaluating Reliability for Storage Systems. 2020 IEEE/ACM Fifth International Parallel Data Systems Workshop (PDSW), pp. 1-6. DOI: 10.1109/PDSW51947.2020.00009


Published In

EuroSys '20: Proceedings of the Fifteenth European Conference on Computer Systems
April 2020
49 pages
ISBN: 9781450368827
DOI: 10.1145/3342195

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Research-article

Conference

EuroSys '20: Fifteenth EuroSys Conference 2020
April 27-30, 2020
Heraklion, Greece

Acceptance Rates

EuroSys '20 Paper Acceptance Rate: 43 of 234 submissions, 18%
Overall Acceptance Rate: 241 of 1,308 submissions, 18%
