DOI: 10.1145/3472456.3472485
research-article

Coupling Right-Provisioned Cold Storage Data Centers with Deduplication

Published: 05 October 2021

Abstract

Modern cloud-scale cold storage data centers have begun to support right-provisioning of a rack’s resources (power, cooling, etc.), which allows only a small fraction of all hard disks to be active (spinning) at any given time and thereby reduces the cost of ownership. Data deduplication is a traditional approach that splits files into chunks and eliminates duplicate chunks, which can also cut costs for cold storage systems. However, when combined with right-provisioning, classical deduplication may spread a deduplicated file across disks, some of which are not currently active, leading to unacceptable access performance caused by spinning disks up and down.
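The conflict between classical deduplication and right-provisioning can be illustrated with a minimal sketch. It is not the paper's implementation: the fixed-size chunking, the SHA-256 fingerprints, the tiny chunk size, the integer disk ids, and the names `ClassicalDedupStore`, `write`, and `disks_needed` are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny, an assumption for illustration


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by the SHA-256 digest of its content."""
    return hashlib.sha256(chunk).hexdigest()


class ClassicalDedupStore:
    """Hypothetical store that keeps exactly one copy of each chunk."""

    def __init__(self):
        self.chunk_disk = {}  # fingerprint -> disk id holding the only copy
        self.recipes = {}     # file name -> ordered list of fingerprints

    def write(self, name: str, data: bytes, disk: int) -> None:
        recipe = []
        for c in chunks(data):
            fp = fingerprint(c)
            # A duplicate chunk is not stored again; it stays on the
            # disk where it was first written.
            self.chunk_disk.setdefault(fp, disk)
            recipe.append(fp)
        self.recipes[name] = recipe

    def disks_needed(self, name: str) -> set:
        """Disks that must be spinning to read the whole file."""
        return {self.chunk_disk[fp] for fp in self.recipes[name]}


store = ClassicalDedupStore()
store.write("v1", b"AAAABBBBCCCC", disk=0)
store.write("v2", b"AAAABBBBDDDD", disk=1)  # shares two chunks with v1

print(store.disks_needed("v1"))
print(store.disks_needed("v2"))
```

In this toy setup, reading `v2` requires both disk 0 and disk 1, even though `v2` was written to disk 1 only. If disk 0 is spun down at read time, the read has to wait for it to spin up.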
In this paper, we analyze the deduplication ratio under real-world workloads of cloud cold storage and observe that, for most workloads, the deduplication ratio 1) increases quickly over the first few versions of the workload, and 2) increases slowly but steadily over the subsequent versions, forming a long tail. Based on the first observation, we propose an online deduplication scheme that improves the deduplication ratio while providing acceptable read performance; based on the second, we propose an additional offline deduplication scheme that achieves deduplication ratios comparable to those of classical deduplication. We design a cold storage system called DeCold that combines the two deduplication schemes and also improves deduplication efficiency. We prototype DeCold and conduct testbed experiments on real-world datasets including source code, virtual machines, and databases. Evaluations show that DeCold achieves better file access performance than the classical deduplication implementation while maintaining decent deduplication efficiency.
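How a deduplication ratio evolves across successive versions of a workload can be measured along the following lines. This is a sketch under stated assumptions, not the paper's methodology: the ratio is taken here to be logical bytes divided by unique stored bytes (the abstract does not give the exact definition), chunking is fixed-size, and the function name `cumulative_dedup_ratios` and the sample data are hypothetical.

```python
import hashlib


def cumulative_dedup_ratios(versions, chunk_size=4):
    """Return the cumulative deduplication ratio after each version.

    Assumed definition: ratio = logical bytes written / unique bytes stored.
    """
    seen = set()   # fingerprints of chunks already stored
    logical = 0    # total bytes written by all versions so far
    stored = 0     # bytes actually kept after deduplication
    ratios = []
    for data in versions:
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            logical += len(chunk)
            if fp not in seen:
                seen.add(fp)
                stored += len(chunk)
        ratios.append(logical / stored)
    return ratios


# Three toy "versions"; the second changes one chunk, the third changes none.
versions = [b"AAAABBBBCCCC", b"AAAABBBBDDDD", b"AAAABBBBDDDD"]
ratios = cumulative_dedup_ratios(versions)
print(ratios)  # [1.0, 1.5, 2.25]
```

Plotting such a series for a real versioned dataset is one way to look for the two-phase shape described above: a fast initial rise followed by a slow, steady tail.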

Cited By

  • (2023) ERP: An Efficient Rewrite Scheme to Improve the Inline Deduplication Restore Performance in Backup Systems. 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS), 371-378. DOI: 10.1109/ICPADS56603.2022.00055. Online publication date: Jan-2023

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Cold storage
  2. Deduplication
  3. Right-provisioning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

