[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

MyRocks: LSM-tree database storage engine serving Facebook's social graph

Published: 01 August 2020 Publication History

Abstract

Facebook uses MySQL to manage tens of petabytes of data in its main database named the User Database (UDB). UDB serves social activities such as likes, comments, and shares. In the past, Facebook used InnoDB, a B+Tree based storage engine as the backend. The challenge was to find an index structure using less space and write amplification [1]. LSM-tree [2] has the potential to greatly improve these two bottlenecks. RocksDB, an LSM tree-based key/value store was already widely used in variety of applications but had a very low-level key-value interface. To overcome these limitations, MyRocks, a new MySQL storage engine, was built on top of RocksDB by adding relational capabilities. With MyRocks, using the RocksDB API, significant efficiency gains were achieved while still benefiting from all the MySQL features and tools. The transition was mostly transparent to client applications.
Facebook completed the UDB migration from InnoDB to MyRocks in 2017. Since then, ongoing improvements in production operations, and additional enhancements to MySQL, MyRocks, and RocksDB, provided even greater efficiency wins. MyRocks also reduced the instance size by 62.3% for UDB data sets and performed fewer I/O operations than InnoDB. Finally, MyRocks consumed less CPU time for serving the same production traffic workload. These gains enabled us to reduce the number of database servers in UDB to less than half, saving significant resources. In this paper, we describe our journey to build and run an OLTP LSM-tree SQL database at scale. We also discuss the features we implemented to keep pace with UDB workloads, what made migrations easier, and what operational and software development challenges we faced during the two years of running MyRocks in production.
Among the new features we introduced in RocksDB were transactional support, bulk loading, and prefix bloom filters, all are available for the benefit of all RocksDB users.

References

[1]
M. Athanassoulis, M. S. Kester, L. M. Maas, R. I. Stoica, S. Idreos, A. Ailamaki, and M. Callaghan. Designing Access Methods: The RUM Conjecture. In Proceedings of the International Conference on Extending Database Technology (EDBT) Conference, 2016
[2]
Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (June 1996), 351--385.
[3]
Venkateshwaran Venkataramani, Zach Amsden, Nathan Bronson, George Cabrera III, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Jeremy Hoon, Sachin Kulkarni, Nathan Lawrence, Mark Marchukov, Dmitri Petrov, and Lovro Puzar. 2012. TAO: how facebook serves the social graph. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). Association for Computing Machinery, New York, NY, USA, 791--792.
[4]
Facebook's MySQL extensions. https://github.com/facebook/mysql-5.6
[5]
Data centers year in review. Facebook Engineering. https://engineering.fb.com/data-center-engineering/data-centers-2018/.
[6]
Sharma, Y., Ajoux, P., Ang, P., Callies, D., Choudhary, A., Demailly, L., Fersch, T., Guz, L.A., Kotulski, A., Kulkarni, S. and Kumar, S., 2015. Wormhole: Reliable pub-sub to support geo-replicated internet services. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15) (pp. 351--366).
[7]
Flashcache https://www.facebook.com/notes/mysql-at-facebook/releasing-flashcache/388112370932/
[8]
MySQL Glossary for Covering Index https://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_covering_index
[9]
RocksDB. https://github.com/facebook/rocksdb
[10]
Amy Tai, Andrew Kryczka, Shobhit O. Kanaujia, Kyle Jamieson, Michael J. Freedman, and Asaf Cidon. 2019. Who's afraid of uncorrectable bit errors? online recovery of flash errors with distributed redundancy. In Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '19). USENIX Association, USA, 977--991.
[11]
Guoqiang Jerry Chen, Janet L. Wiener, Shridhar Iyer, Anshul Jaiswal, Ran Lei, Nikhil Simha, Wei Wang, Kevin Wilfong, Tim Williamson, and Serhat Yilmaz. 2016. Realtime Data Processing at Facebook. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 1087--1098.
[12]
Arun Sharma. Dragon: A distributed graph query engine. https://engineering.fb.com/data-infrastructure/dragon-a-distributed-graph-query-engine/
[13]
Ghemawat, S. and Dean, J., 2011. LevelDB. https://github.com/google/leveldb
[14]
S. Dong, M. Callaghan, L. Galanis, D. Borthakur, T. Savor, and M. Strumm. Optimizing space amplification in RocksDB. In CIDR, volume 3, page 3, 2017.
[15]
Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. 2013. LinkBench: a database benchmark based on the Facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). Association for Computing Machinery, New York, NY, USA, 1185--1196.
[16]
Tasha Frankie, Gordon Hughes, and Ken Kreutz-Delgado. 2012. A mathematical model of the trim command in NAND-flash SSDs. In Proceedings of the 50th Annual Southeast Regional Conference (ACM-SE '12). Association for Computing Machinery, New York, NY, USA, 59--64.
[17]
MySQL InnoDB Undo Logs https://dev.mysql.com/doc/refman/5.6/en/innodb-undo-logs.html
[18]
George, Lars. HBase: the definitive guide: random access to your planet-size data. "O'Reilly Media, Inc.", 2011.
[19]
Tyler Harter, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. Analysis of HDFS under HBase: a facebook messages case study. In Proceedings of the 12th USENIX conference on File and Storage Technologies (FAST'14). USENIX Association, USA, 199--212.
[20]
Xiang Li, Thomas Georgiou. Migrating Messenger storage to optimize performance https://engineering.fb.com/core-data/migrating-messenger-storage-to-optimize-performance/
[21]
Evans, J. 2006, A Scalable Concurrent malloc(3) Implementation for FreeBSD
[22]
Stonebraker, M. 1981. Operating System Support for Database Management. Communications of the ACM 24(7): 412--418
[23]
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2008. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.
[24]
Lakshman, A. and Malik, P., 2010. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2), pp.35--40.
[25]
Bacon, D.F., Bales, N., Bruno, N., Cooper, B.F., Dickinson, A., Fikes, A., Fraser, C., Gubarev, A., Joshi, M., Kogan, E. and Lloyd, A., 2017, May. Spanner: Becoming a SQL system. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 331--343).
[26]
Taft, R., Sharif, I., Matei, A., VanBenschoten, N., Lewis, J., Grieger, T., Niemi, K., Woods, A., Birzin, A., Poss, R. and Bardea, P., 2020, June. CockroachDB: The Resilient Geo-Distributed SQL Database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (pp. 1493--1509).
[27]
Yugabyte, Inc. The Leading High-Performance Distributed SQL Database. https://www.yugabyte.com/. Accessed: 2020-02-09.
[28]
PingCAP. Tackling MySQL Scalability with TiDB: the most actively developed open source NewSQL database on GitHub. https://pingcap.com/. Accessed: 2020-02-09.
[29]
Verbitski, A., Gupta, A., Saha, D., Brahmadesam, M., Gupta, K., Mittal, R., Krishnamurthy, S., Maurice, S., Kharatishvili, T. and Bao, X., 2017, May. Amazon aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 1041--1052).
[30]
I. Tokutek, "TokuDB: MySQL performance, MariaDB performance," http://www.tokutek.com/products/tokudb-for-mysql/, 2013.
[31]
Feifei Li. Cloud-Native Database Systems at Alibaba: Opportunities and Challenges. PVLDB, 12(12): 2263 -- 2272, 2019.
[32]
Elmore, A.J., Das, S., Agrawal, D. and El Abbadi, A., 2011, June. Zephyr: live migration in shared nothing databases for elastic cloud platforms. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (pp. 301--312).
[33]
Netflix Technology Blog. Netflix Billing Migration to AWS --- Part III. https://netflixtechblog.com/netflix-billing-migration-to-aws-part-iii-7d94ab9d1f59
[34]
Migrating from AWS RDS MySQL to AWS Aurora Serverless MySQL Database. https://www.adelatech.com/migrating-from-aws-rds-mysql-to-aws-aurora-serverless-mysql-database/
[35]
Dayan, N., Athanassoulis, M. and Idreos, S., 2017, May. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 79--94).
[36]
Zhang, Y., Li, Y., Guo, F., Li, C. and Xu, Y., 2018. ElasticBF: Fine-grained and Elastic Bloom Filter Towards Efficient Read for LSM-tree-based {KV} Stores. In 10th {USENIX} Workshop on Hot Topics in Storage and File Systems (HotStorage 18).
[37]
Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2018. SuRF: Practical Range Query Filtering with Fast Succinct Tries. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 323--336.

Cited By

View all
  • (2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
  • (2024)KVBench: A Key-Value Benchmarking SuiteProceedings of the Tenth International Workshop on Testing Database Systems10.1145/3662165.3662765(9-15)Online publication date: 9-Jun-2024
  • (2024)Index Shipping for Efficient Replication in LSM Key-Value Stores with Hybrid KV PlacementACM Transactions on Storage10.1145/365867220:3(1-23)Online publication date: 16-Apr-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 13, Issue 12
August 2020
1710 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 August 2020
Published in PVLDB Volume 13, Issue 12

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)226
  • Downloads (Last 6 weeks)13
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
  • (2024)KVBench: A Key-Value Benchmarking SuiteProceedings of the Tenth International Workshop on Testing Database Systems10.1145/3662165.3662765(9-15)Online publication date: 9-Jun-2024
  • (2024)Index Shipping for Efficient Replication in LSM Key-Value Stores with Hybrid KV PlacementACM Transactions on Storage10.1145/365867220:3(1-23)Online publication date: 16-Apr-2024
  • (2024)D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated StorageACM Transactions on Architecture and Code Optimization10.1145/365658421:3(1-22)Online publication date: 9-Apr-2024
  • (2024)Grafite: Taming Adversarial Queries with Optimal Range FiltersProceedings of the ACM on Management of Data10.1145/36392582:1(1-23)Online publication date: 26-Mar-2024
  • (2024)Wormhole Filters: Caching Your Hash on Persistent MemoryProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3629590(456-471)Online publication date: 22-Apr-2024
  • (2024)TimeCloth: Fast Point-in-Time Database Recovery in The CloudCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3653382(214-226)Online publication date: 9-Jun-2024
  • (2024)Storage Management with Multi-Version Partitioned BTreesInformation Systems10.1016/j.is.2024.102403125:COnline publication date: 1-Nov-2024
  • (2024)Flutist: Parallelizing Transaction Processing for LSM-Tree-Based Relational DatabaseWeb and Big Data10.1007/978-981-97-7238-4_29(460-476)Online publication date: 31-Aug-2024
  • (2023)ADOCProceedings of the 21st USENIX Conference on File and Storage Technologies10.5555/3585938.3585943(65-80)Online publication date: 21-Feb-2023
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media