DOI: 10.1145/3503222.3507727
research-article

IOCost: block IO control for containers in datacenters

Published: 22 February 2022

Abstract

Resource isolation is a fundamental requirement in datacenter environments. However, our production experience in Meta’s large-scale datacenters shows that existing IO control mechanisms for block storage are inadequate in containerized environments. IO control needs to provide proportional resources to containers while taking into account the hardware heterogeneity of storage devices and the idiosyncrasies of the workloads deployed in datacenters. The speed of modern SSDs requires IO control to execute with low overhead. Furthermore, IO control should strive for work conservation, take into account the interactions with the memory management subsystem, and avoid priority inversions that lead to isolation failures. To address these challenges, this paper presents IOCost, an IO control solution that is designed for containerized environments and provides scalable, work-conserving, and low-overhead IO control for heterogeneous storage devices and diverse workloads in datacenters. IOCost performs offline profiling to build a device model and uses it to estimate the device occupancy of each IO request. To minimize runtime overhead, it separates IO control into a fast per-IO issue path and a slower periodic planning path. A novel work-conserving budget donation algorithm enables containers to dynamically share unused budget. We have deployed IOCost across the entirety of Meta’s datacenters, comprising millions of machines, upstreamed IOCost to the Linux kernel, and open-sourced our device-profiling tools. IOCost has been running in production for two years, providing IO control for Meta’s fleet. We describe the design of IOCost and share our experience deploying it at scale.
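
As a rough illustration of two ideas in the abstract, the sketch below estimates per-request device occupancy from a linear device model and then lets containers that do not need their full share leave the surplus to the others. It is not taken from the paper: the class `LinearDeviceModel`, the function `share_budget`, every parameter name, and all numbers are assumptions chosen for illustration, and the budget step is a plain weighted water-filling stand-in rather than IOCost's actual donation algorithm.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LinearDeviceModel:
    """Hypothetical device model: occupancy = per-IO base + per-KiB cost * size.

    In practice the four coefficients would come from offline profiling;
    here they are made-up inputs.
    """
    read_base_us: float
    write_base_us: float
    read_us_per_kib: float
    write_us_per_kib: float

    def occupancy_us(self, is_write: bool, size_bytes: int) -> float:
        """Estimate how long one request occupies the device, in microseconds."""
        base = self.write_base_us if is_write else self.read_base_us
        per_kib = self.write_us_per_kib if is_write else self.read_us_per_kib
        return base + per_kib * (size_bytes / 1024)


def share_budget(weights: Dict[str, float],
                 demand: Dict[str, float],
                 total: float) -> Dict[str, float]:
    """Split `total` budget by weight; unused shares go to the remaining containers.

    Assumes every weight is positive. Containers whose demand fits within
    their weighted share are capped at their demand, and the surplus is
    re-split among the containers that still want more.
    """
    alloc = {name: 0.0 for name in weights}
    active = set(weights)
    remaining = total
    while active and remaining > 1e-9:
        weight_sum = sum(weights[n] for n in active)
        # Containers whose demand fits inside their weighted share this round.
        satisfied = {n for n in active
                     if demand[n] <= remaining * weights[n] / weight_sum}
        if not satisfied:
            # Everyone left wants more than their share: split by weight and stop.
            for n in active:
                alloc[n] = remaining * weights[n] / weight_sum
            break
        for n in satisfied:
            alloc[n] = demand[n]
            remaining -= demand[n]
        active -= satisfied
    return alloc


if __name__ == "__main__":
    model = LinearDeviceModel(read_base_us=50.0, write_base_us=100.0,
                              read_us_per_kib=2.0, write_us_per_kib=5.0)
    print(model.occupancy_us(is_write=False, size_bytes=4096))
    print(share_budget({"a": 2.0, "b": 1.0, "c": 1.0},
                       {"a": 10.0, "b": 100.0, "c": 100.0},
                       total=100.0))
```

In the example, container `a` holds half the weight but needs only 10 units, so the remaining 90 units are split evenly between `b` and `c`. A real controller would run a cheap check like `occupancy_us` on every IO and recompute shares only periodically, which mirrors the issue-path and planning-path split the abstract describes.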



Published In

ASPLOS '22: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
February 2022
1164 pages
ISBN:9781450392051
DOI:10.1145/3503222
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Containers
  2. Datacenters
  3. I/O
  4. Operating Systems

Qualifiers

  • Research-article

Conference

ASPLOS '22

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Cited By

  • (2025) Meta’s Hyperscale Infrastructure: Overview and Insights. Communications of the ACM 68:2 (52-63). DOI: 10.1145/3701296. Online publication date: 22-Jan-2025
  • (2024) Symbiosis. Proceedings of the 22nd USENIX Conference on File and Storage Technologies (51-70). DOI: 10.5555/3650697.3650701. Online publication date: 27-Feb-2024
  • (2024) INS: Identifying and Mitigating Performance Interference in Clouds via Interference-Sensitive Paths. Proceedings of the ACM Symposium on Cloud Computing (380-397). DOI: 10.1145/3698038.3698508. Online publication date: 20-Nov-2024
  • (2024) zQoS: Unleashing full performance capabilities of NVMe SSDs while enforcing SLOs in distributed storage systems. Proceedings of the 53rd International Conference on Parallel Processing (618-628). DOI: 10.1145/3673038.3673156. Online publication date: 12-Aug-2024
  • (2024) Locks as a Resource: Fairly Scheduling Lock Occupation with CFL. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (17-29). DOI: 10.1145/3627535.3638477. Online publication date: 2-Mar-2024
  • (2024) BypassD: Enabling fast userspace access to shared SSDs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (35-51). DOI: 10.1145/3617232.3624854. Online publication date: 27-Apr-2024
  • (2023) QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs. ACM Transactions on Architecture and Code Optimization 21:1 (1-25). DOI: 10.1145/3632955. Online publication date: 14-Nov-2023
  • (2023) Filesystem Fragmentation on Modern Storage Systems. ACM Transactions on Computer Systems 41:1-4 (1-27). DOI: 10.1145/3611386. Online publication date: 18-Dec-2023
  • (2023) Disaggregated RAID Storage in Modern Datacenters. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (147-163). DOI: 10.1145/3582016.3582027. Online publication date: 25-Mar-2023
  • (2023) A Survey on File Defragmentation Techniques on Modern Storage Systems. 2023 14th International Conference on Information and Communication Technology Convergence (ICTC) (785-787). DOI: 10.1109/ICTC58733.2023.10393099. Online publication date: 11-Oct-2023
