research-article

Open access

INS: Identifying and Mitigating Performance Interference in Clouds via Interference-Sensitive Paths

Authors:

Sa Wang

SoCC '24: Proceedings of the 2024 ACM Symposium on Cloud Computing

Pages 380 - 397

https://doi.org/10.1145/3698038.3698508

Published: 20 November 2024

Abstract

Identifying and managing performance interference in clouds has long been a critical and challenging task for cloud providers, who keep seeking useful performance indicators from the underlying systems to monitor cloud applications accurately. However, state-of-the-art indicators are either sensitive to only a limited set of applications and kinds of resource contention or are not robust to continually changing production environments. A practical and efficient indicator for production environments is still lacking.

This paper proposes INS, a cloud runtime system that can effectively detect performance fluctuations in online cloud applications and reallocate resources to curb performance interference. INS introduces INSPath, a new performance indicator that describes the degree of performance degradation and pinpoints resource bottlenecks. Our evaluation on nine widely used applications demonstrates that INS can detect SLO violations and identify the resource bottleneck accurately. INS also outperforms the state-of-the-art PARTIES, with more responsive and effective resource tuning and fewer SLO violations.


Index Terms

INS: Identifying and Mitigating Performance Interference in Clouds via Interference-Sensitive Paths
1. Computer systems organization
  1. Real-time systems
    1. Real-time system architecture

Published In

SoCC '24: Proceedings of the 2024 ACM Symposium on Cloud Computing

November 2024

1062 pages

ISBN:9798400712869

DOI:10.1145/3698038

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Natural Science Foundation of China

Conference

SoCC '24

SoCC '24: ACM Symposium on Cloud Computing

November 20 - 22, 2024

Redmond, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%
