DOI: 10.1145/3698038.3698508
Research article, open access

INS: Identifying and Mitigating Performance Interference in Clouds via Interference-Sensitive Paths

Published: 20 November 2024

Abstract

Identifying and managing performance interference in clouds has long been a critical and challenging task for cloud providers, who keep seeking useful performance indicators from the underlying systems to monitor cloud applications accurately. However, state-of-the-art indicators are either sensitive only to a limited range of applications and resource contention, or are not robust to continually changing production environments. A practical and efficient indicator for production environments is still lacking.
This paper proposes INS, a cloud runtime system that effectively detects performance fluctuations in online cloud applications and reallocates resources to curb performance interference. It introduces INSPath, a new performance indicator that describes the degree of performance degradation and pinpoints resource bottlenecks. Our evaluation on nine widely used applications demonstrates that INS detects SLO violations and identifies the resource bottleneck accurately. INS also outperforms the state-of-the-art PARTIES, with more responsive and effective resource tuning and fewer SLO violations.

    Information

    Published In

    SoCC '24: Proceedings of the 2024 ACM Symposium on Cloud Computing
    November 2024
    1062 pages
    ISBN:9798400712869
    DOI:10.1145/3698038
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Cloud Computing
    2. Datacenters
    3. Performance Indicator
    4. Resource Management

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '24: ACM Symposium on Cloud Computing
    November 20 - 22, 2024
    Redmond, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%
