Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
- Jinxi Kuang,
- Jinyang Liu,
- Junjie Huang,
- Renyi Zhong,
- Jiazhen Gu,
- Lan Yu,
- Rui Tan,
- Zengyin Yang,
- Michael R. Lyu
Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual ...
Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can lead to delays in incident detection and significant negative customer impact. The current process of monitor creation is ad-...
FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems
- Junjie Huang,
- Jinyang Liu,
- Zhuangbin Chen,
- Zhihan Jiang,
- Yichen Li,
- Jiazhen Gu,
- Cong Feng,
- Zengyin Yang,
- Yongqiang Yang,
- Michael R. Lyu
Postmortem analysis is essential in the management of incidents within cloud systems, which provides valuable insights to improve system's reliability and robustness. At CloudA, fault pattern profiling is performed during the postmortem phase, which ...
What Do You Mean by Memory? When Engineers Are Lost in the Maze of Complexity
An accepted practice to decrease applications' memory usage is to reduce the amount and frequency of memory allocations. Factors such as (a) the prevalence of out-of-memory (OOM) killers, (b) memory allocations in modern programming languages done ...