Article

Lightweight, high-resolution monitoring for troubleshooting production systems

Authors:

Abhishek Kumar,

Marc E. Fiuczynski,

Larry Peterson

OSDI'08: Proceedings of the 8th USENIX conference on Operating systems design and implementation

Pages 103 - 116

Published: 08 December 2008

Abstract

Production systems are commonly plagued by intermittent problems that are difficult to diagnose. This paper describes a new diagnostic tool, called Chopstix, that continuously collects profiles of low-level OS events (e.g., scheduling, L2 cache misses, CPU utilization, I/O operations, page allocation, locking) at the granularity of executables, procedures and instructions. Chopstix then reconstructs these events offline for analysis. We have used Chopstix to diagnose several elusive problems in a large-scale production system, thereby reducing these intermittent problems to reproducible bugs that can be debugged using standard techniques. The key to Chopstix is an approximate data collection strategy that incurs very low overhead. An evaluation shows Chopstix requires under 1% of the CPU, under 256KB of RAM, and under 16MB of disk space per day to collect a rich set of system-wide data.
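
The abstract does not spell out the approximate collection strategy. Purely as an illustration of what fixed-memory, approximate event counting can look like, the following minimal sketch uses a count-min sketch. This is a swapped-in, generic technique and is not claimed to be the method Chopstix uses. The class name, the `width` and `depth` values, and the event keys (event type, executable, instruction address) are all hypothetical.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter (illustrative, not Chopstix's code).

    Estimates never fall below the true count; hash collisions can
    only push them higher.
    """

    def __init__(self, width=4096, depth=4):
        # width and depth are hypothetical tuning knobs: memory use is
        # width * depth counters no matter how many distinct keys arrive.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Salt the hash with the row number so each row uses a different
        # hash function over the same key.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        # The smallest of the per-row counters is the least inflated one.
        return min(
            self.rows[row][self._index(row, key)] for row in range(self.depth)
        )


# Hypothetical event stream, keyed by "event type|executable|address".
true_counts = {
    "sched|/usr/bin/httpd|0x4005d0": 1200,
    "l2miss|/usr/bin/httpd|0x400a10": 300,
    "io|/usr/sbin/sshd|0x401f00": 7,
}

sketch = CountMinSketch()
for key, n in true_counts.items():
    for _ in range(n):
        sketch.add(key)

for key, n in true_counts.items():
    print(key, "true:", n, "estimate:", sketch.estimate(key))
```

The design trade-off is that memory stays constant regardless of how many distinct event sources appear, at the cost of occasional overestimates when keys collide in every row.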

Published In

OSDI'08: Proceedings of the 8th USENIX conference on Operating systems design and implementation

December 2008

384 pages

Sponsors

USENIX Association

In-Cooperation

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

USENIX Association

United States

Qualifiers

Article

Cited By

Huang J, Mozafari B, Schoenebeck G, Wenisch T, Chirkova R, Yang J, Suciu D (2017). A Top-Down Approach to Achieving Performance Predictability in Database Systems. Proceedings of the 2017 ACM International Conference on Management of Data, pages 745-758. Online publication date: 9-May-2017.
https://dl.acm.org/doi/10.1145/3035918.3064016

Kandalintsev A, Kliazovich D, Lo Cigno R (2017). Freeze'nSense. Software: Practice & Experience, 47:6, pages 831-847. Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.1002/spe.2456

Zellweger G, Lin D, Roscoe T, Cui H, Lau F, Bansal S, Zhong L (2016). So many performance events, so little time. Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, pages 1-9. Online publication date: 4-Aug-2016.
https://dl.acm.org/doi/10.1145/2967360.2967375

Bovenzi A, Brancati F, Russo S, Bondavalli A (2015). An OS-level Framework for Anomaly Detection in Complex Software Systems. IEEE Transactions on Dependable and Secure Computing, 12:3, pages 366-372. Online publication date: 1-May-2015.
https://dl.acm.org/doi/10.1109/TDSC.2014.2334305

Jeswani D, Natu M, Ghosh R (2015). Adaptive Monitoring. Journal of Network and Systems Management, 23:4, pages 950-977. Online publication date: 1-Oct-2015.
https://dl.acm.org/doi/10.1007/s10922-014-9330-8

Leesatapornwongsa T, Gunawi H, Lazowska E, Terry D, Arpaci-Dusseau R, Gehrke J (2014). The Case for Drill-Ready Cloud Computing. Proceedings of the ACM Symposium on Cloud Computing, pages 1-8. Online publication date: 3-Nov-2014.
https://dl.acm.org/doi/10.1145/2670979.2670992

Wang C, Kavulya S, Tan J, Hu L, Kutare M, Kasick M, Schwan K, Narasimhan P, Gandhi R (2013). Performance troubleshooting in data centers. ACM SIGOPS Operating Systems Review, 47:3, pages 50-62. Online publication date: 26-Nov-2013.
https://dl.acm.org/doi/10.1145/2553070.2553079

Jeswani D, Natu M, Ghosh R, Medhi D (2012). Adaptive monitoring. Proceedings of the 8th International Conference on Network and Service Management, pages 350-356. Online publication date: 22-Oct-2012.
https://dl.acm.org/doi/10.5555/2499406.2499462

Wang C, Rayan I, Eisenhauer G, Schwan K, Talwar V, Wolf M, Huneycutt C (2012). VScope. Proceedings of the 13th International Middleware Conference, pages 121-141. Online publication date: 3-Dec-2012.
https://dl.acm.org/doi/10.5555/2442626.2442635

Marian T, Weatherspoon H, Lee K, Sagar A (2012). Fmeter. Proceedings of the 13th International Middleware Conference, pages 81-100. Online publication date: 3-Dec-2012.
https://dl.acm.org/doi/10.5555/2442626.2442633