DOI: 10.5555/1855741.1855749

Lightweight, high-resolution monitoring for troubleshooting production systems

Published: 08 December 2008

Abstract

Production systems are commonly plagued by intermittent problems that are difficult to diagnose. This paper describes a new diagnostic tool, called Chopstix, that continuously collects profiles of low-level OS events (e.g., scheduling, L2 cache misses, CPU utilization, I/O operations, page allocation, locking) at the granularity of executables, procedures and instructions. Chopstix then reconstructs these events offline for analysis. We have used Chopstix to diagnose several elusive problems in a large-scale production system, thereby reducing these intermittent problems to reproducible bugs that can be debugged using standard techniques. The key to Chopstix is an approximate data collection strategy that incurs very low overhead. An evaluation shows Chopstix requires under 1% of the CPU, under 256KB of RAM, and under 16MB of disk space per day to collect a rich set of system-wide data.

Published In

OSDI'08: Proceedings of the 8th USENIX conference on Operating systems design and implementation
December 2008
384 pages

Sponsors

  • USENIX Assoc

Publisher

USENIX Association

United States

Cited By

  • (2017) A Top-Down Approach to Achieving Performance Predictability in Database Systems. Proceedings of the 2017 ACM International Conference on Management of Data, pages 745-758. DOI: 10.1145/3035918.3064016. Online publication date: 9-May-2017.
  • (2017) Freeze'nSense. Software—Practice & Experience 47:6, pages 831-847. DOI: 10.1002/spe.2456. Online publication date: 1-Jun-2017.
  • (2016) So many performance events, so little time. Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, pages 1-9. DOI: 10.1145/2967360.2967375. Online publication date: 4-Aug-2016.
  • (2015) An OS-level Framework for Anomaly Detection in Complex Software Systems. IEEE Transactions on Dependable and Secure Computing 12:3, pages 366-372. DOI: 10.1109/TDSC.2014.2334305. Online publication date: 1-May-2015.
  • (2015) Adaptive Monitoring. Journal of Network and Systems Management 23:4, pages 950-977. DOI: 10.1007/s10922-014-9330-8. Online publication date: 1-Oct-2015.
  • (2014) The Case for Drill-Ready Cloud Computing. Proceedings of the ACM Symposium on Cloud Computing, pages 1-8. DOI: 10.1145/2670979.2670992. Online publication date: 3-Nov-2014.
  • (2013) Performance troubleshooting in data centers. ACM SIGOPS Operating Systems Review 47:3, pages 50-62. DOI: 10.1145/2553070.2553079. Online publication date: 26-Nov-2013.
  • (2012) Adaptive monitoring. Proceedings of the 8th International Conference on Network and Service Management, pages 350-356. DOI: 10.5555/2499406.2499462. Online publication date: 22-Oct-2012.
  • (2012) VScope. Proceedings of the 13th International Middleware Conference, pages 121-141. DOI: 10.5555/2442626.2442635. Online publication date: 3-Dec-2012.
  • (2012) Fmeter. Proceedings of the 13th International Middleware Conference, pages 81-100. DOI: 10.5555/2442626.2442633. Online publication date: 3-Dec-2012.