DOI:10.1145/945445.945454
Article

Performance debugging for distributed systems of black boxes

Published: 19 October 2003

Abstract

Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.

We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.
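The abstract does not spell out either algorithm, so the following is only a generic illustration of the signal-processing idea, not the authors' method. It is a minimal sketch, assuming that message timestamps on two edges of a node are available as integer milliseconds, and it uses plain cross-correlation to guess the most common delay between messages entering a node and messages leaving it. All names (`to_series`, `cross_correlation`, `estimate_delay_ms`) and parameters are hypothetical.

```python
def to_series(timestamps_ms, bin_ms, n_bins):
    """Convert message timestamps (integer ms) into per-bin message counts."""
    counts = [0] * n_bins
    for t in timestamps_ms:
        b = t // bin_ms
        if 0 <= b < n_bins:
            counts[b] += 1
    return counts


def cross_correlation(a, b, max_lag):
    """Return c where c[lag] = sum over i of a[i] * b[i + lag], for lag in 0..max_lag."""
    n = len(a)
    return [
        sum(a[i] * b[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def estimate_delay_ms(in_times_ms, out_times_ms, bin_ms=1, max_lag_bins=200):
    """Guess the dominant delay between inbound and outbound messages.

    Assumption (not from the paper): the lag with the largest correlation
    is taken as the typical processing delay at the node.
    """
    horizon = max(in_times_ms + out_times_ms)
    n_bins = horizon // bin_ms + 1
    a = to_series(in_times_ms, bin_ms, n_bins)
    b = to_series(out_times_ms, bin_ms, n_bins)
    c = cross_correlation(a, b, min(max_lag_bins, n_bins - 1))
    best_lag = max(range(len(c)), key=c.__getitem__)
    return best_lag * bin_ms


# Hypothetical trace: every inbound message triggers an outbound one 15 ms later.
inbound = [100, 250, 400, 700, 1000]
outbound = [t + 15 for t in inbound]
delay = estimate_delay_ms(inbound, outbound, bin_ms=1, max_lag_bins=50)
```

Note that this sketch needs only timestamps and edge identities, which matches the abstract's constraint of knowing nothing about node internals or message semantics. A real trace would contain overlapping, unrelated traffic, so the correlation peak would be noisier than in this toy input.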

Information & Contributors

Information

Published In

SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles
October 2003
338 pages
ISBN:1581137575
DOI:10.1145/945445
  • ACM SIGOPS Operating Systems Review, Volume 37, Issue 5
    SOSP '03
    December 2003
    329 pages
    ISSN:0163-5980
    DOI:10.1145/1165389
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. black box systems
  2. distributed systems
  3. performance analysis
  4. performance debugging

Qualifiers

  • Article

Conference

SOSP03: ACM Symposium on Operating Systems Principles
October 19 - 22, 2003
Bolton Landing, NY, USA

Acceptance Rates

SOSP '03 Paper Acceptance Rate 22 of 128 submissions, 17%;
Overall Acceptance Rate 174 of 961 submissions, 18%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 105
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 01 Jan 2025

Citations

Cited By

  • (2024) Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 50-61. DOI:10.1145/3663529.3663827. Online publication date: 10-Jul-2024
  • (2024) SparseRCA: Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), pages 391-402. DOI:10.1109/ISSRE62328.2024.00045. Online publication date: 28-Oct-2024
  • (2023) Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing. ACM Transactions on Computer Systems, 42:1-2, pages 1-37. DOI:10.1145/3630006. Online publication date: 18-Nov-2023
  • (2023) Always-On Recording Framework for Serverless Computations: Opportunities and Challenges. Proceedings of the 1st Workshop on SErverless Systems, Applications and MEthodologies, pages 41-49. DOI:10.1145/3592533.3592810. Online publication date: 8-May-2023
  • (2023) Performance Bug Analysis and Detection for Distributed Storage and Computing Systems. ACM Transactions on Storage, 19:3, pages 1-33. DOI:10.1145/3580281. Online publication date: 19-Jun-2023
  • (2023) WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection. Proceedings of the Eighteenth European Conference on Computer Systems, pages 111-126. DOI:10.1145/3552326.3567507. Online publication date: 8-May-2023
  • (2023) Combatting Energy Issues for Mobile Applications. ACM Transactions on Software Engineering and Methodology, 32:1, pages 1-44. DOI:10.1145/3527851. Online publication date: 13-Feb-2023
  • (2023) FLEET: Fluid Layout of End to End Topology. 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS), pages 601-608. DOI:10.1109/ICPADS56603.2022.00084. Online publication date: Jan-2023
  • (2022) B-MEG. Companion of the 2022 ACM/SPEC International Conference on Performance Engineering, pages 7-11. DOI:10.1145/3491204.3527494. Online publication date: 14-Jul-2022
  • (2022) Microservices Monitoring with Event Logs and Black Box Execution Tracing. IEEE Transactions on Services Computing, 15:1, pages 294-307. DOI:10.1109/TSC.2019.2940009. Online publication date: 1-Jan-2022
