[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2541940.2541973acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

Leveraging the short-term memory of hardware to diagnose production-run software failures

Published: 24 February 2014 Publication History

Abstract

Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once.
This paper designs a low overhead, low latency, privacy preserving production-run failure diagnosis system based on two observations. First, short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, Last Branch Record (LBR), that records the last few taken branches to help diagnose sequential bugs. We then propose a simple hardware extension, Last Cache-coherence Record (LCR), to record the last few cache accesses with specified coherence states and hence help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR.
Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes for 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,

References

[1]
G. Altekar and I. Stoica. ODR: Output-deterministic replay for multicore debugging. In SOSP, 2009.
[2]
J. Arulraj, P.-C. Chang, G. Jin, and S. Lu. Production-run software failure diagnosis via hardware performance counters. In ASPLOS, 2013.
[3]
012)}nainar.12P. Arumuga Nainar. Applications of Static Analysis and Program Structure in Statistical Debugging. PhD thesis, University of Wisconsin -- Madison, 2012.
[4]
P. Arumuga Nainar and B. Liblit. Adaptive bug isolation. In ICSE, 2010.
[5]
B. Buck and J. K. Hollingsworth. An API for runtime code patching. Int. J. High Perform. Comput. Appl., 2000.
[6]
L. Ceze, P. Montesinos, C. von Praun, and J. Torrellas. Colorama: Architectural support for data-centric synchronization. In HPCA, 2007.
[7]
L. Chew and D. Lie. Kivati: Fast detection and prevention of atomicity violations. In EuroSys, 2010.
[8]
CNNMoneyTech. Is knight's $440 million glitch the costliest computer bug ever? http://money.cnn.com/2012/08/09/technology/knight-expensive-computer-bug/index.html.
[9]
J. Demme and S. Sethumadhavan. Rapid identification of architectural bottlenecks via precise event counting. In ISCA, 2011.
[10]
J. Demme, M. Maycock, J. Schmitz, A. Tang, A. Waksman, S. Sethumadhavan, and S. Stolfo. On the feasibility of online malware detection with performance counters. In ISCA, 2013.
[11]
J. L. Greathouse, Z. Ma, M. I. Frank, R. Peri, and T. M. Austin. Demand-driven software race detection using hardware performance counters. In ISCA, 2011.
[12]
W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z. Yang. Characterization of Linux kernel behavior under errors. In DSN, 2003.
[13]
D. R. Hower, P. Montesinos, L. Ceze, M. D. Hill, and J. Torrellas. Two hardware-based approaches for deterministic multiprocessor replay. Commun. ACM, 52 (6), June 2009.
[14]
Intel. Nehalem PMU programming guide. http://software.intel.com/sites/default/files/76/87/30320, 2010.
[15]
Intel. Intel 64 and IA-32 architectures software developer's manual. http://download.intel.com/products/processor/manual/325462.pdf, 2013.
[16]
Intel. GDB - the GNU* project debugger for Intel architecture. http://software.intel.com/en-us/articles/intel-system-studio-gdb#Trace, March 2013.
[17]
J. L. Lions et. al. ARIANE 5 Flight 501 Failure -- report by the inquiry board. http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html.
[18]
G. Jin, A. Thakur, B. Liblit, and S. Lu. Instrumentation and sampling strategies for cooperative concurrency bug isolation. In OOPSLA, 2010.
[19]
S. T. King, G. W. Dunlap, and P. M. Chen. Operating systems with time-traveling virtual machines. In Usenix Annual Technical Conference, 2005.
[20]
D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism. In ASPLOS, 2010.
[21]
N. Leveson and C. S. Turner. An investigation of the Therac-25 accidents. In IEEE Computer, 1993.
[22]
B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In PLDI, 2003.
[23]
B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan. Scalable statistical bug isolation. In PLDI, 2005.
[24]
S. Lu, J. Tucek, F. Qin, and Y. Zhou. AVIO: Detecting atomicity violations via access interleaving invariants. In ASPLOS, 2006.
[25]
S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes -- a comprehensive study of real world concurrency bug characteristics. In ASPLOS, 2008.
[26]
B. Lucia and L. Ceze. Finding concurrency bugs with context-aware communication graphs. In MICRO, 2009.
[27]
C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI, 2005.
[28]
V. Nagarajan and R. Gupta. ECMon: Exposing cache events for monitoring. In ISCA, 2009.
[29]
S. Narayanasamy, G. Pokam, and B. Calder. BugNet: Continuously recording program execution for deterministic replay debugging. In ISCA, 2005.
[30]
}nasdaqPCWorld. Nasdaq's Facebook Glitch Came From Race Conditions. http://www.pcworld.com/businesscenter/article/255911/nasdaqs_facebook_glitch_came_from_race_conditions.html.
[31]
C. Pedersen and J. Acampora. Intel code execution trace resources. http://noggin.intel.com/content/intel-code-execution-trace-resources, 2012.
[32]
M. Prvulovic. CORD: Cost-effective (and nearly overhead-free) order-reordering and data race detection. In HPCA, 2006.
[33]
M. Prvulovic and J. Torrellas. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In ISCA, 2003.
[34]
F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou. Rx: Treating bugs as allergies c a safe method to survive software failures. In SOSP, 2005.
[35]
R. Santelices, J. A. Jones, Y. Yu, and M. J. Harrold. Lightweight fault-localization using multiple coverage types. In ICSE, 2009.
[36]
SecurityFocus. Software bug contributed to blackout. http://www.securityfocus.com/news/8016.
[37]
T. Sheng, N. Vachharajani, S. Eranian, R. Hundt, W. Chen, and W. Zheng. RACEZ: A lightweight and non-invasive race detection tool for production applications. In ICSE, 2011.
[38]
J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou. Triage: Diagnosing production run failures at the user's site. In SOSP, 2007.
[39]
G. Venkataramani, B. Roemer, Y. Solihin, and M. Prvulovic. MemTracker: Efficient and programmable support for memory access monitoring and debugging. In HPCA, 2007.
[40]
K. Walcott-Justice, J. Mars, and M. L. Soffa. THeME: A system for testing by hardware monitoring events. In ISSTA, 2012.
[41]
C. Willems, R. Hund, A. Fobian, D. Felsch, T. Holz, and A. Vasudevan. Down to the bare metal: Using processor features for binary analysis. In ACSAC, 2012.
[42]
J. Yu and S. Narayanasamy. A case for an interleaving constrained shared-memory multi-processor. In ISCA, 2009.
[43]
D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: Error diagnosis by connecting clues from run-time logs. In ASPLOS, 2010.
[44]
Yuan, Zheng, Park, Zhou, and Savage}logenhancer.asplos11D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage. Improving software diagnosability via log enhancement. In ASPLOS, 2011.
[45]
D. Yuan, S. Park, P. Huang, Y. Liu, M. M. Lee, Y. Zhou, and S. Savage. Be conservative: Enhancing failure diagnosis with proactive logging. In OSDI, 2012.
[46]
Yuan, Xing, Chen, and Zang}binyu.apsys11L. Yuan, W. Xing, H. Chen, and B. Zang. Security breaches as PMU deviation: Detecting and identifying security attacks using performance counters. In phAPSys, 2011.
[47]
W. Zhang, C. Sun, and S. Lu. ConMem: Detecting severe concurrency bugs through an effect-oriented approach. In ASPLOS, 2010.
[48]
W. Zhang, J. Lim, R. Olichandran, J. Scherpelz, G. Jin, S. Lu, and T. Reps. ConSeq: Detecting concurrency bugs through sequential errors. In ASPLOS, 2011.
[49]
P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iWatcher: Efficient architecture support for software debugging. In ISCA, 2004.
[50]
P. Zhou, R. Teodorescu, and Y. Zhou. Hard: Hardware-assisted lockset-based race detection. In HPCA, 2007.

Cited By

View all
  • (2023)Diagnosing Kernel Concurrency Failures with AITIAProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3567486(94-110)Online publication date: 8-May-2023
  • (2022)A deep study of the effects and fixes of server-side request races in web applicationsProceedings of the 19th International Conference on Mining Software Repositories10.1145/3524842.3528463(744-756)Online publication date: 23-May-2022
  • (2021)Understanding and detecting server-side request races in web applicationsProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3468264.3468594(842-854)Online publication date: 20-Aug-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN:9781450323055
DOI:10.1145/2541940
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2014

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. concurrency bugs
  2. failure diagnosis
  3. hardware performance monitoring unit
  4. production runs

Qualifiers

  • Research-article

Conference

ASPLOS '14

Acceptance Rates

ASPLOS '14 Paper Acceptance Rate 49 of 217 submissions, 23%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)19
  • Downloads (Last 6 weeks)4
Reflects downloads up to 17 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Diagnosing Kernel Concurrency Failures with AITIAProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3567486(94-110)Online publication date: 8-May-2023
  • (2022)A deep study of the effects and fixes of server-side request races in web applicationsProceedings of the 19th International Conference on Mining Software Repositories10.1145/3524842.3528463(744-756)Online publication date: 23-May-2022
  • (2021)Understanding and detecting server-side request races in web applicationsProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3468264.3468594(842-854)Online publication date: 20-Aug-2021
  • (2021)JPortal: precise and efficient control-flow tracing for JVM programs with Intel processor traceProceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3453483.3454096(1080-1094)Online publication date: 19-Jun-2021
  • (2019)The inflection point hypothesisProceedings of the 27th ACM Symposium on Operating Systems Principles10.1145/3341301.3359650(131-146)Online publication date: 27-Oct-2019
  • (2019)Understanding Real-World Concurrency Bugs in GoProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304069(865-878)Online publication date: 4-Apr-2019
  • (2019)Semantics-aware scheduling policies for synchronization determinismProceedings of the 24th Symposium on Principles and Practice of Parallel Programming10.1145/3293883.3295731(242-256)Online publication date: 16-Feb-2019
  • (2018)REPTProceedings of the 13th USENIX conference on Operating Systems Design and Implementation10.5555/3291168.3291171(17-32)Online publication date: 8-Oct-2018
  • (2018)iReplayer: in-situ and identical record-and-replay for multithreaded applicationsACM SIGPLAN Notices10.1145/3296979.319238053:4(344-358)Online publication date: 11-Jun-2018
  • (2018)TIFFProceedings of the 34th Annual Computer Security Applications Conference10.1145/3274694.3274746(505-517)Online publication date: 3-Dec-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media