DOI: 10.1145/3472883.3487016
research-article
Open access

OptDebug: Fault-Inducing Operation Isolation for Dataflow Applications

Published: 01 November 2021

Abstract

Fault isolation is extremely challenging in large-scale data processing in cloud environments. Data provenance is the dominant existing approach for isolating the data records responsible for a given output. However, data provenance addresses fault isolation only in the data space, not in the code space: how can we precisely localize the operations or APIs responsible for a given suspicious or incorrect result?
We present OptDebug, which identifies fault-inducing operations in a dataflow application using three insights. First, debugging is easier with a small-scale input than with a large-scale one, so OptDebug uses data provenance to reduce the original input records to a smaller set that yields both failing and passing tests. Second, keeping track of operation provenance is crucial for debugging, so it leverages automated taint analysis to propagate the lineage of operations downstream with individual records. Lastly, each operation may contribute to test failures to a different degree, so OptDebug ranks operations by their spectra, that is, their relative participation frequency in failing vs. passing tests. In our experiments, OptDebug achieves 100% recall and 86% precision in detecting faulty operations and reduces debugging time by 17x compared to a naïve approach. Overall, OptDebug shows great promise for improving developer productivity in today's complex data processing pipelines by obviating the need to repeatedly re-execute the program with different inputs and manually examine program traces to isolate buggy code.
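The second and third insights can be illustrated with a toy sketch. This is not OptDebug's implementation: the `Tainted` wrapper, the `pipeline` function, the operation names, and the pass/fail oracle are all hypothetical, and the Ochiai coefficient is used only as one common spectrum-based suspiciousness metric, since the abstract does not state which formula OptDebug applies. Each record carries the set of operations that touched it, and operations are then ranked by how often they appear in failing versus passing runs.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tainted:
    """A value paired with the names of the operations that produced it."""
    value: float
    ops: frozenset = field(default_factory=frozenset)

    def apply(self, op_name, fn):
        # Run fn on the value and add op_name to the operation lineage.
        return Tainted(fn(self.value), self.ops | {op_name})


def pipeline(x):
    """Hypothetical dataflow; the 'scale' branch holds an injected bug."""
    rec = Tainted(x).apply("parse", float)
    if rec.value > 100:
        rec = rec.apply("scale", lambda v: v * 10)  # bug: should divide by 10
    else:
        rec = rec.apply("keep", lambda v: v)
    return rec.apply("round", round)


def ochiai(runs):
    """runs: list of (ops, failed) pairs. Returns {op: suspiciousness}."""
    total_failed = sum(1 for _, failed in runs if failed)
    scores = {}
    for op in set().union(*(ops for ops, _ in runs)):
        ef = sum(1 for ops, failed in runs if failed and op in ops)
        ep = sum(1 for ops, failed in runs if not failed and op in ops)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[op] = ef / denom if denom else 0.0
    return scores


# Assumed oracle for this toy example: any output above 1000 is a failure.
results = [pipeline(x) for x in [50, 80, 150, 300]]
runs = [(r.ops, r.value > 1000) for r in results]
scores = ochiai(runs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores[ranking[0]])
```

In this sketch only `scale` appears exclusively in failing runs, so it ranks first, while `parse` and `round` appear in every run and receive a lower score. Carrying the operation set with each record is what lets the ranking work per record rather than per program execution.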

Supplementary Material

MP4 File (Day2_8-1.mp4)
Presentation video

Information & Contributors

Information

Published In

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
November 2021
685 pages
ISBN:9781450386388
DOI:10.1145/3472883
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bug isolation
  2. data intensive scalable computing
  3. debugging
  4. taint analysis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '21: ACM Symposium on Cloud Computing
November 1 - 4, 2021
Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 66
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 15 Jan 2025

Cited By

  • (2024) DeSQL: Interactive Debugging of SQL in Data-Intensive Scalable Computing. Proceedings of the ACM on Software Engineering 1:FSE (767-788). https://doi.org/10.1145/3643761. Online publication date: 12-Jul-2024
  • (2024) Automatic Debugging of Design Faults in MapReduce Applications. IEEE Transactions on Software Engineering 50:4 (956-978). https://doi.org/10.1109/TSE.2024.3369766. Online publication date: Apr-2024
  • (2024) Version - [SAMbA-RaP is music to scientists’ ears: Adding provenance support to spark-based scientific workflows]. SoftwareX 28 (101927). https://doi.org/10.1016/j.softx.2024.101927. Online publication date: Dec-2024
  • (2023) Software Engineering for Data Intensive Scalable Computing and Heterogeneous Computing. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE) (54-68). https://doi.org/10.1109/ICSE-FoSE59343.2023.00006. Online publication date: 14-May-2023
  • (2022) Machine programming. Proceedings of the VLDB Endowment 15:12 (3754-3757). https://doi.org/10.14778/3554821.3554892. Online publication date: 1-Aug-2022
