DOI: 10.1145/3324884.3416641
research-article
Open access

BigFuzz: efficient fuzz testing for data analytics using framework abstraction

Published: 27 January 2021

Abstract

As big data analytics become increasingly popular, data-intensive scalable computing (DISC) systems help address the scalability issue of handling large data. However, automated testing for such data-centric applications is challenging, because data is often incomplete, continuously evolving, and hard to know a priori. Fuzz testing has proven highly effective in other domains such as security; however, it is nontrivial to apply traditional fuzzing to big data analytics directly, for three reasons: (1) the long latency of DISC systems makes fuzzing impractical: naïve fuzzing would spend 98% of the time setting up a test environment; (2) conventional branch coverage is unlikely to scale to DISC applications, because most binary code comes from the framework implementation, such as Apache Spark; and (3) random bit- or byte-level mutations can hardly generate meaningful data and therefore fail to reveal real-world application bugs.
We propose a novel coverage-guided fuzz testing tool for big data analytics, called BigFuzz. Our approach rests on two key ideas: (a) we focus on exercising application logic, as opposed to increasing framework code coverage, by abstracting the DISC framework using specifications; BigFuzz performs automated source-to-source transformations to construct an equivalent DISC application suitable for fast test generation; and (b) we design schema-aware data mutation operators based on our in-depth study of DISC application error types. BigFuzz speeds up fuzzing by 78X to 1477X compared to random fuzzing, improves application code coverage by 20% to 271%, and achieves a 33% to 157% improvement in detecting application errors. Compared to the state of the art, which uses symbolic execution to test big data analytics, BigFuzz is applicable to twice as many programs and can find 81% more bugs.
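
The two ideas in the abstract can be illustrated with a minimal sketch. It is not BigFuzz's implementation: the three-column schema, the function names (`local_pipeline`, `mutate_row`, `fuzz`), and the four mutation operators are hypothetical choices made for illustration, and the coverage feedback that guides BigFuzz is omitted. The sketch only shows (a) application logic running as plain code with no framework start-up, and (b) mutations that respect row and column structure rather than flipping random bytes.

```python
import random
from collections import defaultdict

# Hypothetical schema for one CSV row: (zip_code: int, age: int, city: str).
SCHEMA = [int, int, str]

def local_pipeline(lines):
    """Plain-Python stand-in for a dataflow job (map -> filter -> reduceByKey).

    No cluster or framework is started; only the application logic runs.
    """
    counts = defaultdict(int)
    for line in lines:
        cols = line.split(",")            # map: parse the row
        zip_code, age = cols[0], int(cols[1])
        if age >= 18:                      # filter: keep adults only
            counts[zip_code] += 1          # reduceByKey: count per zip code
    return dict(counts)

def mutate_row(line, rng):
    """Apply one illustrative schema-aware mutation to a comma-separated row."""
    cols = line.split(",")
    op = rng.choice(["type", "empty", "drop", "delimiter"])
    i = rng.randrange(len(cols))
    if op == "type":
        # Put a value of the wrong type into the chosen column.
        cols[i] = "abc" if SCHEMA[i] is int else "123"
    elif op == "empty":
        cols[i] = ""                       # empty field
    elif op == "drop":
        del cols[i]                        # missing column
    else:
        return ";".join(cols)              # wrong delimiter
    return ",".join(cols)

def fuzz(seed_rows, trials=200, seed=0):
    """Mutate one row per trial and record the first input per error type."""
    rng = random.Random(seed)
    failures = {}
    for _ in range(trials):
        rows = list(seed_rows)
        k = rng.randrange(len(rows))
        rows[k] = mutate_row(rows[k], rng)
        try:
            local_pipeline(rows)
        except (ValueError, IndexError) as exc:
            failures.setdefault(type(exc).__name__, rows[k])
    return failures
```

Calling `fuzz(["90024,20,LA", "90025,17,SF"])` returns a dictionary that maps each exception name to the first mutated row that triggered it. Because every trial is an ordinary function call, the loop spends no time on framework set-up, which is the latency problem the abstract attributes to naïve fuzzing.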



    Published In

    ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
    December 2020
    1449 pages
    ISBN: 9781450367684
    DOI: 10.1145/3324884
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. big data analytics
    2. fuzz testing
    3. test generation

    Qualifiers

    • Research-article

    Funding Sources

    • NSF
    • Samsung
    • Alexander von Humboldt Foundation
    • Google PhD Fellowship
    • Intel
    • ONR

    Conference

    ASE '20
    Acceptance Rates

    Overall Acceptance Rate 82 of 337 submissions, 24%

    Article Metrics

    • Downloads (last 12 months): 249
    • Downloads (last 6 weeks): 28
    Reflects downloads up to 13 Dec 2024

    Cited By

    • (2024) Natural Symbolic Execution-Based Testing for Big Data Analytics. Proceedings of the ACM on Software Engineering 1:FSE (2677-2700). DOI: 10.1145/3660825. Online publication date: 12-Jul-2024
    • (2024) A novel generative adversarial network-based fuzzing cases generation method for industrial control system protocols. Computers and Electrical Engineering 117 (109268). DOI: 10.1016/j.compeleceng.2024.109268. Online publication date: Jul-2024
    • (2024) When less is more: on the value of “co-training” for semi-supervised software defect predictors. Empirical Software Engineering 29:2. DOI: 10.1007/s10664-023-10418-4. Online publication date: 24-Feb-2024
    • (2023) Leveraging Hardware Probes and Optimizations for Accelerating Fuzz Testing of Heterogeneous Applications. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1101-1113). DOI: 10.1145/3611643.3616318. Online publication date: 30-Nov-2023
    • (2023) Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1050-1061). DOI: 10.1145/3611643.3616298. Online publication date: 30-Nov-2023
    • (2023) Guiding Greybox Fuzzing with Mutation Testing. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (929-941). DOI: 10.1145/3597926.3598107. Online publication date: 12-Jul-2023
    • (2023) Randomized Testing of Byzantine Fault Tolerant Algorithms. Proceedings of the ACM on Programming Languages 7:OOPSLA1 (757-788). DOI: 10.1145/3586053. Online publication date: 6-Apr-2023
    • (2023) Uncovering Bugs in Code Coverage Profilers via Control Flow Constraint Solving. IEEE Transactions on Software Engineering 49:11 (4964-4987). DOI: 10.1109/TSE.2023.3321381. Online publication date: Nov-2023
    • (2023) SparkAC: Fine-Grained Access Control in Spark for Secure Data Sharing and Analytics. IEEE Transactions on Dependable and Secure Computing 20:2 (1104-1123). DOI: 10.1109/TDSC.2022.3149544. Online publication date: 1-Mar-2023
    • (2023) Generating Test Databases for Database-Backed Applications. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (2048-2059). DOI: 10.1109/ICSE48619.2023.00173. Online publication date: May-2023
