DOI: 10.1109/ICSE43902.2021.00083
research-article

Centris: A Precise and Scalable Approach for Identifying Modified Open-Source Software Reuse

Published: 05 November 2021

Abstract

Open-source software (OSS) is widely reused because it makes software development more convenient and efficient. Despite these benefits, unmanaged OSS components can introduce threats such as vulnerability propagation and license violation. Identifying reused OSS components is challenging, however, because reused OSS is predominantly modified and nested. In this paper, we propose CENTRIS, a precise and scalable approach for identifying modified OSS reuse. By segmenting an OSS code base and detecting the reuse of only the unique part of the OSS, CENTRIS can precisely identify modified OSS reuse in the presence of nested OSS components. For scalability, CENTRIS eliminates redundant code comparisons and accelerates the search using hash functions. When we applied CENTRIS to 10,241 widely employed GitHub projects, comprising 229,326 versions and 80 billion lines of code, we observed that modified OSS reuse is the norm in software development, occurring 20 times more frequently than exact reuse. Nonetheless, CENTRIS identified reused OSS components with 91% precision and 94% recall in less than a minute per application on average, whereas a recent clone detection technique, which does not account for modified and nested OSS reuse, barely reached 10% precision and 40% recall.
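The idea of matching only the unique part of each OSS component can be illustrated with a minimal sketch. This is not the CENTRIS implementation: the abstract does not specify the hash function, the segmentation rule, or a decision threshold, so the sketch assumes SHA-256 over whitespace-stripped function bodies, treats "unique" as "contained in no other component of the database", and uses a hypothetical threshold of 0.5. The names `function_hash`, `unique_parts`, and `detect_reuse` are illustrative only.

```python
from __future__ import annotations

import hashlib
import re


def function_hash(body: str) -> str:
    """Hash a function body after removing all whitespace (a crude normalization)."""
    normalized = re.sub(r"\s+", "", body)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_parts(components: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each component, keep only the hashes that no other component contains.

    Assumption: this simple set difference stands in for the paper's segmentation.
    """
    unique = {}
    for name, hashes in components.items():
        others = set().union(*(h for n, h in components.items() if n != name))
        unique[name] = hashes - others
    return unique


def detect_reuse(
    target: set[str], components: dict[str, set[str]], threshold: float = 0.5
) -> dict[str, float]:
    """Report components whose unique part overlaps the target by at least `threshold`.

    The threshold value is hypothetical, chosen only for this example.
    """
    found = {}
    for name, part in unique_parts(components).items():
        if not part:
            continue  # nothing unique to match against
        ratio = len(part & target) / len(part)
        if ratio >= threshold:
            found[name] = ratio
    return found


# Toy database: libB nests two functions copied from libA.
a1 = "int add(int a, int b) { return a + b; }"
a2 = "int sub(int a, int b) { return a - b; }"
a3 = "int mul(int a, int b) { return a * b; }"
a4 = "int neg(int a) { return -a; }"
b1 = "void log_msg(const char *m) { puts(m); }"
b2 = "int is_zero(int a) { return a == 0; }"

components = {
    "libA": {function_hash(f) for f in (a1, a2, a3, a4)},
    "libB": {function_hash(f) for f in (a1, a2, b1, b2)},
}

# The target reuses libA with one function modified (neg) and adds its own code.
target = {
    function_hash("int add(int a,int b){return a+b;}"),  # reformatted copy of a1
    function_hash(a2),
    function_hash(a3),
    function_hash("int neg(int a) { return 0 - a; }"),   # modified a4
    function_hash("int main(void) { return 0; }"),
}

result = detect_reuse(target, components)
print(result)  # {'libA': 0.5}
```

Comparing against whole code bases instead would give libB an overlap of 2 of its 4 functions, the same score as the threshold, even though the target only contains code that libB itself borrowed from libA. Restricting the comparison to the unique part removes that false positive, and set lookups on hashes avoid pairwise code comparison.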


Published In

ICSE '21: Proceedings of the 43rd International Conference on Software Engineering
May 2021
1768 pages
ISBN: 9781450390859

Publisher

IEEE Press

Author Tags

  1. Open-Source Software
  2. Software Composition Analysis
  3. Software Security

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICSE '21
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 8

Reflects downloads up to 04 Jan 2025

Cited By

  • (2024) VMud: Detecting Recurring Vulnerabilities with Multiple Fixing Functions via Function Selection and Semantic Equivalent Statement Matching. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 3958-3972. DOI: 10.1145/3658644.3690372. Online publication date: 2-Dec-2024
  • (2024) CNEPS: A Precise Approach for Examining Dependencies among Third-Party C/C++ Open-Source Components. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-12. DOI: 10.1145/3597503.3639209. Online publication date: 20-May-2024
  • (2024) LibAlchemy: A Two-Layer Persistent Summary Design for Taming Third-Party Libraries in Static Bug-Finding Systems. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-13. DOI: 10.1145/3597503.3639132. Online publication date: 20-May-2024
  • (2024) BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-13. DOI: 10.1145/3597503.3639100. Online publication date: 20-May-2024
  • (2023) LibAM: An Area Matching Framework for Detecting Third-Party Libraries in Binaries. ACM Transactions on Software Engineering and Methodology 33:2, pp. 1-35. DOI: 10.1145/3625294. Online publication date: 23-Dec-2023
  • (2023) Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 960-972. DOI: 10.1145/3611643.3616299. Online publication date: 30-Nov-2023
  • (2021) Dicos: Discovering Insecure Code Snippets from Stack Overflow Posts by Leveraging User Discussions. Annual Computer Security Applications Conference, pp. 194-206. DOI: 10.1145/3485832.3488026. Online publication date: 6-Dec-2021
