
CN110704308A - Multistage feature extraction method - Google Patents


Info

Publication number
CN110704308A
CN110704308A CN201910857082.4A CN201910857082A CN110704308A CN 110704308 A CN110704308 A CN 110704308A CN 201910857082 A CN201910857082 A CN 201910857082A CN 110704308 A CN110704308 A CN 110704308A
Authority
CN
China
Prior art keywords
code
software project
information
file
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910857082.4A
Other languages
Chinese (zh)
Other versions
CN110704308B (en)
Inventor
程华
王明扬
吕正辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910857082.4A
Publication of CN110704308A
Application granted
Publication of CN110704308B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the technical field of software code similarity detection, and in particular relates to a multistage feature extraction method applied to code similarity detection. The method comprises: acquiring and storing a mixed feature set for each software project in a code base; the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing the syntax, semantics and text of code segments in the software project. Because the full mixed feature set of each software project is acquired and stored in the code base in advance, the software project can be described at multiple levels (folders, files, functions and code segments), which improves the detection precision of the system. After the software to be tested is input, only the features of the software to be tested need to be calculated in real time before the comparison is carried out, which increases the comparison speed.

Description

Multistage feature extraction method
Technical Field
The invention belongs to the technical field of software code similarity detection, and particularly relates to a multistage feature extraction method applied to code similarity detection.
Background
The detection of repeated codes (also called clone codes) is an important task in the development and maintenance activities of computer software, and is widely applied in a plurality of fields of source code plagiarism detection, software component library inquiry, software defect detection, program understanding and the like.
Invention patent application CN101697121A (application publication date April 21, 2010) discloses a code similarity detection method based on analysis of program source code: the two pieces of source code to be detected are each parsed into the control dependency tree of a system dependency graph, and basic code normalization is performed on each; candidate similar-code control dependency trees are extracted from the two normalized control dependency trees using a quantitative-value method; advanced code normalization is performed on the extracted candidate similar code; and semantic similarity is calculated to obtain a similarity result, completing the code similarity detection. The method addresses problems of the prior art: low detection accuracy for code with similar semantics but different syntactic representations, high computational complexity, and the inability to detect similarity in large-scale program code.
Conventional code similarity analysis systems, such as the one in the above patent application, characterize code with a relatively uniform feature chosen according to the application scenario and code size: the analysis granularity is essentially fixed at the code line, one of four perspectives (textual, lexical, syntactic or semantic) is selected, and a single value, string, tree or graph feature is constructed. Such single features are widely used for code similarity analysis within a project, but are not suitable for code similarity analysis between projects: as code scale grows dramatically, a flat single feature cannot describe the code fully.
Patent application CN109542766A (application publication date March 29, 2019) discloses a method for rapid similarity detection and evidence generation for large-scale programs based on code mapping and lexical analysis. It performs plagiarism detection and evidence generation on large-scale software samples with a two-layer similarity detection method: first, coarse-grained similarity analysis is carried out on the large-scale program using a code mapping method to quickly find suspected similar programs; then, fine-grained lexical analysis is performed on the suspected similar programs to judge program similarity, so that plagiarized code in large-scale samples is found quickly and accurately.
In conventional code similarity analysis systems such as those in the above patent applications, all code features are calculated in real time within the system, which is costly and slow. In addition, the user cannot adjust the comparison strategy and comparison conditions according to requirements on detection type, detection precision and detection speed. Similarity analysis of large-scale autonomous mixed-source software uses inter-project comparison; that is, a software project input into the similarity analysis system must be compared against a huge code base in the system. Different software to be tested differs in code size, detection requirements and so on, so which features need to be calculated can only be determined after the software to be tested is input; calculating the features of the code base in real time at that point is time-consuming.
Disclosure of Invention
The invention aims to provide a multi-stage feature extraction method which is applicable to code similarity analysis among projects and can provide comprehensive code description when the code scale is increased sharply.
A multi-stage feature extraction method is characterized in that:
acquiring and storing a mixed feature set of each software project of a code base;
the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing syntax, semantics and texts of code segments in the software project.
According to this technical scheme, the full mixed feature set of each software project in the code base is acquired and stored in the code base in advance, so that the software project can be described completely at multiple levels such as folders, files, functions and code segments, which improves the detection precision of the system. After the software to be tested is input, only the features of the software to be tested need to be calculated in real time before the comparison is carried out, which increases the comparison speed.
Further, after acquiring and storing the mixed feature set, the method further includes: acquiring one or more of folder-level features, file-level features, function-level features and code segment-level features of the software project to be detected according to the detection requirement of the software project to be detected; and performing feature matching between the acquired one or more features of the software project to be detected and the corresponding features in the mixed feature sets of the software projects in the code base. Multiple mixed feature subsets can be formed from the mixed feature set to meet different testing requirements, which markedly improves the usability of the system.
Preferably, the folder statistics information of each folder in the software project and the association information between the files/functions/variables contained in each folder are obtained as the folder-level features of the corresponding software project.
Preferably, the folder statistical information includes the number of files, the file type, the file size, and the programming language of the folder; the association information between the files/functions/variables contained in the folder comprises a file association graph and a function cross-file call graph.
Preferably, file statistical information of each file in the software project and associated information between functions in each file are acquired as file-level features of the corresponding software project.
Preferably, the file statistical information comprises an API calling type, API calling times, a static variable type, static variable definition times and static variable use times; the associated information among the functions in the file comprises a function call relation graph.
Preferably, function statistical information of each function in the software project, structured semantic information of codes in the function and structured grammar information of the codes in the function are acquired as the function-level features of the corresponding software project.
Preferably, the function statistical information includes code structure statistical information and variable statistical information; the structured semantic information of the code in the function comprises a code program dependency graph; the structured syntax information of the code within the function includes a code abstract syntax tree.
Preferably, the original text information, the symbol information, and the definition and use information of variables in different contexts of each code segment in the software project are acquired as the code segment-level features of the corresponding software project.
Preferably, the code original text information comprises character string information formed by standard preprocessing of code segments; the code symbol information includes a symbol sequence based on the code original file.
The invention has the following beneficial effects:
The multi-level mixed feature set of a software project can represent information about the project's code at several scales: folder level, file level, function level and code segment level. It characterizes the code from multiple perspectives, including semantic, syntactic, lexical and textual, and uses several forms of expression, such as values, strings, trees and graphs. The full mixed feature set can be flexibly decomposed into multiple mixed feature subsets to meet different testing requirements, which markedly improves the usability of the system.
Drawings
Fig. 1 is a schematic diagram of a typical application scenario of the method of the present invention.
Detailed Description
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Unless otherwise defined, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that the conventional terms should be interpreted as having a meaning that is consistent with their meaning in the relevant art and this disclosure. The present disclosure is to be considered as an example of the invention and is not intended to limit the invention to the particular embodiments.
A multi-level feature extraction method, comprising:
Step S1: acquiring and storing a mixed feature set of each software project of the code base;
Step S2: acquiring one or more of folder-level features, file-level features, function-level features and code segment-level features of the software project to be detected according to the detection requirement of the software project to be detected;
Step S3: performing feature matching between the acquired one or more features of the software project to be detected and the corresponding features in the mixed feature sets of the software projects in the code base.
The mixed feature set comprises folder level features for representing the structure of each folder in the software project, file level features for representing the semantics of each file in the software project, function level features for representing the semantics and syntax of each function in the software project, and code segment level features for representing the syntax, the semantics and the text of each code segment in the software project.
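The relationship between the stored full feature set and the subsets selected at detection time can be pictured with a small data structure. The following Python sketch is illustrative only: the names `HybridFeatureSet`, `code_base`, `store_project` and `candidates_for` are hypothetical, and the patent does not prescribe any storage format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable

# The four feature levels named in the description.
LEVELS = ("folder", "file", "function", "segment")


@dataclass
class HybridFeatureSet:
    """Mixed feature set of one software project, one dict per level."""
    folder: Dict[str, Any] = field(default_factory=dict)
    file: Dict[str, Any] = field(default_factory=dict)
    function: Dict[str, Any] = field(default_factory=dict)
    segment: Dict[str, Any] = field(default_factory=dict)

    def subset(self, levels: Iterable[str]) -> Dict[str, Dict[str, Any]]:
        """Return only the requested levels (a mixed feature subset)."""
        chosen = list(levels)
        unknown = [lv for lv in chosen if lv not in LEVELS]
        if unknown:
            raise ValueError(f"unknown feature level(s): {unknown}")
        return {lv: getattr(self, lv) for lv in chosen}


# Step S1: the code base keeps the full set of every project ahead of time.
code_base: Dict[str, HybridFeatureSet] = {}


def store_project(name: str, features: HybridFeatureSet) -> None:
    code_base[name] = features


def candidates_for(levels: Iterable[str]) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Input to step S3: the chosen subset of every stored project."""
    chosen = list(levels)
    return {name: fs.subset(chosen) for name, fs in code_base.items()}
```

In this sketch, step S2 would compute only the levels in `levels` for the software to be detected, and step S3 would compare them with the result of `candidates_for(levels)`.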
Folder-level features
Folder-level features describe folder structure and are used to characterize the structure of each folder in a software project. They include two main types of features. The first is folder statistics, including the number of files contained in the folder, file types, file sizes, programming languages, creation time, and the like. The second is features that characterize associations between the files, functions and variables contained in a folder, including file association graphs and function cross-file call graphs.
In steps S1 and S2, the folder statistics of each folder in the software project and the association information between the files, functions and variables contained in each folder are acquired as the folder-level features of the corresponding software project.
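As a rough illustration of the folder statistics, the Python sketch below walks a project tree and records, for each folder, the number of files it directly contains, their types and total size, and a language guess. The function name `folder_statistics` and the extension table `LANGUAGE_BY_EXT` are assumptions made for this example; the association graphs are not covered.

```python
import os
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-language table; a real system would need more.
LANGUAGE_BY_EXT = {".c": "C", ".h": "C", ".py": "Python", ".java": "Java"}


def folder_statistics(root):
    """Map each folder under root to statistics about its direct files."""
    stats = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [Path(dirpath) / name for name in filenames]
        # File types are approximated by the lower-cased extension.
        exts = Counter(p.suffix.lower() for p in paths)
        langs = Counter(
            LANGUAGE_BY_EXT[ext] for ext in exts.elements() if ext in LANGUAGE_BY_EXT
        )
        stats[os.path.relpath(dirpath, root)] = {
            "file_count": len(paths),
            "file_types": dict(exts),
            "total_size": sum(p.stat().st_size for p in paths),
            "languages": dict(langs),
        }
    return stats
```

The result is keyed by the folder path relative to `root`, so the same project checked out in two locations yields the same keys.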
File-level features
File-level features describe coarse-grained information in a file, such as its functions and static variables, and are used to represent the semantics of each file in a software project. They may take a mixed form of values, graphs and so on, and contain two main types of features. The first is file statistics, including API call types, the number of API calls, static variable types, the number of static variable definitions, the number of static variable uses, and the like. The second is features that characterize associations between functions within a file, such as a function call relationship graph.
In steps S1 and S2, the file statistics of each file in the software project and the association information between the functions in each file are acquired as the file-level features of the corresponding software project.
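A minimal sketch of two of these file-level features, assuming the analyzed file is Python source and using the standard `ast` module. The function name `file_level_features` is hypothetical, and "API call" is approximated here as any call whose target is not a function defined in the same file; static variable statistics are omitted.

```python
import ast
from collections import Counter


def file_level_features(source):
    """Return API call counts and an intra-file call graph for one file."""
    tree = ast.parse(source)
    functions = {
        node.name: node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    api_calls = Counter()
    call_graph = {name: set() for name in functions}

    for name, fn in functions.items():
        # Simplification: calls inside nested functions are also attributed
        # to the enclosing function, and module-level calls are ignored.
        for node in ast.walk(fn):
            if not isinstance(node, ast.Call):
                continue
            target = node.func
            if isinstance(target, ast.Name) and target.id in functions:
                call_graph[name].add(target.id)   # call to a function in this file
            elif isinstance(target, ast.Name):
                api_calls[target.id] += 1         # e.g. print(...)
            elif isinstance(target, ast.Attribute):
                api_calls[target.attr] += 1       # e.g. os.path.join(...)
    return {"api_call_counts": dict(api_calls), "call_graph": call_graph}
```

The call graph is stored as an adjacency mapping from caller to the set of callees, which is enough to compare edges between two files.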
Function-level features
Function-level features describe code structures inside a function, such as loops and branches, together with information such as the relations between statements, and are used to represent semantic and syntactic information. They may take a mixed form of values, trees, graphs and so on, and contain three types of features. The first is function statistics, which include code structure statistics (e.g., the form and number of loops), variable statistics (e.g., the number of definitions and uses of each variable), and the like. The second is structured semantic information of the code within the function, namely a Program Dependency Graph (PDG), which contains the control flow and data flow information of the function. The third is structured syntax information of the code within the function, namely an Abstract Syntax Tree (AST), which contains the operation type, operands, context and other information for each statement in the function.
In steps S1 and S2, the function statistics of each function in the software project, the structured semantic information of the code in the function and the structured syntax information of the code in the function are acquired as the function-level features of the corresponding software project.
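The function statistics and the syntax tree can be illustrated with Python's `ast` module, again assuming Python source as the analyzed language. The helper `function_level_features` is a hypothetical name; building a program dependency graph is beyond this sketch and is not attempted.

```python
import ast
from collections import Counter


def function_level_features(source, name):
    """Collect structure counts, variable statistics and the AST of one function."""
    tree = ast.parse(source)
    fn = next(
        node for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name == name
    )
    structure = Counter()
    definitions, uses = Counter(), Counter()
    for node in ast.walk(fn):
        if isinstance(node, (ast.For, ast.While)):
            structure["loops"] += 1
        elif isinstance(node, ast.If):
            structure["branches"] += 1
        elif isinstance(node, ast.Name):
            # Store context counts as a definition, Load context as a use.
            # Parameters are ast.arg nodes and are not counted here.
            if isinstance(node.ctx, ast.Store):
                definitions[node.id] += 1
            elif isinstance(node.ctx, ast.Load):
                uses[node.id] += 1
    return {
        "structure": dict(structure),
        "definitions": dict(definitions),
        "uses": dict(uses),
        "ast": ast.dump(fn),  # textual form of the abstract syntax tree
    }
```

Note that an augmented assignment such as `s += x` is recorded only as a definition of `s`, because the `ast` module marks its target with a store context.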
Code segment-level features
Code segment-level features describe the textual and lexical information within a code segment and the use of variables, and are used to represent syntactic, lexical and textual information. They may take a mixed form of values, strings and so on, and contain three types of features. The first is the original text information of the code segment: the code undergoes standard preprocessing, including removing whitespace and replacing constants, to form a string, from which a hash value can further be calculated for comparison. The second is the symbol information of the code segment: the code is converted into a symbol (token) sequence through lexical analysis, and statistics such as the frequency of each token are calculated. The third is the definition and use of variables in different contexts, for example statistics on how often a variable appears in arithmetic operations.
In steps S1 and S2, the original text information, the symbol information, and the definition and use information of variables in different contexts of each code segment in the software project are acquired as the code segment-level features of the corresponding software project.
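The first two kinds of code segment-level features can be sketched as follows. This is a simplified illustration: the regular-expression tokenizer, the `NUM` and `STR` placeholders, the use of SHA-256 and the function names are all assumptions made for this example, and whitespace is canonicalized to single separators rather than removed outright so that adjacent tokens do not merge. Variable definition and use statistics are not covered.

```python
import hashlib
import re
from collections import Counter

# Very rough lexer: string literals, numbers, identifiers, then any other
# single non-space character (operators and punctuation).
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|\d+(?:\.\d+)?|[A-Za-z_]\w*|\S')


def tokenize_segment(segment):
    """Convert a code segment into a token sequence."""
    return TOKEN_RE.findall(segment)


def normalized_text(segment):
    """Canonicalize whitespace and replace constants with placeholders."""
    out = []
    for tok in tokenize_segment(segment):
        if tok[0] == '"':
            out.append("STR")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)


def segment_features(segment):
    """Text hash, token sequence and token frequencies of one segment."""
    tokens = tokenize_segment(segment)
    text = normalized_text(segment)
    return {
        "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "tokens": tokens,
        "token_freq": Counter(tokens),
    }
```

Two segments that differ only in spacing or in literal values then share the same `text_hash`, so an exact hash lookup is enough to pair them.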
In summary, the invention extracts folder-level, file-level, function-level and code segment-level features of the software project code to represent the semantic, syntactic, lexical and textual information of the code; the features may take the form of values, strings, trees or graphs.
Fig. 1 shows a typical application scenario of the method of the present invention: the code base stores the mixed feature complete set of each software project, and the software to be detected can extract different mixed feature subsets for matching according to different testing requirements.
For example, under the first test requirement, the main task is to check whether the software to be tested was derived from code in an open source software library by rewriting such as renaming variables or adding blank lines; the mixed feature subset then mainly includes the syntactic/semantic features at the code segment level. That is, in step S2 of this embodiment, the code segment-level symbol information and the definition and use information of variables in different contexts are calculated in real time for the software project to be detected. In step S3 of this embodiment, the code segment-level symbol information and the definition and use information of variables in different contexts are taken from the mixed feature set of each software project in the code base being compared and combined into mixed feature subset 1 of that software project, which is matched against the features of the software project to be detected extracted in step S2 to calculate the similarity between the two.
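One way to see why token-level features tolerate this kind of rewriting is to abstract every identifier to a placeholder before comparing token sequences. The sketch below assumes Python source and uses the standard `tokenize` and `difflib` modules; the `ID`, `NUM` and `STR` placeholders and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions made for this example, not the matching method specified by the patent.

```python
import difflib
import io
import keyword
import tokenize


def abstract_tokens(source):
    """Token sequence with identifiers and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keywords keep their spelling; every other name becomes ID.
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
        # Comments, newlines, blank lines and indentation tokens are dropped.
    return out


def similarity(source_a, source_b):
    """Ratio in [0, 1] between the abstracted token sequences."""
    a, b = abstract_tokens(source_a), abstract_tokens(source_b)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Because names are erased and blank lines produce no tokens, a copy that only renames variables and inserts empty lines yields the same sequence as the original.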
Under the second test requirement, the task is to check quickly whether the software to be tested reuses some files of an open source software library wholesale; the mixed feature subset then mainly includes the text/lexical features at the file level. That is, in step S2 of this embodiment, the file-level file statistics of the software project to be detected are calculated in real time. In step S3 of this embodiment, the file-level file statistics are taken from the mixed feature set of each software project in the code base being compared and combined into mixed feature subset 2 of that software project, which is matched against the features of the software project to be detected extracted in step S2 to calculate the similarity between the two.
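Matching on file statistics can be cheap because each file reduces to a small vector of counts. The sketch below compares such vectors with cosine similarity; the similarity measure, the function names and the example statistics are assumptions made for illustration, since the patent does not name a specific matching formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors given as dicts."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_stats, stored_stats):
    """Return (score, file_id) of the stored file closest to the query.

    stored_stats maps a file identifier in the code base to its
    precomputed statistics, for example API call counts.
    """
    scored = (
        (cosine_similarity(query_stats, stats), file_id)
        for file_id, stats in stored_stats.items()
    )
    return max(scored, default=(0.0, None))
```

A file copied unchanged has identical statistics and therefore scores 1 against its original, while files that share no API calls score 0.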
Although embodiments of the present invention have been described, various changes or modifications may be made by one of ordinary skill in the art within the scope of the appended claims.

Claims (10)

1. A multi-stage feature extraction method is characterized by comprising the following steps:
acquiring and storing a mixed feature set of each software project of a code base;
the mixed feature set comprises folder-level features representing folder structures in the software project, file-level features representing file semantics in the software project, function-level features representing function semantics and syntax in the software project, and code segment-level features representing syntax, semantics and texts of code segments in the software project.
2. The multi-stage feature extraction method according to claim 1, further comprising, after acquiring and storing the mixed feature set:
acquiring one or more of folder level characteristics, file level characteristics, function level characteristics and code segment level characteristics of the software project to be detected according to the detection requirement of the software project to be detected;
and performing feature matching on the acquired one or more features of the software item to be detected and corresponding features in the mixed feature set of the software items in the code base.
3. The multi-stage feature extraction method according to claim 1 or 2, characterized in that:
acquiring folder statistical information of each folder in the software project and associated information between files/functions/variables contained in each folder as folder-level characteristics of the corresponding software project.
4. The multi-stage feature extraction method according to claim 3, characterized in that:
the folder statistical information comprises the number of files, the types of the files, the sizes of the files and programming languages of the folders;
the association information between the files/functions/variables contained in the folder comprises a file association graph and a function cross-file call graph.
5. The multi-stage feature extraction method according to claim 1 or 2, characterized in that:
and acquiring file statistical information of each file in the software project and associated information among functions in each file as file-level characteristics of the corresponding software project.
6. The multi-stage feature extraction method according to claim 5, wherein:
the file statistical information comprises an API calling type, API calling times, a static variable type, static variable definition times and static variable use times;
the associated information among the functions in the file comprises a function call relation graph.
7. The multi-stage feature extraction method according to claim 1 or 2, characterized in that:
and acquiring function statistical information of each function in the software project, structured semantic information of codes in the function and structured grammar information of the codes in the function as function level characteristics of the corresponding software project.
8. The multi-stage feature extraction method according to claim 7, wherein:
the function statistical information comprises code structure statistical information and variable statistical information;
the structured semantic information of the code in the function comprises a code program dependency graph;
the structured syntax information of the code within the function includes a code abstract syntax tree.
9. The multi-stage feature extraction method according to claim 1 or 2, characterized in that:
acquiring the original text information, the symbol information, and the definition and use information of variables in different contexts of each code segment in the software project.
10. The multi-stage feature extraction method according to claim 9, wherein:
the code original text information comprises character string information formed by standard preprocessing of code segments;
the code symbol information includes a symbol sequence based on the code original file.
CN201910857082.4A 2019-09-11 2019-09-11 Multistage feature extraction method Active CN110704308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910857082.4A CN110704308B (en) 2019-09-11 2019-09-11 Multistage feature extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910857082.4A CN110704308B (en) 2019-09-11 2019-09-11 Multistage feature extraction method

Publications (2)

Publication Number Publication Date
CN110704308A (en) 2020-01-17
CN110704308B CN110704308B (en) 2022-09-09

Family

ID=69195260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910857082.4A Active CN110704308B (en) 2019-09-11 2019-09-11 Multistage feature extraction method

Country Status (1)

Country Link
CN (1) CN110704308B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065726A (en) * 2021-11-18 2022-02-18 北京迪力科技有限责任公司 Data processing method and device
CN116305134A (en) * 2022-11-03 2023-06-23 苏州棱镜七彩信息科技有限公司 Binary system-based software traceability detection method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169321A (en) * 2017-06-10 2017-09-15 西安交通工程学院 The program plagiarism detection method and system being combined based on attribute count and structure measurement technology
CN109062792A (en) * 2018-07-21 2018-12-21 东南大学 A kind of Open Source Code detection method based on String matching and characteristic matching

Also Published As

Publication number Publication date
CN110704308B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN109697162B (en) Software defect automatic detection method based on open source code library
Rigby et al. Discovering essential code elements in informal documentation
CN110737899B (en) Intelligent contract security vulnerability detection method based on machine learning
US8539475B2 (en) API backward compatibility checking
CN111459799B (en) Software defect detection model establishing and detecting method and system based on Github
Liu et al. Automatic detection of outdated comments during code changes
US12106095B2 (en) Deep learning-based java program internal annotation generation method and system
CN114297654A (en) Intelligent contract vulnerability detection method and system for source code hierarchy
US20230273776A1 (en) Code Processing Method and Apparatus, Device, and Medium
Fluri et al. Discovering patterns of change types
CN101576850B (en) Method for testing improved host-oriented embedded software white box
CN114398069B (en) Method and system for identifying accurate version of public component library based on cross fingerprint analysis
CN115066674A (en) Method for evaluating source code using numeric array representation of source code elements
CN111881300A (en) Third-party library dependency-oriented knowledge graph construction method and system
CN110704308B (en) Multistage feature extraction method
US20200226232A1 (en) Method of selecting software files
CN116975881A (en) LLVM (LLVM) -based vulnerability fine-granularity positioning method
CN108563561B (en) Program implicit constraint extraction method and system
CN112199115A (en) Cross-Java byte code and source code line association method based on feature similarity matching
CN115373737B (en) Code clone detection method based on feature fusion
EP4258107A1 (en) Method and system for automated discovery of artificial intelligence and machine learning assets in an enterprise
Greenan Method-level code clone detection on transformed abstract syntax trees using sequence matching algorithms
CN114996705B (en) Cross-software vulnerability detection method and system based on vulnerability type and Bi-LSTM
CN115373982A (en) Test report analysis method, device, equipment and medium based on artificial intelligence
Bangare et al. Code parser for object Oriented software Modularization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant