CN111176993A

CN111176993A - Code static detection method based on abstract syntax tree

Info

Publication number: CN111176993A
Application number: CN201911349236.5A
Authority: CN
Inventors: 陆茜茜; 张楚一; 胡岩峰; 刘亮; 岳才杰; 陶家顺; 周梓泽; 高玮徽; 许浩
Original assignee: Suzhou Research Institute Institute Of Electronics Chinese Academy Of Sciences
Current assignee: Suzhou Research Institute Institute Of Electronics Chinese Academy Of Sciences
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2020-05-19

Abstract

The invention discloses a code static detection method based on an abstract syntax tree. The method comprises: constructing rule extension templates for different language environments; constructing a rule base from the base class rules of the rule extension templates; registering rules through a registration list mechanism; embedding the rule base in an open source platform; and scanning and analyzing code based on the abstract syntax tree. The invention supports customized scanning rules for mainstream development languages and can be quickly integrated into open source software. It scans code automatically to find unsafe, ambiguous, and unclear code in a program, which reduces defects and problems in the development and design of software or systems and helps ensure software quality.

Description

Code static detection method based on abstract syntax tree

Technical Field

The invention relates to a software development and test technology, in particular to a code static detection method based on an abstract syntax tree.

Background

As software design technology advances and software and systems grow in scale, software products become steadily more complex and software security problems become increasingly apparent, so software testing is essential to ensuring software quality. Static code testing currently relies mainly on a defect-oriented approach: defects are first summarized at the code level of the software and abstracted into corresponding defect patterns, the values of the relevant expressions in the program are then approximately computed, and the computed results are finally applied to defect detection. Static testing scans code for problems without running it, and can effectively discover 30% to 70% of the defects in logic design and coding. In this field, companies at home and abroad have developed a series of typical code detection tools, such as Klocwork, Fortify, Checkmarx CxSuite, and CodeSecure. Klocwork is very effective at detecting defects and security vulnerabilities in C/C++ embedded software, but its detection capability for languages other than C/C++ is weak and it is inconvenient to extend. Fortify is currently the most widely used static source code detection tool in the world and supports the most languages, but it is inconvenient to use and insufficiently open. Checkmarx CxSuite scans and analyzes security vulnerabilities and weaknesses in source code by lexical analysis, but its language support is incomplete. CodeSecure finds source code with information security problems by syntax parsing and provides patching suggestions, but it also supports few languages.
In summary, although some fairly efficient static code detection tools exist, they still have two problems: (1) the tools are insufficiently open and cannot be customized and extended as needed; (2) the set of languages they can scan is incomplete, so their generality is poor.

Disclosure of Invention

The invention aims to provide a code static detection method based on an abstract syntax tree.

The technical solution for realizing the purpose of the invention is as follows: a code static detection method based on an abstract syntax tree comprises the following steps:

step 1, establishing rule extension templates for different language environments;

step 2, constructing a rule base based on the base class rules of the rule extension template;

step 3, adopting a registration list mechanism to perform rule registration;

step 4, nesting the rule base into an open source platform;

step 5, scanning code based on the abstract syntax tree to obtain the erroneous code that does not conform to the constructed rules.

Compared with the prior art, the invention has the following notable advantages: 1) the criteria for static code detection can be customized, adapting to the code quality scanning requirements of scenarios involving different languages; 2) the coupling between the static scanning rules and the scanning tool is reduced, so that the rules can be integrated into an open source tool as a plug-in; 3) the degree of automation is high and the range of application is wide, since detection requires no manual intervention or development environment support and results are obtained quickly.

Drawings

FIG. 1 is a diagram of a static detection architecture for code based on abstract syntax trees.

FIG. 2 is a rule customization extension template development flow diagram.

FIG. 3 is a rule customization development flow diagram.

FIG. 4 is a schematic diagram of a customized rule registration.

FIG. 5 is a flowchart of code scanning based on an abstract syntax tree.

Detailed Description

The invention is further illustrated by the following examples in conjunction with the accompanying drawings.

The method supports customized scanning rules for all current mainstream development languages and can be quickly integrated into open source software. It scans code automatically to find unsafe, ambiguous, and unclear code in a program, which reduces defects and problems in the development and design of software or systems and helps ensure software quality. The method comprises the following steps:

Step 1: building rule extension templates for different language environments to form a logic pipeline based on the abstract syntax tree. The process of building a rule extension template is shown in FIG. 2.

Step 1.1: setting up the runtime environment for each language and importing its dependency packages;

Step 1.2: customizing the base classes for each programming language;

Step 1.3: describing the custom rules in a standardized way and constructing a description extension interface;

Taking the Java programming language as an example, a rule is divided into a left part and a right part. The left part of a custom rule is described in a standardized way according to the Java programming language base classes, as shown in Table 1:

Table 1. Java rule left part specification description table

The right part of the custom rule is described as an extension, as shown in Table 2:

Table 2. Java rule right part specification description table

Step 1.4: creating unit test cases for the custom rules and compiling with the build command to ensure that the test cases run normally and that a complete runtime environment has been set up. Alternatively, the project can be compiled into a jar package and imported into a third-party tool, and the third-party integration tool is run to test whether the jar package loads successfully;

Step 1.5: releasing the base class interfaces and the specification.
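
As a rough illustration of steps 1.2 to 1.4, the following Python sketch defines a base class with a standardized description and an extension hook, then exercises one custom rule with a small unit test. Python's standard `ast` module stands in for the language environment here; the names `BaseRule`, `describe`, `check`, and `NoPrintRule` are hypothetical and do not come from the patent.

```python
import ast
import unittest


class BaseRule:
    """Hypothetical base class that custom rules extend (step 1.2)."""

    key = "base-rule"          # assumed identifier used to look the rule up
    message = "rule violated"  # assumed standardized description text

    def describe(self):
        # Standardized description of the rule (step 1.3).
        return {"key": self.key, "message": self.message}

    def check(self, tree):
        # Extension interface: subclasses return offending line numbers.
        raise NotImplementedError


class NoPrintRule(BaseRule):
    """Example custom rule: flag calls to print()."""

    key = "no-print"
    message = "print() call found"

    def check(self, tree):
        return [
            node.lineno
            for node in ast.walk(tree)
            if isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ]


class NoPrintRuleTest(unittest.TestCase):
    """Unit test case created for the custom rule (step 1.4)."""

    def test_flags_print(self):
        tree = ast.parse("x = 1\nprint(x)\n")
        self.assertEqual(NoPrintRule().check(tree), [2])

    def test_clean_code_passes(self):
        tree = ast.parse("x = 1\n")
        self.assertEqual(NoPrintRule().check(tree), [])
```

Running the test case (for example with `python -m unittest`) plays the role of the build-and-verify step: a new rule is only released once its tests pass.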

Step 2: constructing the rule base from the base class rules. The rule customization development flow is shown in FIG. 3.

Step 2.1: for the five types of base classes above, constructing five types of rule parsers: interface scanClass (classTree), interface scanVariable (variableTree), interface scanMethold (metaldTree), interface scanBlock (blockTree), and interface scanExpression (expressTree).

Step 2.2: for the five rule parsers, constructing five rule parsing classes: class scanClass, class scanVariable, class scanMethold, class scanBlock, and class scanExpression.

Step 2.3: for the five rule parsing classes, constructing five rule parsing methods: scanClass (classTree), scanVariable (variableTree), scanMethold (metaldTree), scanBlock (blockTree), and scanExpression (expressTree).
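
The five parsing methods of steps 2.1 to 2.3 can be pictured as hooks that a tree walker calls for each category of node. The sketch below is one possible reading, written in Python with the standard `ast` module rather than a Java parser. The hook names follow the text, while the `RuleParser` class, the `seen` list, and the mapping from Python node types to the five categories are assumptions made for illustration.

```python
import ast


class RuleParser(ast.NodeVisitor):
    """Routes AST nodes to five scan hooks named after steps 2.1 to 2.3."""

    def __init__(self):
        self.seen = []  # (hook name, line number) pairs, kept for illustration

    def _record(self, hook, node):
        self.seen.append((hook, node.lineno))

    # The five hooks; a concrete rule would override the ones it needs.
    def scanClass(self, tree):
        self._record("scanClass", tree)

    def scanVariable(self, tree):
        self._record("scanVariable", tree)

    def scanMethold(self, tree):
        self._record("scanMethold", tree)

    def scanBlock(self, tree):
        self._record("scanBlock", tree)

    def scanExpression(self, tree):
        self._record("scanExpression", tree)

    # Dispatch: the mapping of Python node types to hooks is an assumption.
    def visit_ClassDef(self, node):
        self.scanClass(node)
        self.generic_visit(node)

    def visit_Assign(self, node):
        self.scanVariable(node)
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        self.scanMethold(node)
        self.generic_visit(node)

    def visit_If(self, node):
        self.scanBlock(node)
        self.generic_visit(node)

    def visit_Expr(self, node):
        self.scanExpression(node)
        self.generic_visit(node)


source = (
    "class A:\n"
    "    limit = 3\n"
    "    def run(self, n):\n"
    "        if n > self.limit:\n"
    "            print(n)\n"
)
parser = RuleParser()
parser.visit(ast.parse(source))
for hook, line in parser.seen:
    print(f"line {line}: {hook}")
```

Each source line in this small example lands in a different hook, so a rule that only cares about method declarations, say, would override `scanMethold` and ignore the rest.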

Step 2.4: registering the rule detection criteria, which include the rule content, rule type, rule scope, rule key attribute, and rule level, for example as follows:

Table 3. Rule detection criteria registration table

Step 2.5: constructing a rule stage detection queue by dividing a whole section of code into six stages: member initialization, class construction, constructor, callback function, exception throwing, and result feedback.

Step 2.6: constructing a component detection queue by further dividing each stage obtained from the analysis into component rule units, each composed of parameters and annotations, so that the code of a given stage is represented as smaller components.

Step 2.7: constructing a unit detection queue by further dividing the components into rule units such as independent individuals, relations, and member variables, and generating the corresponding rule class instances. In practice, each rule class is assigned at generation time to a queue according to the analysis stage it belongs to, so that during analysis the parser only needs to call the rules for the current stage.
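
A minimal sketch of the stage-based queues described in steps 2.5 and 2.7 follows, assuming each rule is a simple (name, predicate) pair over a code fragment. The stage identifiers paraphrase the six stages in the text; `StageQueues`, `add`, and `run` are hypothetical names, and real rules would inspect syntax tree nodes rather than raw strings.

```python
from collections import defaultdict

# Stage identifiers paraphrasing the six stages of step 2.5.
STAGES = (
    "member_initialization",
    "class_construction",
    "constructor",
    "callback_function",
    "exception_throwing",
    "result_feedback",
)


class StageQueues:
    """Keeps one rule queue per stage so only relevant rules are called."""

    def __init__(self):
        self._queues = defaultdict(list)

    def add(self, stage, rule):
        # A rule is placed in the queue of the stage it belongs to.
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self._queues[stage].append(rule)

    def run(self, stage, fragment):
        # Only the rules queued for this stage are evaluated.
        return [name for name, predicate in self._queues[stage] if predicate(fragment)]


queues = StageQueues()
queues.add("exception_throwing", ("bare-except", lambda code: "except:" in code))
queues.add("constructor", ("empty-init", lambda code: "pass" in code))

fragment = "try:\n    risky()\nexcept:\n    recover()\n"
hits = queues.run("exception_throwing", fragment)
print(hits)
```

Because the `empty-init` rule sits in the constructor queue, it is never evaluated while the exception-throwing stage is being analyzed.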

Step 2.8: constructing a complex unit detection queue. When the unit queues obtained by dividing a component have contextual associations, the associated rules are merged to generate a single rule class instance. This step uses the singleton design pattern: the instance creation function is called multiple times without creating multiple instances, and the relevant parameters are added to the instance created by the first call, which makes it convenient to check a group of rules in one pass.
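
The singleton behaviour in step 2.8 might look like the following Python sketch, in which repeated calls to a creation function return the same object and merge their arguments into it. `CompositeRule`, `create`, and the two string-based example rules are hypothetical and serve only to show the pattern.

```python
class CompositeRule:
    """Singleton that merges associated rules into one instance."""

    _instance = None

    def __init__(self):
        self.rules = []

    @classmethod
    def create(cls, *rules):
        # The first call creates the instance; later calls reuse it
        # and add their rules to the instance created the first time.
        if cls._instance is None:
            cls._instance = cls()
        cls._instance.rules.extend(rules)
        return cls._instance

    def check(self, source):
        # One pass evaluates the whole group of merged rules.
        return [name for name, predicate in self.rules if predicate(source)]


first = CompositeRule.create(("open-call", lambda s: "open(" in s))
second = CompositeRule.create(("close-call-missing", lambda s: ".close()" not in s))

findings = second.check("f = open('data.txt')\n")
print(first is second, findings)
```

Both calls return the same object, so checking it once covers the two associated rules together.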

Step 2.9: because rule units cannot be used directly to execute code detection, the rule class instances generated in steps 2.7 and 2.8 are converted into rule files in XML format, and error detection comments are added to the XML rule files to construct the rule base.
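
The conversion in step 2.9 can be sketched with Python's standard `xml.etree.ElementTree` module. The patent does not give the XML schema, so the element names (`rule`, `stage`, `level`, `errorComment`) and the `RuleInstance` fields below are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class RuleInstance:
    """Hypothetical in-memory rule produced by the earlier steps."""

    key: str
    stage: str
    level: str
    error_comment: str


def rule_to_xml(rule):
    # Serialize one rule instance, including its error detection comment.
    root = ET.Element("rule", attrib={"key": rule.key})
    ET.SubElement(root, "stage").text = rule.stage
    ET.SubElement(root, "level").text = rule.level
    ET.SubElement(root, "errorComment").text = rule.error_comment
    return ET.tostring(root, encoding="unicode")


xml_text = rule_to_xml(
    RuleInstance("no-print", "callback_function", "minor", "print() call found")
)
print(xml_text)
```

A rule base in this reading is simply a collection of such files, one per rule or merged rule group.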

Step 2.10: encapsulating each rule as a whole to constrain the rule structure and the rule file.

To describe the rules in the rule base accurately, each rule is encapsulated as a whole according to an encapsulation paradigm, with the information that satisfies the rule registration criteria used as the rule input. The encapsulation paradigm is shown in Table 3:

Table 3. Encapsulation paradigm table

Query key | Attribute group (structure) | Description information (structure) | Rule file name

The description information comprises the scope, key attribute, and detection level. The attribute group comprises the stage name, component name, unit name, complex unit association list, and so on. The query key corresponds to the index of a rule file, and the rule file index comprises the file's error annotation information and keyword information.
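
One way to picture the encapsulation paradigm is as a record with the four columns of the table, where the two structured columns are nested records. The Python sketch below uses dataclasses for this; the field names paraphrase the text, and the example values and the `rule_base` dictionary are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeGroup:
    stage_name: str
    component_name: str
    unit_name: str
    complex_unit_links: list = field(default_factory=list)


@dataclass
class DescriptionInfo:
    scope: str
    key_attribute: str
    detection_level: str


@dataclass
class RulePackage:
    """One encapsulated rule: the four columns of the paradigm table."""

    query_key: str
    attribute_group: AttributeGroup
    description: DescriptionInfo
    rule_file_name: str


rule_base = {}  # assumed lookup structure: query key -> encapsulated rule


def add_to_rule_base(package):
    rule_base[package.query_key] = package


add_to_rule_base(
    RulePackage(
        query_key="no-print",
        attribute_group=AttributeGroup("callback_function", "body", "call"),
        description=DescriptionInfo("method", "function name", "minor"),
        rule_file_name="no_print.xml",
    )
)
print(rule_base["no-print"].rule_file_name)
```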

Step 3: registering rules using a registration list mechanism. A schematic diagram of custom rule registration is shown in FIG. 4.

Step 3.1: connecting to the rule registration interface;

Step 3.2: calling the rule registration method;

Step 3.3: configuring the dependency package file path;

Step 3.4: passing in the path of the rule;

Step 3.5: passing in the custom rule object;

Step 3.6: adding the rule to the registration list;

Step 3.7: extracting the keyword annotation list of the rule file and constructing the Issues list;

Step 3.8: generating an example page file.
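
The registration steps above can be sketched as a small registry object. This is a simplified reading, not the interface of any particular platform: `RuleRegistry`, `CustomRule`, `register`, and `issues`, as well as the example paths, are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class CustomRule:
    key: str      # keyword identifying the rule
    comment: str  # error annotation shown when the rule is violated


class RuleRegistry:
    """Minimal registration list mechanism (steps 3.1 to 3.7)."""

    def __init__(self, dependency_path):
        self.dependency_path = dependency_path  # step 3.3
        self.registered = []                    # the registration list

    def register(self, rule_path, rule_object):
        # Steps 3.2, 3.4, 3.5, and 3.6: call the registration method with
        # the rule path and rule object, then add them to the list.
        self.registered.append((rule_path, rule_object))

    def issues(self):
        # Step 3.7: build an issues list from the registered rules'
        # keywords and annotations.
        return [
            {"key": rule.key, "comment": rule.comment}
            for _, rule in self.registered
        ]


registry = RuleRegistry(dependency_path="libs/")
registry.register("rules/no_print.xml", CustomRule("no-print", "print() call found"))
issues = registry.issues()
print(issues)
```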

Step 4: building and compiling the rules, and embedding the rule models in the open source platform;

Step 5: rule scanning and analysis. The flow of code scanning based on the abstract syntax tree is shown in FIG. 5.

Step 5.1: scanning the code, extracting the agreed type as the key node (root), and marking it as the start symbol;

Step 5.2: recording each leaf node and labeling each inner node with a non-terminal symbol. If A is the non-terminal label of an inner node and X1, X2, …, Xn are the labels of all its child nodes from left to right, then A → X1 X2 … Xn is a production, where each of X1, X2, …, Xn is a terminal or non-terminal symbol;
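
To make the relationship between tree nodes and productions in step 5.2 concrete, the Python sketch below walks a small hand-built tree and emits one production A → X1 X2 … Xn for every inner node. The `Node` class, the `productions` function, and the example labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def productions(node):
    # A leaf node carries a terminal symbol and yields no production.
    if not node.children:
        return []
    # An inner node labeled A with children X1..Xn yields A -> X1 ... Xn.
    right_side = " ".join(child.label for child in node.children)
    rules = [f"{node.label} → {right_side}"]
    for child in node.children:
        rules.extend(productions(child))
    return rules


# Tree for the statement "id = num + num", rooted at the start symbol Stmt.
tree = Node(
    "Stmt",
    [Node("id"), Node("="), Node("Expr", [Node("num"), Node("+"), Node("num")])],
)
for rule in productions(tree):
    print(rule)
# Stmt → id = Expr
# Expr → num + num
```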

step 5.5: and (4) lexical analysis. The source program is decomposed into word-symbol strings to form a symbol table for syntactic analysis. And compiling a lexical analysis program by using a lexical analysis tool, generating a source program file after compiling, and decomposing the source program file into symbol strings by executing the lexical analysis program.

Step 5.6: syntax analysis. A syntax analysis program is written with a syntax analysis tool and compiled to generate a source program file. The syntax analysis program takes the stream of word symbols as input, judges whether the symbol string can be generated by the grammar, and generates a syntax tree.
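
The two outcomes of syntax analysis, a tree when the input belongs to the grammar and a rejection when it does not, can be shown with Python's built-in parser in place of a generated one. This substitutes Python's own grammar for the target language's grammar; `parse_or_none` is a hypothetical helper.

```python
import ast


def parse_or_none(source):
    """Return a syntax tree if the grammar generates `source`, else None."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None


tree = parse_or_none("total = price * 2")
top_level = [type(node).__name__ for node in tree.body]
rejected = parse_or_none("total = = 2")
print(top_level, rejected)
```

The first input parses to a module containing a single assignment node, while the second is not derivable from the grammar and is rejected.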

Step 5.7: semantic analysis. By processing names and operators, the syntax tree is converted into a standard form that includes objects and a symbol table representing type information, forming a complete syntax tree that contains all the information about the program structure.
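
A symbol table carrying type information, as mentioned in step 5.7, might be built as in the sketch below. It uses Python's `ast` module and reads types only from explicit annotations, which is a much weaker analysis than a real compiler front end performs; `build_symbol_table` is a hypothetical helper, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def build_symbol_table(tree):
    """Map each assigned name to its declared type, or 'unknown'."""
    table = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            # Annotated assignment: record the declared type as text.
            table[node.target.id] = ast.unparse(node.annotation)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    # Plain assignment: no type information is available.
                    table.setdefault(target.id, "unknown")
    return table


source = "count: int = 0\nname = 'x'\n"
table = build_symbol_table(ast.parse(source))
print(table)
```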

Step 5.8: fault detection. Apart from the root node, the nodes of the syntax tree are of two types: inner nodes and leaf nodes. An inner node has both a parent node and child nodes, whereas a leaf node has only a parent node. Each type of node is represented by a corresponding symbol, different data types define corresponding attributes and operations, and the rule action, generated result, expected output, and external interface are specified. Static detection of software faults is completed by traversing the program syntax tree and executing the built-in function actions.
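
Fault detection by tree traversal can be illustrated with a single rule applied while walking a Python syntax tree. The rule chosen here, flagging a bare `except:` clause, is only an example and is not one of the patent's rules; `BareExceptRule`, `faults`, and `is_leaf` are hypothetical names.

```python
import ast


def is_leaf(node):
    # A leaf node has a parent but no child nodes.
    return not list(ast.iter_child_nodes(node))


class BareExceptRule(ast.NodeVisitor):
    """Rule action: report `except:` clauses that name no exception type."""

    def __init__(self):
        self.faults = []  # (line number, message) pairs

    def visit_ExceptHandler(self, node):
        if node.type is None:
            self.faults.append((node.lineno, "bare except hides errors"))
        self.generic_visit(node)  # keep traversing the handler's children


source = (
    "try:\n"
    "    value = int('x')\n"
    "except:\n"
    "    value = 0\n"
)
tree = ast.parse(source)
rule = BareExceptRule()
rule.visit(tree)
print(rule.faults)
```

The traversal visits every node once, and the rule's action fires only on the node type it is written for, which is how a larger rule base would be applied in a single pass over the tree.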

The invention is compatible with third-party tools, uses the abstract syntax tree to complete static detection for a concrete project, and produces test feedback results. To make it easy to keep adding and improving rules, the invention provides an open extension interface and a base class for each type of rule in the source program, so that rules can be customized and extended by overriding methods of the base class. The invention also provides one-click deployment, which makes it easy to connect third-party tools, build a continuous integration and delivery tool chain, and achieve software quality inspection and rapid iteration.

Claims

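1. A code static detection method based on an abstract syntax tree, characterized by comprising the following steps:

step 1, establishing rule extension templates for different language environments;

step 2, constructing a rule base based on the base class rules of the rule extension template;

step 3, adopting a registration list mechanism to perform rule registration;

step 4, nesting the rule base into an open source platform;

step 5, scanning code based on the abstract syntax tree to obtain the erroneous code that does not conform to the constructed rules.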

2. The code static detection method based on an abstract syntax tree as claimed in claim 1, wherein in step 1 the rule extension template is constructed as follows:

step 1.2: customizing the base classes for each programming language;

step 1.4: creating unit test cases for the custom rules and compiling with the build command to ensure that the test cases run normally and that a complete runtime environment has been set up, or compiling the project into a jar package, importing the jar package into a third-party tool, and running the third-party integration tool to test whether the jar package loads successfully;

step 1.5: releasing the base class interfaces and the specification.

3. The code static detection method based on an abstract syntax tree as claimed in claim 2, wherein in step 1.3, for the Java programming language, a rule is divided into a left part and a right part, and the left part of the custom rule is described in a standardized way according to the Java programming language base classes, as shown in Table 1:

Table 1. Java rule left part specification description table

Table 2. Java rule right part specification description table

4. The code static detection method based on an abstract syntax tree as claimed in claim 1, wherein in step 2 the rule base is constructed as follows:

step 2.1: for the above five types of base classes, constructing five types of rule parsers: interface scanClass (classTree), interface scanVariable (variableTree), interface scanMethold (metaldTree), interface scanBlock (blockTree), and interface scanExpression (expressTree);

step 2.2: for the five rule parsers, constructing five rule parsing classes: class scanClass, class scanVariable, class scanMethold, class scanBlock, and class scanExpression;

step 2.3: for the five rule parsing classes, constructing five rule parsing methods: scanClass (classTree), scanVariable (variableTree), scanMethold (metaldTree), scanBlock (blockTree), and scanExpression (expressTree);

step 2.4: registering the rule detection criteria, which comprise the rule content, rule type, rule scope, rule key attribute, and rule level;

step 2.5: constructing a rule stage detection queue by dividing a whole section of code into six stages: member initialization, class construction, constructor, callback function, exception throwing, and result feedback;

step 2.6: constructing a component detection queue by further dividing each stage obtained from the analysis into component rule units composed of parameters and annotations;

step 2.7: constructing a unit detection queue by further dividing the components into rule units such as independent individuals, relations, and member variables, and generating the corresponding rule class instances;

step 2.8: constructing a complex unit detection queue, and merging the associated rules to generate a single rule class instance when the unit queues obtained by dividing a component have contextual associations;

step 2.9: converting the rule class instances generated in steps 2.7 and 2.8 into rule files in XML format, adding error detection comments to the XML rule files, and constructing the rule base;

5. The method according to claim 4, wherein in step 2.10 each rule is encapsulated as a whole according to an encapsulation paradigm, with the information that satisfies the rule registration criteria used as the rule input, the encapsulation paradigm being shown in Table 3:

Table 3. Encapsulation paradigm table

Query key | Attribute group (structure) | Description information (structure) | Rule file name

The description information comprises the scope, key attribute, and detection level; the attribute group comprises the stage name, component name, unit name, and complex unit association list; the query key corresponds to the index of a rule file, and the rule file index comprises the file's error annotation information and keyword information.

6. The code static detection method based on an abstract syntax tree as claimed in claim 1, wherein in step 3 rule registration is performed as follows:

step 3.1: connecting to the rule registration interface;

step 3.2: calling the rule registration method;

step 3.3: configuring the dependency package file path;

step 3.4: passing in the path of the rule;

step 3.5: passing in the custom rule object;

step 3.6: adding the rule to the registration list;

step 3.8: generating an example page file.

7. The code static detection method based on an abstract syntax tree as claimed in claim 1, wherein in step 5 the code is scanned and analyzed as follows:

step 5.1: scanning the code, extracting the agreed type as the key node, and marking the key node as the start symbol;

step 5.2: recording each leaf node and labeling each inner node with a non-terminal symbol;

step 5.3: performing lexical analysis, syntax analysis, semantic analysis, and fault detection.