CN115906086A

CN115906086A - Method, system and storage medium for detecting webpage backdoor based on code attribute graph

Info

Publication number: CN115906086A
Application number: CN202310153428.9A
Authority: CN
Inventors: 陈远超; 潘祖烈; 陈燏; 赵军; 沈毅; 严尹彤
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-02-23
Filing date: 2023-02-23
Publication date: 2023-04-04

Abstract

The invention provides a method, a system and a storage medium for detecting webpage backdoors (Webshells) based on a code attribute graph, and belongs to the technical field of network security. The detection method comprises: establishing an abstract syntax tree from a test sample, and then establishing a control flow graph, a program dependence graph and a function call graph from the abstract syntax tree; extracting the syntactic attributes from the abstract syntax tree, the control flow from the control flow graph, the data flow from the program dependence graph and the inter-procedural relationships from the function call graph, combining them to construct a code attribute graph, and storing the code attribute graph in a code attribute graph database; and finally searching for a taint path by reverse traversal of the code attribute graph database to perform webpage backdoor detection. By extracting syntactic attributes, control flow, data flow and inter-procedural relationships through static source code analysis, the method makes effective use of the data set samples and improves data utilization.

Description

Web page backdoor detection method, system and storage medium based on code attribute graph

Technical field

The invention belongs to the technical field of network security, and in particular relates to a webpage backdoor detection method, system and storage medium based on a code attribute graph.

Background

With the rapid development of the Internet and the spread of web applications, attacks on web services have become increasingly frequent, and implanting a Webshell into a target site has become one of the attack methods most commonly used by attackers. An attacker uses the Webshell to obtain a command execution environment on the system and thereby gain further control of the web server, for purposes such as information sniffing, data theft or tampering. To counter attacks on web services, research on Webshell detection has become an important task.

Webshell detection techniques currently fall into three categories: dynamic feature detection, static feature detection, and detection based on machine learning. Dynamic feature detection relies mainly on features such as file behavior and Webshell communication traffic, and can only be performed while the Webshell is executing. Static feature detection analyzes the text content of the Webshell or the system's log information, and current research focuses mainly on the Webshell's file content. The earliest widely used approach detected Webshells with regular expressions. However, because the regular expressions are derived from existing Webshells, they must be updated continuously and cannot detect Webshells that have not been seen before; and as Webshell obfuscation and encryption techniques mature, regular-expression detection is easily evaded. Applying machine learning to Webshell detection has achieved good results; in machine-learning-based detection, the choice of Webshell features plays the decisive role.

In 2016, Van-Giap Le et al. proposed detecting XSS vulnerabilities, SQL injection vulnerabilities and Webshells based on taint analysis. Taint analysis, also called data flow tracking, looks for a data flow from a user-controllable input to a sensitive function; if such a flow exists, the file can be confirmed as a Webshell. This technique works well on simple PHP files that are not encrypted or obfuscated, but because of the flexibility of the PHP language, an attacker who uses transformation techniques such as obfuscation and encryption can easily bypass this kind of detection system.

In 2018, Yong Fang et al. proposed building a training model for Webshell detection by combining the random forest algorithm with FastText. They extracted statistical features such as the longest string, information entropy and index of coincidence, together with signature-function and blacklist-keyword features, and were also the first to propose extracting PHP opcode sequences, which are processed by FastText and fed into the random forest algorithm along with the above features for model training.
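For readers unfamiliar with the statistical features named above, the following Python sketch shows how the longest string, information entropy and index of coincidence of a file's text could be computed. It is only an illustration under common textbook definitions; the function names are hypothetical, and the exact feature definitions used in the cited work may differ.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def index_of_coincidence(text: str) -> float:
    """Probability that two characters drawn without replacement are equal."""
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def longest_token(text: str) -> int:
    """Length of the longest whitespace-delimited token."""
    return max((len(tok) for tok in text.split()), default=0)
```

Such features describe the file as a whole, which is why heavily obfuscated but benign code can look statistically similar to an encoded Webshell.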

Webshell detection methods in the prior art mainly rely on machine learning: they extract statistical features of the Webshell and combine them with word importance or word frequency features of the text to build a model. Statistical features can only characterize a file as a whole. With the rapid development of web services, developers often obfuscate and encrypt their source code to prevent leaks, so methods based on statistical features produce a large number of false positives and cannot perform effective detection. Deriving other Webshell features requires a large number of Webshell samples, and the extracted features often depend on the sample set used, so the final detection model tends to overfit.

Summary of the invention

To solve the above problems and improve the accuracy of Webshell detection, the present invention extracts the syntactic attributes, control flow, data flow and inter-procedural relationships from Webshell source code, combines them to construct a code attribute graph, stores the code attribute graph in a graph database, and detects Webshells with a graph search method. This effectively improves the accuracy of Webshell detection and reduces false positives.

The first aspect of the present invention discloses a webpage backdoor detection method based on a code attribute graph, the method comprising:

Step S1: establishing an abstract syntax tree from a test sample, and then establishing a control flow graph, a program dependence graph and a function call graph from the abstract syntax tree;

Step S2: extracting the syntactic attributes in the abstract syntax tree as nodes, extracting the control flow from the control flow graph, the data flow from the program dependence graph and the inter-procedural relationships from the function call graph, combining them to construct a code attribute graph, and storing the code attribute graph in a code attribute graph database;

Step S3: performing webpage backdoor detection by searching for a taint path through reverse traversal of the code attribute graph database.

According to the method of the first aspect of the present invention, step S3 specifically comprises: searching for and locating the sink function nodes of webpage backdoor type in the code attribute graph, and, starting from the sink function nodes, searching for the superglobal variables through which a Web application obtains user input data.

According to the method of the first aspect of the present invention, the sink function nodes are sensitive functions that interact with the database, the file system and the system terminal.

According to the method of the first aspect of the present invention, step S3 further comprises: the code attribute graph contains syntactic attributes and function call relationships; based on these two types of attributes, the sink function nodes and the variable propagation relationships are traced backward by reverse traversal until a source node is found.

According to the method of the first aspect of the present invention, a taint path is searched for in the code attribute graph, and a sample whose taint path contains no special-character escaping is judged to be a Webshell.

According to the method of the first aspect of the present invention, step S2 specifically comprises: taking the syntactic attributes extracted from the abstract syntax tree as nodes, and building the code attribute graph in the open-source graph database Neo4j from the control flow information in the control flow graph, the data flow information in the program dependence graph and the inter-procedural relationships in the function call graph.

The second aspect of the present invention discloses a webpage backdoor detection system based on a code attribute graph, comprising:

a first module, configured to establish an abstract syntax tree from a test sample, and then establish a control flow graph, a program dependence graph and a function call graph from the abstract syntax tree;

a second module, configured to extract the syntactic attributes in the abstract syntax tree as nodes, extract the control flow from the control flow graph, the data flow from the program dependence graph and the inter-procedural relationships from the function call graph, combine them to construct a code attribute graph, and store the code attribute graph in a code attribute graph database;

a third module, configured to perform webpage backdoor detection by searching for a taint path through reverse traversal of the code attribute graph database.

The third aspect of the present invention discloses a computer storage medium. The computer storage medium stores computer program instructions which, when executed, implement the aforementioned webpage backdoor detection method.

The fourth aspect of the present invention discloses a webpage backdoor detection system based on a code attribute graph, which comprises the storage medium and is used to execute the webpage backdoor detection method based on a code attribute graph.

The webpage backdoor detection scheme based on a code attribute graph provided by the present invention mainly achieves the following effects:

(1) The scheme makes effective use of the data set samples: by extracting syntactic attributes, control flow, data flow and inter-procedural relationships through static source code analysis, it improves data utilization;

(2) The scheme effectively improves the accuracy of Webshell detection: detecting Webshells on the code attribute graph by querying the path from data input to data output better distinguishes normal files from Webshells and reduces false positives;

(3) The scheme improves the efficiency of Webshell detection: graph theory offers many efficient graph search algorithms, so converting static Webshell analysis into a code attribute graph allows those algorithms to be used for efficient detection.

Brief description of the drawings

To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a webpage backdoor detection method based on a code attribute graph according to an embodiment of the present invention;

Fig. 2 shows the example Webshell used with the webpage backdoor detection method based on a code attribute graph according to the present invention;

Fig. 3 is the AST tree diagram for the webpage backdoor detection method based on a code attribute graph according to the present invention;

Fig. 4 is the control flow graph for the webpage backdoor detection method based on a code attribute graph according to an embodiment of the present invention;

Fig. 5 is the program dependence graph of the Webshell example for the webpage backdoor detection method based on a code attribute graph according to an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

This embodiment provides a webpage backdoor detection method based on a code attribute graph which, as shown in Fig. 1, comprises:

Step S1: establishing an abstract syntax tree from a test sample, and then extracting a control flow graph, a program dependence graph and a function call graph from the abstract syntax tree;

Step S2: extracting the syntactic attributes in the abstract syntax tree as nodes, extracting the control flow from the control flow graph, the data flow from the program dependence graph and the inter-procedural relationships from the function call graph, combining them to construct a code attribute graph, and storing the code attribute graph in a code attribute graph database;

Step S3: performing webpage backdoor (Webshell) detection by searching for a taint path through reverse traversal of the graph.

The abstract syntax tree in step S1 represents the syntactic structure of the program source code in tree form; the control flow comes from the control flow graph, the data flow from the program dependence graph, and the inter-procedural relationships from the function call graph.

The following takes a piece of Webshell program code as an example to show how the webpage backdoor detection method based on a code attribute graph is implemented; see Figs. 2 to 5, where Fig. 2 shows the example Webshell program code used by the method in this embodiment.

As shown in Fig. 3, each node of the abstract syntax tree represents a construct in the source code. In this embodiment the nodes of the abstract syntax tree fall into two classes: internal nodes represent operators, such as assignments or function calls, and leaf nodes are operands, such as constants or identifiers.

Fig. 3 is the graphical representation of the abstract syntax tree. An abstract syntax tree is well suited to modeling vulnerabilities through the syntactic constructs of a function or source file, because it makes explicit how programming constructs are nested. However, it contains no semantic information such as the application's control flow or data flow. The abstract syntax tree therefore cannot be used to analyze attacker-controlled data flow in a program, and additional structures are required.
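The two node classes can be illustrated with a small Python sketch. The `AstNode` type and the modeled statement `$pass = $_POST['pass'];` are assumptions made for illustration; the actual example code of Fig. 2 and the parser used by the patent are not reproduced in this text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AstNode:
    kind: str                      # e.g. "assign", "index", "variable", "string"
    value: str = ""                # identifier or literal text, used by leaves
    children: List["AstNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: AstNode) -> List[AstNode]:
    """Collect the operand (leaf) nodes of a subtree, left to right."""
    if node.is_leaf():
        return [node]
    found: List[AstNode] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


# Hypothetical statement: $pass = $_POST['pass'];
tree = AstNode("assign", children=[
    AstNode("variable", "$pass"),
    AstNode("index", children=[
        AstNode("variable", "$_POST"),
        AstNode("string", "pass"),
    ]),
])
```

Here the internal nodes (`assign`, `index`) are operators and the leaves are identifiers and constants. The tree records nesting only; nothing in it says which statement runs next or where a value flows, which is the gap the control flow graph and program dependence graph fill.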

In this embodiment, the control flow graph (CFG) constructed from the abstract syntax tree is an abstract representation of a procedure or program: an abstract data structure used in compilers and maintained internally by the compiler. It represents all paths that may be traversed during execution of a program, showing in graph form the possible flow of execution among all basic blocks in a procedure, and it also reflects the procedure's real-time execution. The control flow graph of a function contains a designated entry node, a designated exit node, and a node for each statement and predicate in the function; the nodes are connected by labeled directed edges that indicate control flow. The label ε denotes an unconditional control flow edge, and edges leaving a predicate are labeled true or false to indicate under which outcome the predicate transfers control to the destination node.

Fig. 4 shows the control flow graph of the Webshell example. Most of the example code is unconditional control flow. The predicate node "if(md5($pass)== 'e10adc3949ba59abbe56e057f20f883e')" has two outgoing edges. The statement tests whether the value of md5($pass) equals the hexadecimal string 'e10adc3949ba59abbe56e057f20f883e'; depending on whether the predicate evaluates to true or false, control transfers either to the first statement of the if body or to the end of the function.
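A minimal Python sketch of such a labeled control flow graph follows. The statement identifiers `s1` to `s3` stand for a hypothetical three-statement Webshell (an assignment to `$pass`, the `md5` predicate, and a sensitive call in the `if` body); they are assumptions, since the code of Fig. 2 is not reproduced here.

```python
from collections import defaultdict

EPSILON = "ε"  # label of an unconditional control flow edge


class ControlFlowGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(successor, label)]

    def add_edge(self, src, dst, label=EPSILON):
        self.edges[src].append((dst, label))

    def successors(self, node):
        return list(self.edges.get(node, []))

    def paths(self, start, end, _seen=()):
        """Enumerate all acyclic paths from start to end by depth-first search."""
        if start == end:
            return [[end]]
        found = []
        for nxt, _label in self.edges.get(start, []):
            if nxt in _seen:
                continue  # skip nodes already on the current path
            for tail in self.paths(nxt, end, _seen + (start,)):
                found.append([start] + tail)
        return found


cfg = ControlFlowGraph()
cfg.add_edge("ENTRY", "s1")            # s1: $pass = $_POST['pass'];
cfg.add_edge("s1", "s2")               # s2: if (md5($pass) == '...')
cfg.add_edge("s2", "s3", "true")       # s3: first statement of the if body
cfg.add_edge("s2", "EXIT", "false")    # predicate false: function ends
cfg.add_edge("s3", "EXIT")
```

Enumerating `cfg.paths("ENTRY", "EXIT")` yields the two executions that the predicate distinguishes, one through the `if` body and one that skips it.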

A program dependence graph (PDG) is used for program slicing and describes the dependencies between statements and predicates. The nodes of the program dependence graph are the statements and predicates of a function, and its edges are of two types: data dependence edges and control dependence edges. A data dependence edge indicates that the target statement uses a variable defined in the source statement; a control dependence edge indicates that the execution of a statement depends on a predicate.

Fig. 5 shows the program dependence graph of the Webshell example in this embodiment, where D denotes a data dependence edge and C denotes a control dependence edge; that is, the graph captures the data propagation of the variable $pass and the control dependence on the predicate node "if(md5($pass)== 'e10adc3949ba59abbe56e057f20f883e')".
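The two edge types can be derived as in the Python sketch below. It is a simplification that assumes straight-line code with at most one level of guarding predicate, and the three statements are the same hypothetical ones used for illustration, not the actual code of Fig. 2.

```python
# Each statement: (id, variables defined, variables used, guarding predicate id or None)
statements = [
    ("s1", {"$pass"}, {"$_POST"}, None),   # $pass = $_POST['pass'];
    ("s2", set(), {"$pass"}, None),        # if (md5($pass) == '...')
    ("s3", set(), {"$_POST"}, "s2"),       # eval($_POST['cmd']); inside the if body
]


def build_pdg(stmts):
    """Return PDG edges as (source, target, kind, variable) tuples.

    kind "D" is a data dependence edge (target uses a variable defined at source),
    kind "C" is a control dependence edge (target executes only if source allows).
    """
    last_def = {}   # variable -> id of the statement that most recently defined it
    edges = []
    for sid, defs, uses, guard in stmts:
        for var in sorted(uses):
            if var in last_def:
                edges.append((last_def[var], sid, "D", var))
        if guard is not None:
            edges.append((guard, sid, "C", ""))
        for var in defs:
            last_def[var] = sid
    return edges


pdg_edges = build_pdg(statements)
```

For this input the sketch produces one D edge carrying `$pass` from `s1` to the predicate `s2`, and one C edge from `s2` to `s3`, which matches the two kinds of dependence described for Fig. 5.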

Control flow graphs and program dependence graphs are both defined at the function level and support only intra-procedural analysis. A function call graph (CG) is usually a directed acyclic graph whose nodes are functions and whose edges are the call relationships between them; with the function call graph, inter-procedural control flow and data flow analysis becomes possible.

In this embodiment, step S2 takes the syntactic attributes extracted from the abstract syntax tree as nodes and builds the code attribute graph in the open-source graph database Neo4j from the control flow information in the control flow graph, the data flow information in the program dependence graph and the inter-procedural relationships in the function call graph. Code analysis problems are thereby converted into graph search and traversal problems.
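The idea of merging the separate graphs over a shared node set can be sketched in Python with an in-memory stand-in for the graph database. The class name, the edge type labels (`CFG`, `PDG_DATA`, `PDG_CONTROL`) and the three statements are all hypothetical; the patent does not specify its database schema, and a real implementation would store nodes and typed relationships in the graph database instead.

```python
from collections import defaultdict


class CodeAttributeGraph:
    """In-memory stand-in for the code attribute graph database (hypothetical API)."""

    def __init__(self):
        self.nodes = {}                      # node id -> property dict
        self.out_edges = defaultdict(list)   # node id -> [(edge type, target id)]
        self.in_edges = defaultdict(list)    # node id -> [(edge type, source id)]

    def add_node(self, node_id, **props):
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, src, dst, edge_type):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[src].append((edge_type, dst))
        self.in_edges[dst].append((edge_type, src))

    def predecessors(self, node_id, edge_type):
        """Nodes with an edge of the given type pointing at node_id."""
        return [s for t, s in self.in_edges.get(node_id, []) if t == edge_type]


cag = CodeAttributeGraph()
cag.add_node("s1", kind="assign", code="$pass = $_POST['pass'];")
cag.add_node("s2", kind="predicate", code="if (md5($pass) == '...')")
cag.add_node("s3", kind="call", code="eval($_POST['cmd']);")
for src, dst in [("s1", "s2"), ("s2", "s3")]:
    cag.add_edge(src, dst, "CFG")            # control flow from the CFG
cag.add_edge("s1", "s2", "PDG_DATA")         # data dependence from the PDG
cag.add_edge("s2", "s3", "PDG_CONTROL")      # control dependence from the PDG
```

Keeping an index of incoming edges is a deliberate choice in this sketch: the detection step walks the graph backward from sinks, so predecessor lookups need to be cheap.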

In this embodiment, the Webshell detection of step S3, based on taint path search by reverse graph traversal, operates on the generated code attribute graph. It first searches the graph for the function nodes that are Webshell-type output points (sinks), that is, sensitive functions that interact with the database, the file system and the system terminal; the sink nodes common in Webshells are listed in Table 1. Starting from the sink nodes, it then searches for input point (source) nodes, that is, the superglobal variables through which a Web application obtains user input data.

Table 1: Common sensitive functions in Webshells

| Type | Sensitive functions |
| --- | --- |
| Database interaction | mysql_query, mysqli_query, etc. |
| File system interaction | file_get_contents, readfile, file_put_contents, fopen, fwrite, readdir, etc. |
| System terminal interaction | system, shell_exec, exec, eval, assert, call_user_func, cmd_shell, etc. |

Common input points are listed in Table 2. In addition, because of the particular nature of Webshells, a large number of Webshells pass a fixed string to a sink for execution, so strings are also included in the scope of the source node search.

Table 2: Common input points

| Source | Description |
| --- | --- |
| $_GET | Contains all GET parameters, that is, a key/value representation of the parameters passed in the URL |
| $_POST | All data in the body of the HTTP message |
| $_COOKIE | The Cookie field in the HTTP message header, used to record the user's authentication information |
| $_REQUEST | Contains the data of the three request channels GET, POST and COOKIE |
| $_SERVER | Contains values related to server attributes, such as the server's IP address |
| $_FILES | Contains the information and content of uploaded files |

Because the code attribute graph contains syntactic attributes and function call relationships, these two types of key attributes are used to trace the sink nodes and the variable propagation relationships backward, following the idea of reverse traversal, until a source node is found; in other words, a taint path is searched for in the code attribute graph. If the sample contains a taint path and the path contains no function node of special-character-escaping type, the sample is judged to be a Webshell. If there is no taint path, or the taint path contains a special-character-escaping function, the sample is judged to be a normal file. Webshell detection is performed in this way.
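The decision rule above can be sketched as a backward breadth-first search in Python. This is a minimal illustration, not the patented implementation: the graph is reduced to plain data-flow edges, the sink and source sets are taken from Tables 1 and 2, and the sanitizer set (`htmlspecialchars`, `addslashes`, `escapeshellarg`, `escapeshellcmd`) is an assumption, since the text does not name the escaping functions it checks for. Fixed-string sources are omitted for brevity.

```python
from collections import deque

SINKS = {"mysql_query", "mysqli_query", "file_get_contents", "fwrite",
         "system", "shell_exec", "exec", "eval", "assert", "call_user_func"}
SOURCES = {"$_GET", "$_POST", "$_COOKIE", "$_REQUEST", "$_SERVER", "$_FILES"}
# Assumed examples of special-character-escaping functions (not listed in the source).
SANITIZERS = {"htmlspecialchars", "addslashes", "escapeshellarg", "escapeshellcmd"}


def find_taint_path(nodes, flows):
    """Search backward from each sink for a source, never crossing a sanitizer.

    nodes: dict mapping node id -> function or variable name.
    flows: list of (from_id, to_id) data-flow edges.
    Returns the node ids of one taint path in source-to-sink order, or None.
    """
    incoming = {}
    for src, dst in flows:
        incoming.setdefault(dst, []).append(src)

    for sink in (n for n, name in nodes.items() if name in SINKS):
        parent = {sink: None}          # maps a node to its successor toward the sink
        queue = deque([sink])
        while queue:
            cur = queue.popleft()
            if nodes[cur] in SOURCES:
                path = []
                while cur is not None:  # follow parents from the source to the sink
                    path.append(cur)
                    cur = parent[cur]
                return path
            for prev in incoming.get(cur, []):
                if prev in parent or nodes[prev] in SANITIZERS:
                    continue            # visited already, or escaped on this route
                parent[prev] = cur
                queue.append(prev)
    return None


def is_webshell(nodes, flows):
    """A sample is flagged when an unescaped source-to-sink path exists."""
    return find_taint_path(nodes, flows) is not None
```

Refusing to traverse sanitizer nodes, instead of checking a single found path afterward, means a sample is still flagged when one route to the sink is escaped but another is not.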

The fourth aspect of the present invention discloses a webpage backdoor detection system based on a code attribute graph, which comprises the storage medium and is used to execute the webpage backdoor detection method based on a code attribute graph.

In summary, the technical solution proposed by the present invention has the following technical effects:

(1) By extracting syntactic attributes, control flow, data flow and inter-procedural relationships through static source code analysis, it improves data utilization;

Note that the technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be considered within the scope of this specification. The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention patent. A person of ordinary skill in the art can make modifications and improvements without departing from the concept of the present application, and these all fall within its protection scope. The protection scope of this patent application is therefore defined by the appended claims.

Claims

1. A webpage backdoor detection method based on a code attribute graph, characterized by comprising the following steps:

step S1: establishing an abstract syntax tree from a test sample, and then establishing a control flow graph, a program dependence graph and a function call graph from the abstract syntax tree;

step S2: extracting the syntactic attributes in the abstract syntax tree as nodes, extracting the control flow from the control flow graph, the data flow from the program dependence graph and the inter-procedural relationships from the function call graph, combining them to construct a code attribute graph, and storing the code attribute graph in a code attribute graph database;

step S3: performing webpage backdoor detection by searching for a taint path through reverse traversal of the code attribute graph database.

2. The webpage backdoor detection method based on a code attribute graph as claimed in claim 1, wherein step S3 specifically comprises: searching for and locating the sink function nodes of webpage backdoor type in the code attribute graph, and, starting from the sink function nodes, searching for the superglobal variables through which a Web application obtains user input data.

3. The webpage backdoor detection method based on a code attribute graph as claimed in claim 2, wherein the sink function nodes are sensitive functions that interact with the database, the file system and the system terminal.

4. The webpage backdoor detection method based on a code attribute graph as claimed in claim 2, wherein step S3 further comprises: the code attribute graph contains syntactic attributes and function call relationships, and based on these two types of attributes the sink function nodes and the variable propagation relationships are traced backward by reverse traversal until a source node is found.

5. The webpage backdoor detection method based on a code attribute graph as claimed in claim 4, wherein a taint path is searched for in the code attribute graph, and a sample whose taint path contains no special-character escaping is judged to be a Webshell.

6. The webpage backdoor detection method based on a code attribute graph as claimed in claim 1, wherein step S2 specifically comprises: taking the syntactic attributes extracted from the abstract syntax tree as nodes, and building the code attribute graph in the open-source graph database Neo4j from the control flow information in the control flow graph, the data flow information in the program dependence graph and the inter-procedural relationships in the function call graph.

7. A webpage backdoor detection system based on a code attribute graph, comprising:

a first module, configured to establish an abstract syntax tree from a test sample, and then establish a control flow graph, a program dependence graph and a function call graph from the abstract syntax tree;

a second module, configured to extract the syntactic attributes in the abstract syntax tree as nodes, extract the control flow from the control flow graph, the data flow from the program dependence graph and the inter-procedural relationships from the function call graph, combine them to construct a code attribute graph, and store the code attribute graph in a code attribute graph database;

a third module, configured to perform webpage backdoor detection by searching for a taint path through reverse traversal of the code attribute graph database.

8. A computer storage medium storing computer program instructions that, when executed, implement the web page backdoor detection method of any of claims 1-6.

9. A webpage backdoor detection system based on a code attribute graph, the webpage backdoor detection system comprising the computer storage medium of claim 8.