CN110532776B

CN110532776B - Android malicious software efficient detection method, system and medium based on runtime data analysis

Info

Publication number: CN110532776B
Application number: CN201910836444.1A
Authority: CN
Inventors: 吕品; 乔智; 许嘉; 李陶深
Original assignee: Guangxi University
Current assignee: Guangxi University
Priority date: 2019-09-05
Filing date: 2019-09-05
Publication date: 2021-08-27
Anticipated expiration: 2039-09-05
Also published as: CN110532776A

Abstract

The invention discloses a method, a system and a medium for efficiently detecting Android malware based on runtime data analysis. After the APP is run, an operation simulation tool operates the APP while its behavior is tracked and recorded, generating the running data of the APP; the running data is extracted through a heterogeneous information network (HIN) to obtain structured data, which is formed into a kernel matrix by means of meta-paths; and the kernel matrix is input into a pre-trained machine learning classifier to obtain a detection result. The invention extracts the behavior data of the APP using a dynamic feature extraction technique, structures the extracted behavior data through the HIN, forms the structured data into a kernel matrix by means of meta-paths, and trains a support vector machine (SVM) classifier on it, thereby achieving a shorter training time while maintaining high accuracy.

Description

Android malicious software efficient detection method, system and medium based on runtime data analysis

Technical Field

The invention relates to the technical field of software and information security, in particular to a method, a system and a medium for efficiently detecting Android malicious software based on runtime data analysis.

Background

As the mobile platform with the highest market share, the Android system has built an open ecosystem. This openness has helped the application market flourish, but the resulting flood of malicious software also poses a serious security threat to users. The 2018 Android malware special report issued by the 360 Internet Security Center shows that, over the whole of 2018, the center captured about 4.342 million new mobile malware samples, an average of about 12,000 new samples per day, and that cumulative monitored mobile malware infections totaled about 110 million, an average of about 292,000 per day. Android malware detection has become a problem of common concern to industry and academia, and research on efficient malware detection methods for the Android system is of great significance.

At present, Android malware detection technologies can be roughly divided into two types: static feature extraction and dynamic feature extraction. Most research on static feature extraction decompiles the APP and analyzes the decompiled code, and some common open-source tools are also widely used for static feature analysis. Research on dynamic feature extraction focuses on monitoring APP behavior data related to user privacy or sensitive API calls. With either approach, malware detection can be realized in combination with a classification algorithm.

However, static feature extraction yields a large amount of useless information, which makes the training time too long. Therefore, how to reduce the model training time as much as possible while maintaining a high recognition accuracy is a key technical problem that urgently needs to be solved.

Disclosure of Invention

The technical problems to be solved by the invention are as follows: aiming at the problems in the prior art, the invention provides a method, a system and a medium for efficiently detecting Android malicious software based on runtime data analysis.

In order to solve the technical problems, the invention adopts the technical scheme that:

an Android malicious software efficient detection method based on runtime data analysis comprises the following implementation steps:

1) acquiring a package name and a starting page name of an APP;

2) after running the APP based on the package name and the starting page name, simulating human operation of the APP through an operation simulation tool, and tracking and recording the APP to generate its running data, wherein the running data of the APP comprises information on calls from the APP to APIs and calls from APIs to other APIs;

3) extracting the running data of the APP through a heterogeneous information network (HIN) to obtain structured data of the running data of the APP, and forming the structured data into a kernel matrix by means of meta-paths; the HIN comprises two node types and two edge types, wherein the two node types are APP and API, the two edge types are calls from the APP to an API and calls from one API to another API, and the relations are given by the number of calls from the APP to each API and the number of calls between APIs;

4) inputting the kernel matrix into a pre-trained machine learning classifier to obtain a detection result of whether the APP is malicious software, wherein the machine learning classifier has established, through pre-training, a mapping between the structured data of the running data of an APP and the detection result of whether the APP is malicious software.

Optionally, the obtaining of the package name and the start page name of the APP in step 1) specifically means obtaining the package name and the start page name of the APP by decompiling the APP.

Optionally, running the APP in step 2) specifically means running the APP in a virtual machine.

Optionally, extracting the running data of the APP through the heterogeneous information network HIN in step 3) to obtain the structured data of the running data of the APP specifically means filtering out those calls from the APP to an API in which the API does not go on to call any other API, so that all the remaining relations are API call sequences.

Optionally, the machine learning classifier in step 4) is a support vector machine classifier.

Optionally, step 4) is preceded by a step of training a support vector machine classifier, and the detailed steps include:

S1) for a variety of benign APPs and malicious APPs, executing steps 1) to 3) to extract the corresponding kernel matrices, and attaching a benign or malicious label to each kernel matrix obtained, so as to establish a training sample set and a test sample set;

S2) training the support vector machine classifier on the training samples in the training sample set, and proceeding to the next step after a specified amount or duration of training has been completed;

S3) testing the support vector machine classifier on the test samples in the test sample set to obtain the classification accuracy of the support vector machine classifier after this round of training;

S4) judging whether a training termination condition is met, wherein the training termination condition is that a specified amount or duration of training has been completed or that the classification accuracy has reached a preset threshold; if the training termination condition is not met, returning to step S2); otherwise, ending and exiting.
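The control flow of steps S2) to S4) can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function name `train_until`, the callables `train_round` and `evaluate`, and the demonstration accuracy values are all hypothetical, and the actual SVM training is left to whatever routine is passed in.

```python
from typing import Callable, Tuple


def train_until(train_round: Callable[[], None],
                evaluate: Callable[[], float],
                max_rounds: int,
                target_accuracy: float) -> Tuple[int, float]:
    """Repeat S2 (train) and S3 (test) until the S4 termination condition holds.

    Terminates when the accuracy reaches target_accuracy or when
    max_rounds rounds of training have been completed.
    """
    accuracy = 0.0
    for round_no in range(1, max_rounds + 1):
        train_round()            # S2: one specified amount of training
        accuracy = evaluate()    # S3: accuracy on the test sample set
        if accuracy >= target_accuracy:  # S4: threshold reached
            return round_no, accuracy
    return max_rounds, accuracy  # S4: training budget exhausted


# Hypothetical demonstration: accuracies a classifier might report per round.
scores = iter([0.80, 0.91, 0.96, 0.97])
result = train_until(lambda: None, lambda: next(scores),
                     max_rounds=10, target_accuracy=0.95)
print(result)  # (3, 0.96)
```

Passing the training and evaluation routines in as callables keeps the termination logic independent of the classifier library that is actually used.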

In addition, the invention also provides an Android malicious software efficient detection system based on runtime data analysis, which comprises:

the package information acquisition program unit is used for acquiring the package name and the starting page name of the APP;

the operation data acquisition program unit is used for simulating human operation of the APP through an operation simulation tool after the APP is run, and tracking and recording the APP to generate its running data, wherein the running data of the APP comprises information on calls from the APP to APIs and calls from APIs to other APIs;

the structured data acquisition program unit is used for extracting the running data of the APP through a heterogeneous information network (HIN) to obtain structured data of the running data of the APP, and forming the structured data into a kernel matrix by means of meta-paths; the HIN comprises two node types and two edge types, wherein the two node types are APP and API, the two edge types are calls from the APP to an API and calls from one API to another API, and the relations are given by the number of calls from the APP to each API and the number of calls between APIs;

and the result classification program unit is used for inputting the kernel matrix into a pre-trained machine learning classifier to obtain a detection result of whether the APP is malicious software, wherein the machine learning classifier has established, through pre-training, a mapping between the structured data of the running data of an APP and the detection result of whether the APP is malicious software.

In addition, the invention also provides an Android malicious software efficient detection system based on runtime data analysis, which comprises computer equipment, wherein the computer equipment is programmed or configured to execute the steps of the Android malicious software efficient detection method based on runtime data analysis.

In addition, the invention also provides an Android malicious software efficient detection system based on runtime data analysis, which comprises computer equipment, wherein a storage medium of the computer equipment stores a computer program which is programmed or configured to execute the Android malicious software efficient detection method based on runtime data analysis.

In addition, the present invention also provides a computer readable storage medium having stored thereon a computer program programmed or configured to execute the runtime data analysis-based Android malware efficient detection method.

Compared with the prior art, the invention has the following advantages: the invention extracts the behavior data of the APP using a dynamic feature extraction technique, structures the extracted behavior data through a heterogeneous information network (HIN), forms the structured data into a kernel matrix by means of meta-paths, and trains a support vector machine (SVM) classifier on it, thereby achieving a shorter training time while maintaining high accuracy. The feature extraction technique is a highlight of the invention and differs from traditional dynamic feature extraction software: TaintDroid and DroidBox, for example, only simulate the running of the APP on a virtual machine and observe whether malicious behaviors, such as file read/write operations or access to SMS messages and telephone information, occur in the virtual machine. The method of the invention instead focuses on the API call sequences inside the software, and can extract the API calls made during the dynamic running of the APP using Android tracing software such as TraceView. In addition, the invention combines heterogeneous information networks with malware detection, using the HIN to represent the dynamically extracted information, which strengthens the associations between data, makes it harder for malware to evade detection, and improves detection accuracy. Its training efficiency is also beyond that of existing malware detection techniques: the model training time is greatly reduced while a high accuracy is maintained, making it a more efficient and practical Android malware detection method.

Drawings

FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of the training and using principles of the method according to the embodiment of the present invention.

FIG. 3 is an example of the content of a portion of an HTML document containing APP running data extracted according to an embodiment of the present invention.

FIG. 4 is an example of an HTML document showing the call relationships between APIs in the generated APP running data in an embodiment of the present invention.

FIG. 5 is a diagram illustrating the filtering of the running data generated for an APP in an embodiment of the present invention.

FIG. 6 is a schematic diagram of training test time comparison between the method of the embodiment of the present invention and the prior art method.

FIG. 7 is a diagram illustrating the comparison between the occupancy rates of the memory and the CPU in the method of the embodiment of the present invention and the existing method.

FIG. 8 is a graphical illustration comparing the API quantities of an embodiment method of the present invention and a prior art method.

Detailed Description

As shown in fig. 1 and fig. 2, the implementation steps of the Android malware efficient detection method based on runtime data analysis in the embodiment include:

1) acquiring a package name and a start page name (Activity) of an APP (Android software);

In order to monitor the behavior of the APP with TraceView, it is necessary to know which process to monitor, and the package name of the APP can be used to identify the process. In addition, in order to start the APP, the start page name of the APP needs to be known. In this embodiment, obtaining the package name and the start page name of the APP in step 1) specifically means obtaining them by decompiling the APP. As a specific implementation, the decompiling tool Apktool can be used to decompile the APP; the resulting AndroidManifest document includes permission information and registered page information, and the package name and the start page name are extracted from it. For example, the start page of an APP is an Activity component in a document that contains "identity.
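As an illustration of this step, the following Python sketch reads the package name and the start Activity from a decoded manifest. It assumes the common Android convention that the start page is the Activity whose intent filter declares the action `android.intent.action.MAIN` and the category `android.intent.category.LAUNCHER`; the function name and the sample manifest are hypothetical and are not taken from the embodiment.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"  # the android:name attribute


def find_package_and_launcher(manifest_xml: str):
    """Return (package name, start Activity name) from decoded manifest text."""
    root = ET.fromstring(manifest_xml)
    package = root.get("package")
    for activity in root.iter("activity"):
        for intent_filter in activity.findall("intent-filter"):
            actions = {a.get(NAME_ATTR) for a in intent_filter.findall("action")}
            categories = {c.get(NAME_ATTR) for c in intent_filter.findall("category")}
            if ("android.intent.action.MAIN" in actions
                    and "android.intent.category.LAUNCHER" in categories):
                return package, activity.get(NAME_ATTR)
    return package, None  # no launcher Activity declared


# Hypothetical manifest, shaped like the text file Apktool produces.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
  <application>
    <activity android:name=".SettingsActivity"/>
    <activity android:name=".MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
  </application>
</manifest>"""

print(find_package_and_launcher(SAMPLE_MANIFEST))
# ('com.example.demo', '.MainActivity')
```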

In this embodiment, running the APP in step 2) specifically means running the APP in a virtual machine; using a virtual machine improves the security of the environment and can also improve the detection accuracy of the running data.

In this embodiment, the following tools are used in step 2):

1. The Monkey tool is used as the operation simulation tool to simulate human operation of the APP; in this embodiment, Monkey randomly generates 50 operations, including page switching, trackball events and other random operations.

2. TraceView, which ships with Android, is used to start trace monitoring and to monitor the running data of the APP in real time.

3. The dmtracedump tool is used to convert the generated binary trace file into an HTML file, so that a Python script can be written to extract the running data of the APP.

FIG. 3 is an example of part of an HTML document containing the APP running data extracted in this embodiment. As can be seen from FIG. 3, the APP running data includes information on calls from the APP to APIs and calls from APIs to other APIs, and all call relationships are organized into a tree. API No. 6, "(toplevel)", is the root of the tree and can be understood as the APP node, so APIs No. 0, 1, 2, 3, 4 and 5 beneath it are all called by the APP, and this information is extracted. In line 3 of FIG. 3, 112/112 indicates that API No. 0 is called 112 times by the APP out of 112 calls in total. In line 4 of FIG. 3, 1/3 indicates that API No. 1 is called once by the APP out of 3 calls in total. FIG. 4 is an HTML document illustrating the call relationships between APIs: API No. 2 appearing under API No. 1 in line 3 means that API No. 1 calls API No. 2, and 1/2 in line 6 indicates that API No. 2 is called once by API No. 1. The feature extraction part of this embodiment extracts the call relationship graph between the APP and the APIs and among the APIs.
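The way the "calls by parent / total calls" figures turn into weighted edges can be sketched in Python as follows. The sketch does not parse the real HTML layout, which is not reproduced here; it assumes the rows have already been reduced to hypothetical (caller, callee, ratio) triples, and the names `api_0`, `api_1` and `api_2` stand in for the numbered APIs of FIG. 3 and FIG. 4.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def parse_ratio(text: str) -> Tuple[int, int]:
    """Split 'calls_from_parent/total_calls', e.g. '1/3', into two integers."""
    from_parent, total = text.strip().split("/")
    return int(from_parent), int(total)


def build_edges(rows: List[Tuple[str, str, str]]) -> Dict[Edge, int]:
    """Map (caller, callee) to the number of calls made by that caller."""
    edges: Dict[Edge, int] = {}
    for caller, callee, ratio in rows:
        calls, _total = parse_ratio(ratio)
        edges[(caller, callee)] = edges.get((caller, callee), 0) + calls
    return edges


# Hypothetical rows mirroring the figures discussed above.
ROWS = [
    ("(toplevel)", "api_0", "112/112"),  # APP calls API No. 0 112 times
    ("(toplevel)", "api_1", "1/3"),      # APP calls API No. 1 once
    ("api_1", "api_2", "1/2"),           # API No. 1 calls API No. 2 once
]

edges = build_edges(ROWS)
# The root node stands for the APP, so its outgoing edges are APP-to-API calls.
app_to_api = {e: n for e, n in edges.items() if e[0] == "(toplevel)"}
api_to_api = {e: n for e, n in edges.items() if e[0] != "(toplevel)"}
print(app_to_api)
print(api_to_api)
```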

In this embodiment, the APP can be opened in the virtual machine using an ADB command together with the package name and the start Activity. Monitoring of the APP is then started with TraceView, trace monitoring is stopped after the Monkey tool has performed 50 random operations on the APP, and the Trace file is extracted. At this point the Trace file only shows which APIs were called; the call counts and the call relationships between APIs require further analysis. Therefore, dmtracedump is used to analyze the Trace file and generate an HTML document. After the HTML document is obtained, a Python script is used to construct the heterogeneous information network HIN of this embodiment, so as to obtain from the HTML document the features to be extracted for training.
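The sequence described above can be sketched as a list of command lines built in Python. This is only one possible arrangement under stated assumptions: the embodiment starts tracing with TraceView, whereas the sketch substitutes the `am profile start` and `am profile stop` shell commands as a command-line stand-in; the file paths, the function name and the example package are hypothetical; and the commands are only constructed, not executed.

```python
from typing import List


def build_trace_commands(package: str, activity: str, events: int = 50,
                         device_trace: str = "/data/local/tmp/app.trace",
                         local_trace: str = "app.trace") -> List[List[str]]:
    """Return the command lines for one tracing run, in execution order."""
    component = f"{package}/{activity}"
    return [
        # Start the APP by package name and start Activity.
        ["adb", "shell", "am", "start", "-n", component],
        # Begin method tracing of the APP process (stand-in for TraceView).
        ["adb", "shell", "am", "profile", "start", package, device_trace],
        # Let Monkey inject the random operations.
        ["adb", "shell", "monkey", "-p", package, str(events)],
        # Stop tracing and copy the binary trace file to the host.
        ["adb", "shell", "am", "profile", "stop", package],
        ["adb", "pull", device_trace, local_trace],
        # Convert the trace to HTML; the caller is assumed to capture the output.
        ["dmtracedump", "-h", local_trace],
    ]


commands = build_trace_commands("com.example.demo", ".MainActivity")
for command in commands:
    print(" ".join(command))
```

Each inner list could be passed to `subprocess.run` on a machine with the Android SDK tools installed; the defaults would need adjusting to the device and Android version in use.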

In the heterogeneous information network HIN constructed in this embodiment, the nodes consist of APPs and APIs, and the relation information consists of the number of calls from an APP to each API and the number of calls between APIs. This embodiment exploits the highly abstract semantics of the HIN and applies it to malware detection. After the HIN is constructed, this embodiment uses meta-paths to measure the similarity between APPs, and thereby detects malware. In this embodiment, extracting the running data of the APP through the HIN in step 3) to obtain the structured data of the running data of the APP specifically means filtering out those calls from the APP to an API in which the API does not go on to call any other API, so that all the remaining relations are API call sequences. In actual experiments, an API without any contextual calls cannot indicate whether the APP is malicious: API2, API3 and API4 in FIG. 5, for example, are only called by the APP and have no other call information, and an APP cannot be judged malicious merely because it calls a single API. This embodiment therefore filters out such information (the filtering targets are calls from the APP to an API where the API calls no other API), which makes the overall technique more efficient.
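The filtering rule can be illustrated with a short Python sketch. The edge dictionaries, the call counts and the function name are hypothetical; only the rule itself (drop an APP-to-API call when that API calls no other API) comes from the description above.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def filter_leaf_app_calls(app_to_api: Dict[Edge, int],
                          api_to_api: Dict[Edge, int]) -> Dict[Edge, int]:
    """Keep only APP-to-API edges whose API goes on to call another API."""
    callers = {caller for caller, _callee in api_to_api}
    return {edge: count for edge, count in app_to_api.items()
            if edge[1] in callers}


# Hypothetical data in the spirit of FIG. 5: API2, API3 and API4 are only
# called by the APP, while API1 calls a further API.
app_to_api = {("app", "API1"): 1, ("app", "API2"): 4,
              ("app", "API3"): 2, ("app", "API4"): 7}
api_to_api = {("API1", "API5"): 3, ("API5", "API6"): 1}

kept = filter_leaf_app_calls(app_to_api, api_to_api)
print(kept)  # {('app', 'API1'): 1}
```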

In this embodiment, the machine learning classifier in step 4) is a support vector machine (SVM) classifier; meta-paths are used to describe the constructed heterogeneous information network, and the kernel matrix formed from the meta-paths is input to the SVM classifier for training. Step 4) is preceded by a step of training the support vector machine classifier, whose detailed steps are steps S1) to S4) described above.
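How a meta-path can yield a kernel matrix is sketched below in plain Python. The text does not state which meta-path is used, so the sketch assumes one possible choice, APP to API to API and back again, which gives K = (A·B)(A·B)^T, where A holds APP-to-API call counts and B holds API-to-API call counts. The matrices and helper names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def transpose(x: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*x)]


def metapath_kernel(app_api: Matrix, api_api: Matrix) -> Matrix:
    """Return K = (A B)(A B)^T, an APP-by-APP similarity matrix."""
    ab = matmul(app_api, api_api)
    return matmul(ab, transpose(ab))


# Hypothetical counts: 2 APPs and 3 APIs.
A = [[1, 0, 2],   # calls from APP 0 to each API
     [1, 3, 0]]   # calls from APP 1 to each API
B = [[0, 1, 0],   # API 0 calls API 1
     [0, 0, 1],   # API 1 calls API 2
     [0, 0, 0]]   # API 2 calls nothing

K = metapath_kernel(A, B)
print(K)  # [[1, 1], [1, 10]]
```

Because K has the form M·M^T it is symmetric and positive semidefinite, so it can be handed to an SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.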

In order to verify the Android malware efficient detection method based on runtime data analysis of this embodiment (DyFex method for short), the existing HinDroid method was selected for comparison. The HinDroid method is based on static feature extraction: it decompiles APP installation files and analyzes the call relationships between the APIs in them. The DyFex method instead collects data while the APP is running and extracts API call sequences from that data. The comparison shows that the accuracy of the DyFex method reaches 95.6%, while the accuracy of the HinDroid method is 98.6%, so the accuracy of the DyFex method is close to that of the HinDroid method. With the two methods at similar accuracy, they were further compared in three respects: training and test time (as shown in fig. 6), memory and CPU occupancy (as shown in fig. 7), and number of APIs (as shown in fig. 8). As can be seen from fig. 6, 7 and 8, the training time and test time of the DyFex method are greatly reduced compared with the HinDroid method, to only about 40% of the HinDroid method, and the number of APIs extracted by the DyFex method is only 36.6% of that of the HinDroid method. This shows that the DyFex method greatly reduces the training time while maintaining a high recognition accuracy.

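In addition, this embodiment also provides an Android malware efficient detection system based on runtime data analysis, including the package information acquisition program unit, the operation data acquisition program unit, the structured data acquisition program unit and the result classification program unit described above.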

In addition, the embodiment also provides an Android malware efficient detection system based on runtime data analysis, which includes a computer device programmed or configured to execute the steps of the aforementioned Android malware efficient detection method based on runtime data analysis according to the embodiment.

In addition, the embodiment also provides an Android malware efficient detection system based on runtime data analysis, which includes a computer device, where a storage medium of the computer device stores a computer program that is programmed or configured to execute the aforementioned Android malware efficient detection method based on runtime data analysis according to the embodiment.

In addition, the present embodiment also provides a computer-readable storage medium, on which a computer program is stored, which is programmed or configured to execute the foregoing runtime data analysis-based Android malware efficient detection method according to the present embodiment.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. A runtime data analysis-based Android malicious software efficient detection method is characterized by comprising the following implementation steps:

1) acquiring a package name and a starting page name of an APP;

3) extracting the running data of the APP through a heterogeneous information network (HIN) to obtain structured data of the running data of the APP, and forming the structured data into a kernel matrix by means of meta-paths; the HIN comprises two node types and two edge types, wherein the two node types are APP and API, the two edge types are calls from the APP to an API and calls from one API to another API, and the relations are given by the number of calls from the APP to each API and the number of calls between APIs; extracting the running data of the APP through the HIN to obtain the structured data of the running data of the APP specifically means filtering out those calls from the APP to an API in which the API does not call any other API, so that the remaining relations are all API call sequences;

2. The runtime data analysis-based Android malware efficient detection method as claimed in claim 1, wherein the obtaining of the package name and the start page name of the APP in step 1) specifically means obtaining of the package name and the start page name of the APP by decompiling the APP.

3. The runtime data analysis-based Android malware efficient detection method according to claim 1, wherein running the APP in step 2) specifically means running the APP in a virtual machine.

4. The efficient Android malware detection method based on runtime data analysis according to any one of claims 1-3, wherein the machine learning classifier in step 4) is a support vector machine classifier.

5. The runtime data analysis-based Android malware efficient detection method according to claim 4, characterized in that step 4) is preceded by a step of training a support vector machine classifier, and the detailed steps include:

6. An Android malware efficient detection system based on runtime data analysis is characterized by comprising:

the structured data acquisition program unit is used for extracting the running data of the APP through a heterogeneous information network (HIN) to obtain structured data of the running data of the APP, and forming the structured data into a kernel matrix by means of meta-paths; the HIN comprises two node types and two edge types, wherein the two node types are APP and API, the two edge types are calls from the APP to an API and calls from one API to another API, and the relations are given by the number of calls from the APP to each API and the number of calls between APIs; extracting the running data of the APP through the HIN to obtain the structured data of the running data of the APP specifically means filtering out those calls from the APP to an API in which the API does not call any other API, so that the remaining relations are all API call sequences;

7. An Android malware efficient detection system based on runtime data analysis, comprising a computer device, characterized in that the computer device is programmed or configured to execute the steps of the Android malware efficient detection method based on runtime data analysis of any one of claims 1 to 5.

8. An Android malware efficient detection system based on runtime data analysis, comprising a computer device, wherein a storage medium of the computer device stores a computer program programmed or configured to execute the Android malware efficient detection method based on runtime data analysis according to any one of claims 1 to 5.

9. A computer-readable storage medium, wherein the computer-readable storage medium has stored thereon a computer program programmed or configured to execute the runtime data analysis-based Android malware efficient detection method of any one of claims 1-5.