CN118277914A - A mobile application classification method based on dynamic and static combined multi-dimensional APK features - Google Patents
- Publication number
- CN118277914A
- Authority
- CN
- China
- Prior art keywords
- app
- features
- apk
- dynamic
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/24765—Rule-based classification
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of APP classification analysis, and specifically to a mobile application classification method based on dynamic and static combined multi-dimensional APK features.
Background Art
In recent years, APK (Android application package) classification and analysis technology has developed significantly. Representative approaches include the following:
1. APK classification based on machine learning and deep learning: (1) enter the App name into an Internet search engine and process the results to obtain an App document; (2) extract keyword distribution features using a vector space model and train one base classifier on them with shallow learning techniques; (3) train word vectors with word2vec and train a second base classifier on them with a convolutional neural network; (4) design a co-training framework that uses unlabeled samples to train the two base classifiers collaboratively, and fuse the training results to obtain the final App classifier. This approach achieves personalized classification of Apps using only the App name and needs only a small number of labeled samples to build a classification model with relatively high accuracy. Its co-training framework accounts for the performance imbalance between the base classifiers, which reduces the influence of noisy data in the unlabeled samples.
2. Classification based on APP user reviews: obtain APP user review data, then clean and label it; build an SVTEO model and an NBTEO model. The SVTEO model is built by extracting the Encoder part of the Transformer model to obtain a Transformer-Encoder-Only layer, connecting a pooling layer after it, and connecting a linear layer and a support vector layer in parallel after the pooling layer. The NBTEO model is built the same way, except that a linear layer and a naive Bayes layer are connected in parallel after the pooling layer. The Transformer-Encoder-Only layer consists of an Embedding layer and six Encoder layers. Using the labeled data, the linear layers of the SVTEO and NBTEO models undergo machine learning and deep learning homogeneous learning, and the SVTEO and NBTEO models undergo heterogeneous learning and parameter fine-tuning; the processed models are then combined into a user review requirement classification model. The APP user review data to be classified is fed into this model for classification and labeling, yielding the classification labels of the review data.
For example, the application numbered "CN201810073847.0" describes a method by which an APP automatically assigns categories based on keywords, comprising the following steps: A. establish a classification system and set up corresponding keyword databases, creating several groups of keyword databases according to the categories and storing them in a storage module; B. the APP collects uploaded information content through a collection module and transmits it to the server; C. the server matches the received content against each group of keyword databases through a matching and identification module; D. if the content matches one or more groups of keyword databases, the server returns the categories of the matched databases to the APP, and the APP assigns the content to those categories through an execution module; if the content matches none of the keyword databases, the server returns the no-match result to the APP, and the APP classifies the content separately through the execution module.
Existing APP classification techniques use few dimensions, relying mainly on a small number of static source code features such as name, LOGO, and keywords. Such static features represent the category only weakly and cannot fully and accurately identify an APP's actual business category. In particular, APPs that are counterfeit or that provide non-compliant content are hard to identify accurately from a limited set of static source code features.
To address these problems, a mobile application classification method based on dynamic and static combined multi-dimensional APK features is needed.
Summary of the Invention
The purpose of the present invention is to provide a mobile application classification method based on dynamic and static combined multi-dimensional APK features. The invention extracts dynamic and static multi-dimensional features of an APP: source code features (APP name, LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, static library), traffic features (main service domain name, IP, path, header information, request content), and page content features (page display content, snapshot). For APP classification scenarios such as communication APPs, it extracts combinations of common technical features or content features, designs an APP feature extraction method that combines dynamic and static features, and builds a two-level classification model for specific APP types consisting of a rule matching model and a scoring and ranking model, thereby classifying mobile applications effectively within massive APP data.
The present invention is implemented as follows:
The present invention provides a mobile application classification method based on dynamic and static combined multi-dimensional APK features, performed in the following steps:
S1: Construct APP features. Collect APP information from mainstream mobile application stores, small Internet distribution platforms, and APP distribution pages; identify each APP's business category from the functions it provides or the information content it presents; collect information on communication APPs to form an initial test data set;
S2: Analyze the APP source code to obtain the APP's static source code features, dynamic traffic data, and page feature data, specifically name, traffic, and content information. The name information includes the LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, and static library; the traffic information includes the domain name, IP, path, header information, and request content; the content information includes the page display content and snapshot information.
The APP source code is analyzed by parsing the APK package file and building an information extraction function based on Android's design specifications and output requirements, which yields the name, LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, and static library information;
For runtime analysis, an automated control program is built that installs and runs the APP automatically in an Android device environment and extracts the APP's communication traffic and rendered page content. When analyzing the characteristics of communication APPs, characteristic data on functions, permissions, layout, and content are collected, and a rule data set for rule matching and a rule set for scoring and ranking are built;
Features of different dimensions are configured according to the business or technical characteristics of the APP. The features of a specific APP include keywords; the features of communication APPs include address book, friends, and sending, together with specific SDKs and static libraries for data transmission and data encryption.
S3: Establish the rule matching model and matching mechanism. A timed scanning program extracts the formatted, stored data as objects and evaluates them against the preset rule matching model of each category; if any rule hits, the APP is given the corresponding classification label and enters the waiting state for the scoring and ranking model. The APP's static source code features, dynamic traffic features, and dynamic content features are obtained, and the full data set is cleaned and stored in a formatted form;
The static source code features include the name, LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, and static library information;
The dynamic traffic features include the domain name, IP, path, header information, and request content; the dynamic content features include the page display content, page snapshots, and page layout file information.
S4: Establish the scoring and ranking model and the screening mechanism model. A timed scanning program scans the APPs that carry a business classification label and are in the pending review state, and filters them according to the rules;
S5: Pass the large volume of APPs through the rule matching model for rule matching and through the screening mechanism model for branch screening; the output results form the classification of the APPs.
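The rule matching stage of step S3 can be illustrated with a minimal Python sketch. It assumes a simplified feature record and a hypothetical rule set: the `AppRecord` fields, the `RULES` keywords, and the `pending_scoring` status name are illustrative and do not come from the source. The only behavior taken from the text is that any single hit attaches the category label and moves the APP into the waiting state for the scoring and ranking model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AppRecord:
    """Formatted feature record for one APP (field names are illustrative)."""
    name: str
    package_name: str
    permissions: set[str] = field(default_factory=set)
    page_text: str = ""
    labels: set[str] = field(default_factory=set)
    status: str = "new"


# Hypothetical rule set: category -> keywords and permissions that indicate it.
RULES = {
    "communication": {
        "keywords": {"contacts", "friends", "send"},
        "permissions": {"android.permission.READ_CONTACTS"},
    },
}


def match_rules(app: AppRecord, rules: dict = RULES) -> AppRecord:
    """Label the APP with every category for which at least one rule hits."""
    text = f"{app.name} {app.page_text}".lower()
    for category, rule in rules.items():
        keyword_hit = any(keyword in text for keyword in rule["keywords"])
        permission_hit = bool(app.permissions & rule["permissions"])
        if keyword_hit or permission_hit:
            app.labels.add(category)
    if app.labels:
        # Any hit queues the APP for the second-level scoring model.
        app.status = "pending_scoring"
    return app
```

In a deployment along the lines described, a scheduler would call `match_rules` on each stored record at fixed intervals; the scheduling itself is omitted here.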
Further, in step S4, the specific rules are: perform keyword feature matching on the APK name, localization configuration, and the traffic's domain name, IP, path, header information, and request content; perform data set feature matching on the LOGO, package name, signature HASH, and signature OWNER; and calculate a score for each dimension of the APK from the matching degree and weight of the feature.
Scores and dimensions: for each APK, two scores A and B are aggregated. A is the total score of the content attributes (APK name and localization configuration); B is the total score of dimensions such as package name, signature, and LOGO. The number of dimensions hit is also recorded: A1 is the number of dimensions hit in A, and B1 is the number of dimensions hit in B.
Identifying APKs that meet the classification requirements: if B1 is greater than 1, that is, the APK matches the feature rules in more than one of the package name, signature, and LOGO dimensions, it is judged to be an APK of the expected category;
In step S5, if B1 equals 1 and A1 is greater than 1, the APK is sent for manual review, because the APK name and localization configuration match rules in multiple dimensions and further review is needed to determine the category. Feature rule extraction and merging: feature rules are extracted from the APKs judged in the previous step to belong to the category and merged into the rule set of step S4, so that the rule set is enriched and can be updated and improved over time.
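The two decision rules stated above can be written down directly. The following Python sketch covers only the two cases the text specifies (B1 greater than 1, and B1 equal to 1 with A1 greater than 1). The `UNDECIDED` fallback for every other combination is an assumption, since the source does not say how those cases are handled, and the names `Decision` and `decide` are illustrative.

```python
from enum import Enum


class Decision(Enum):
    CLASSIFIED = "classified"        # judged to be an APK of the expected category
    MANUAL_REVIEW = "manual_review"  # sent to a human reviewer
    UNDECIDED = "undecided"          # assumption: the source does not define this case


def decide(a1: int, b1: int) -> Decision:
    """Apply the hit-count rules: a1 and b1 are the dimensions hit in A and B."""
    if b1 > 1:
        return Decision.CLASSIFIED
    if b1 == 1 and a1 > 1:
        return Decision.MANUAL_REVIEW
    return Decision.UNDECIDED
```

For instance, an APK that hits both the package name and signature dimensions (`b1 == 2`) is classified regardless of `a1`, whereas one that hits only the LOGO dimension but matches both name and localization rules goes to manual review.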
Further, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a main controller, implements the method described in any one of the above.
Further, the core principle of the present invention is to build a two-level classification and recognition model from the APP's basic static source code features, dynamic traffic features, and dynamic content features, so that APPs can be classified and judged accurately. The core process includes:
1. Extract the features of a specific type of APP, eighteen types in total: static source code features (name, LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, static library), dynamic traffic features (domain name, IP, path, header information, request content), and content features (page display content, snapshot);
2. Build the rule matching model: using the APP features extracted in the previous step as the matching features for a specific category, scan the massive APP data and label the business category;
3. Build the scoring and ranking model: further evaluate the APPs found in the previous step, define scoring criteria according to multi-dimensional combination rules over the static source code features, dynamic traffic features, and dynamic content features, set an output threshold, and output accurate classification data;
4. Optimize and supplement the features: manually review the data labeled by the system, output the final data, and feed the corresponding feature data back into steps S3-S4.
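Item 1 lists five dynamic traffic dimensions: domain name, IP, path, header information, and request content. As a minimal Python sketch, assuming one captured HTTP request is already available as a URL, a header mapping, a body, and the remote IP, the request could be normalized into those dimensions as follows. The function name, the output keys, and the sample values in the usage note are illustrative; how traffic is captured on the device is outside this sketch.

```python
from urllib.parse import urlsplit


def traffic_features(url: str, headers: dict, body: str, remote_ip: str) -> dict:
    """Map one captured request onto the five traffic dimensions from the text."""
    parts = urlsplit(url)
    return {
        "domain": parts.hostname or "",
        "ip": remote_ip,
        "path": parts.path or "/",
        # Header names are case-insensitive in HTTP, so normalize them for matching.
        "headers": {name.lower(): value for name, value in headers.items()},
        "request_content": body,
    }
```

Calling `traffic_features("https://api.example.com/v1/send?to=42", {"User-Agent": "demo"}, "hello", "203.0.113.7")` gives a record whose `domain` is `api.example.com` and whose `path` is `/v1/send`, which keyword rules can then be matched against.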
Compared with the prior art, the present invention has the following beneficial effects:
1. In scenarios involving the analysis of massive numbers of APPs, the present invention can quickly identify APPs of each preset category.
2. It achieves relatively high recognition accuracy for APPs with distinctive technical or content features, reducing the need for manual review.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 is a flowchart of the overall method of the present invention;
FIG. 2 is a flowchart of APP classification execution in the present invention.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to FIGS. 1-2, a mobile application classification method based on dynamic and static combined multi-dimensional APK features is performed in the following steps:
S1: Construct APP features. Collect APP information from mainstream mobile application stores, small Internet distribution platforms, and APP distribution pages; identify each APP's business category from the functions it provides or the information content it presents; collect information on communication APPs to form an initial test data set;
S2: Analyze the APP source code to obtain the APP's static source code features, dynamic traffic data, and page feature data, specifically name, traffic, and content information. The name information includes the LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, and static library; the traffic information includes the domain name, IP, path, header information, and request content; the content information includes the page display content and snapshot information.
The APP source code is analyzed by parsing the APK package file and building an information extraction function based on Android's design specifications and output requirements, which yields the name, LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, and static library information;
For runtime analysis, an automated control program is built that installs and runs the APP automatically in an Android device environment and extracts the APP's communication traffic and rendered page content. When analyzing the characteristics of communication APPs, characteristic data on functions, permissions, layout, and content are collected, and a rule data set for rule matching and a rule set for scoring and ranking are built;
Features of different dimensions are configured according to the business or technical characteristics of the APP. The features of a specific APP include keywords; the features of communication APPs include address book, friends, and sending, together with specific SDKs and static libraries for data transmission and data encryption.
S3: Establish the rule matching model and matching mechanism. A timed scanning program extracts the formatted, stored data as objects and evaluates them against the preset rule matching model of each category; if any rule hits, the APP is given the corresponding classification label and enters the waiting state for the scoring and ranking model. The APP's static source code features, dynamic traffic features, and dynamic content features are obtained, and the full data set is cleaned and stored in a formatted form;
The static source code features include the name, LOGO, package name, signature HASH, signature OWNER, localization configuration, layout file, permissions, service declaration, SDK, and static library information;
The dynamic traffic features include the domain name, IP, path, header information, and request content; the dynamic content features include the page display content, page snapshots, and page layout file information. The feature dimensions are shown in Table 1;
Table 1 Static source code feature dimensions
S4: Establish the scoring and ranking model and the screening mechanism model. A timed scanning program scans the APPs that carry a business classification label and are in the pending review state, and filters them according to the rules;
S5: Pass the large volume of APPs through the rule matching model for rule matching and through the screening mechanism model for branch screening; the output results form the classification of the APPs.
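Part of the static analysis in step S2 can be sketched with the Python standard library alone, because an APK is a ZIP archive. The sketch below only inspects the archive's entry names, under several stated assumptions: bundled `.so` files under `lib/` stand in for the "static library" dimension, only v1 (JAR-style) signature files under `META-INF/` are detected, and all function and key names are illustrative. Permissions and service declarations live in `AndroidManifest.xml`, which is stored as binary XML inside a built APK and needs a dedicated parser, so they are not extracted here.

```python
import io
import zipfile


def list_static_artifacts(apk) -> dict:
    """Summarize static artifacts visible from the APK's ZIP directory.

    `apk` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(apk) as zf:
        names = zf.namelist()
    return {
        "has_manifest": "AndroidManifest.xml" in names,
        "layout_files": sorted(n for n in names if n.startswith("res/layout")),
        # Assumption: bundled .so files stand in for the "static library" dimension.
        "native_libs": sorted({n.rsplit("/", 1)[-1] for n in names
                               if n.startswith("lib/") and n.endswith(".so")}),
        # v1 (JAR) signing only; newer signing schemes store the certificate elsewhere.
        "signature_files": sorted(n for n in names
                                  if n.startswith("META-INF/")
                                  and n.upper().endswith((".RSA", ".DSA", ".EC"))),
    }


def build_demo_apk() -> io.BytesIO:
    """Create a tiny in-memory archive with APK-like entry names (not a real APK)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("AndroidManifest.xml", "classes.dex", "res/layout/main.xml",
                     "lib/arm64-v8a/libcrypto.so", "lib/armeabi-v7a/libcrypto.so",
                     "META-INF/CERT.RSA"):
            zf.writestr(name, b"")
    buf.seek(0)
    return buf
```

`list_static_artifacts(build_demo_apk())` exercises the function without needing a real package; the same call accepts a path to an APK file on disk.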
Further, in step S4, the specific rules are: perform keyword feature matching on the APK name, localization configuration, and the traffic's domain name, IP, path, header information, and request content; perform data set feature matching on the LOGO, package name, signature HASH, and signature OWNER; and calculate a score for each dimension of the APK from the matching degree and weight of the feature. The weights are designed according to the technical characteristics and content characteristics of the APK: technical characteristics are weighted higher than content characteristics, and the weight of a technical characteristic is determined by the degree of correlation of the feature attribute. Reference weight examples are shown in Table 2;
Table 2 Weight reference examples
Scores and dimensions: for each APK, two scores A and B are aggregated. A is the total score of the content attributes (APK name and localization configuration); B is the total score of dimensions such as package name, signature, and LOGO. The number of dimensions hit is also recorded: A1 is the number of dimensions hit in A, and B1 is the number of dimensions hit in B.
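The aggregation of A, B, A1, and B1 can be sketched as follows. The text says only that each dimension's score is computed from the matching degree and the weight, and that technical characteristics weigh more than content characteristics. Multiplying the two, representing the matching degree as a number in [0, 1], and the specific weight values are assumptions made for illustration; the weight values are hypothetical and are not the values of Table 2.

```python
# Hypothetical weights: technical dimensions outweigh content dimensions.
CONTENT_WEIGHTS = {"app_name": 1.0, "localization": 1.0}
TECHNICAL_WEIGHTS = {"package_name": 2.0, "signature": 3.0, "logo": 2.0}


def summarize(match_degree: dict) -> tuple:
    """Return (A, A1, B, B1) for one APK.

    `match_degree` maps a dimension name to a matching degree in [0, 1];
    dimensions that are absent count as not hit.
    """
    def total(weights: dict) -> tuple:
        degrees = {dim: match_degree.get(dim, 0.0) for dim in weights}
        score = sum(weights[dim] * degree for dim, degree in degrees.items())
        hits = sum(1 for degree in degrees.values() if degree > 0)
        return score, hits

    a, a1 = total(CONTENT_WEIGHTS)
    b, b1 = total(TECHNICAL_WEIGHTS)
    return a, a1, b, b1
```

With these assumed weights, an APK that fully matches on name and signature and half matches on package name yields `A = 1.0`, `A1 = 1`, `B = 4.0`, and `B1 = 2`.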
Identifying APKs that meet the classification requirements: if B1 is greater than 1, that is, the APK matches the feature rules in more than one of the package name, signature, and LOGO dimensions, it is judged to be an APK of the expected category. The criteria for score calculation are shown in Table 3;
Table 3 Criteria for score calculation
In step S5, if B1 equals 1 and A1 is greater than 1, the APK is sent for manual review, because the APK name and localization configuration match rules in multiple dimensions and further review is needed to determine the category. Feature rule extraction and merging: feature rules are extracted from the APKs judged in the previous step to belong to the category and merged into the rule set of step S4, so that the rule set is enriched and can be updated and improved over time.
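The rule extraction and merging step amounts to feeding confirmed feature values back into the rule set. A minimal Python sketch, assuming the rule set is stored as a mapping from dimension name to a set of known values (a representation chosen here for illustration, not stated in the source):

```python
def merge_rules(rule_set: dict, extracted: dict) -> dict:
    """Return a new rule set enriched with features from confirmed APKs.

    Both arguments map a dimension name (e.g. "package_name") to a set of
    values; the inputs are left unmodified.
    """
    merged = {dim: set(values) for dim, values in rule_set.items()}
    for dim, values in extracted.items():
        merged.setdefault(dim, set()).update(values)
    return merged
```

Returning a copy rather than mutating the live rule set keeps a scan that is already running on the old rules consistent until the merged set is swapped in; that is a design choice of this sketch, not a requirement from the text.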
本实施例中,本发明提供一种计算机可读存储介质,存储介质存储有计算机程序,所述计算机程序被主控制器执行时实现如上述中的任一项所述的方法。In this embodiment, the present invention provides a computer-readable storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a main controller, the method described in any one of the above is implemented.
The above is only a preferred embodiment of the present invention and is not intended to limit it. Those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311471891.4A CN118277914B (en) | 2023-11-07 | 2023-11-07 | A mobile application classification method based on dynamic and static combined multi-dimensional APK features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118277914A (en) | 2024-07-02 |
CN118277914B (en) | 2025-01-24 |
Family
ID=91647616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311471891.4A Active CN118277914B (en) | 2023-11-07 | 2023-11-07 | A mobile application classification method based on dynamic and static combined multi-dimensional APK features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118277914B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9152694B1 (en) * | 2013-06-17 | 2015-10-06 | Appthority, Inc. | Automated classification of applications for mobile devices |
CN107133248A (en) * | 2016-02-29 | 2017-09-05 | 阿里巴巴集团控股有限公司 | The sorting technique and device of a kind of application program |
CN110245273A (en) * | 2019-06-21 | 2019-09-17 | 武汉绿色网络信息服务有限责任公司 | A method and corresponding device for acquiring APP service feature database |
CN112257032A (en) * | 2019-10-21 | 2021-01-22 | 国家计算机网络与信息安全管理中心 | Method and system for determining APP responsibility subject |
CN112464232A (en) * | 2020-11-21 | 2021-03-09 | 西北工业大学 | Android system malicious software detection method based on mixed feature combination classification |
Non-Patent Citations (2)
Title |
---|
MARTINA LINDORFER et al.: "MARVIN: Efficient and Comprehensive Mobile App Classification Through Static and Dynamic Analysis", 《2015 IEEE 39TH ANNUAL INTERNATIONAL COMPUTERS, SOFTWARE & APPLICATIONS CONFERENCE》, 24 September 2015 (2015-09-24), pages 422 - 433 * |
吴月明 et al.: "图卷积网络的抗混淆安卓恶意软件检测", 《软件学报》, vol. 34, no. 6, 1 June 2023 (2023-06-01), pages 2526 - 2542 * |
Also Published As
Publication number | Publication date |
---|---|
CN118277914B (en) | 2025-01-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||