
CN114996287B - Automatic equipment identification and capacity expansion method based on feature library - Google Patents


Info

Publication number
CN114996287B
Authority
CN
China
Prior art keywords
feature
vector
unknown
equipment
iot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210695817.XA
Other languages
Chinese (zh)
Other versions
CN114996287A (en
Inventor
赵金凤
吴小东
奚培锋
郭曦泽
程睿远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Electrical Apparatus Research Institute Group Co Ltd
Original Assignee
Shanghai Electrical Apparatus Research Institute Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Electrical Apparatus Research Institute Group Co Ltd
Priority to CN202210695817.XA
Publication of CN114996287A
Application granted
Publication of CN114996287B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatic device identification and capacity expansion based on a feature library. When an unknown device is connected, the method compares its features with those of the devices already in the device library and identifies known devices by combining device identification rules with an identification algorithm model. For unknown devices, working in a feature vector space model, the method classifies the unknown device types according to a similarity criterion, pins down or narrows the range of devices to be identified, automatically generates device feature information and stores it in a designated area, and, after manual confirmation, automatically updates it into the device library. Compared with fully manual extraction of device features, the workload of connecting heterogeneous edge devices is reduced by at least half. The invention thus addresses the inability of current device feature libraries to expand automatically and improves the efficiency of managing edge device access and configuration.

Description

Automatic device identification and capacity expansion method based on a feature library

Technical Field

The invention relates to a method for automatic device identification and capacity expansion based on a feature library, and belongs to the field of industrial control automation.

Background Art

With the rapid development of Internet of Things (IoT) technology, more and more edge devices are becoming networked and intelligent, and the workload of connecting networked devices at the edge is growing rapidly. Existing edge device identification methods rely mainly on the device feature information held in a feature library, so they can identify only devices already known to the device feature library; devices that are not in the device library cannot be identified.

In recent years, devices using unknown or proprietary network communication protocols have appeared one after another. The diversification and heterogeneity of edge devices not only make protocol identification and analysis harder but also cause the workload of protocol feature extraction to grow sharply, which poses a serious challenge to the automation upgrades of manufacturing enterprises. At the same time, as the number of connected device types grows, the device feature library has to be upgraded continually. Current device feature libraries, however, lack an automatic update mechanism, so extracting features for such a large number of devices is heavy, tedious work that requires a great deal of manual effort.

Identifying heterogeneous edge devices is an important prerequisite for establishing IoT connections. Because the protocols, performance, and other properties of different categories of edge devices vary widely, edge devices should be identified with a classification-based strategy, and the first step of classified management is accurate identification of the device type. Identifying devices in cyberspace quickly and accurately, and determining their attributes at a fine granularity, helps the device library keep expanding to support more devices and also reduces the work technicians spend extracting device features, improving efficiency.

Invention patent application No. 202110974559.4 proposes a method and system for real-time detection and identification of distribution network IoT terminal devices. That application focuses on the power distribution network field: based on an existing basic information library of distribution network IoT terminal devices, it identifies online terminal devices through comparison and quantification, increasing the credibility and transparency of distribution network operation. However, the basic device information library lacks an automatic update mechanism.

Invention patent application No. 202010187111.3 proposes a terminal device identification system and method. For unknown devices in the device feature library, that application creates a correspondence between monitoring modules and devices, enabling rapid identification and configuration of terminal devices and improving the efficiency of batch networking and integrated management of terminal devices. It does not address a device library update mechanism.

Invention patent application No. 202011643313.0 proposes a feature library update method, apparatus, network device, and readable storage medium. That application focuses on the field of network security: it loads and compiles a second feature library, which replaces a first feature library, into a specified data structure in the shared memory of the network device and sets a synchronization lock, mitigating network security problems while the feature library is being updated. However, its update method does not meet the needs of automatic edge device identification and feature library expansion in the industrial control field.

Summary of the Invention

The technical problem to be solved by the invention is that, in existing edge device identification methods, the device feature library lacks an automatic update mechanism even though the library must be upgraded continually as the number of connected device types grows.

To solve this technical problem, the technical solution of the invention provides a method for automatic device identification and capacity expansion based on a feature library, characterized by comprising the following steps:

Step 1: Abstract the feature message of an IoT device into a word frequency vector composed of feature words; after feature engineering, convert the IoT device information into the form of a multidimensional feature vector; and, based on the accumulated business of devices already connected, build a device feature library in the cloud;

Step 2: When a new IoT device comes online, obtain the HTTP response packet of the IoT device through the sample collection module as the original sample;

Step 3: The feature extraction module extracts the sample features of the original sample:

The feature extraction module extracts the information in the HTTP response packet that characterizes the IoT device, and then uses feature engineering to obtain the corresponding vectorized word vector information as the sample features;

Step 4: The data preprocessing module preprocesses the sample features extracted by the feature extraction module, converting text-type sample features into numerical sample features and thereby converting the IoT device information into a multidimensional feature vector;

Step 5: The algorithm recognition module takes the multidimensional feature vector of the newly online IoT device as input and matches it against the multidimensional feature vectors of the labeled IoT devices of known types in the device feature library. If the vector matches the multidimensional feature vector of any labeled IoT device of a known type in the library, the newly online device is a known device and is thereby identified. Otherwise, the device is an unknown device, and the algorithm recognition module uses the improved constrained seed K-means recognition algorithm to compute vector similarity and to identify and classify the device. The improved constrained seed K-means recognition algorithm measures similarity by the cosine similarity of two multidimensional feature vectors and clusters with the K-means recognition algorithm on that basis. During clustering, when the cosine similarity between the multidimensional feature vector of the unknown device and the cluster center of a cluster of a known device type is greater than a given threshold ε, the unknown device is assigned to that cluster, its device type is the device type of that cluster, and the corresponding device and communication features are generated from its multidimensional feature vector and stored in a designated area. When the cosine similarity between the multidimensional feature vector of the unknown device and the cluster centers of all clusters is not greater than the given threshold ε, the unknown device belongs to a new device category: a new device type is created automatically from the multidimensional feature vector of the unknown device, the unknown device is assigned to the new device type, and the corresponding device and communication features are generated from the vector and stored in the designated area;

Step 6: A person reads the device and communication features stored in the designated area, obtains the new device types, manually proofreads the manufacturer, device category, model, communication features, and other information of each unknown device, and manually confirms whether the classification of the unknown device and the new device type are correct. Once this manual confirmation is complete, the unknown device is identified, and the new device types and the device and communication features of the identified devices are automatically updated into the device feature library, achieving semi-automatic expansion of the device feature library.

Preferably, step 2 comprises the following steps:

Step 201: The sample collection module performs a port scan over the entire IP address space to obtain the IP addresses of unlabeled, unknown IoT devices, and adds the IP addresses of the labeled IoT devices of known types in the device feature library to form a device IP address set;

Step 202: The sample collection module sends a request to every IP address in the device IP address set and obtains the complete HTTP response header as the original sample of the corresponding online IoT device.

Preferably, step 3 comprises the following steps:

Step 301: The feature extraction module counts the total number of header fields in the HTTP response packet and removes redundant information;

Step 302: The feature extraction module selects the most frequently occurring fields among all header fields as the device feature information, and then applies feature engineering to obtain the corresponding vectorized word vector information, which constitutes the sample features of the original sample.

Preferably, in step 5, the cosine similarity cos θ of vector X and vector Y is calculated as:

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

where x_i and y_i are the i-th elements of vector X and vector Y, respectively, and n is the dimension of the vectors.

When an unknown device is connected, the invention compares its features with those of the devices already in the device library and identifies known devices by combining device identification rules with an identification algorithm model. For unknown devices, working in a feature vector space model, the invention classifies the unknown device types according to a similarity criterion, pins down or narrows the range of devices to be identified, automatically generates device feature information and stores it in a designated area, and, after manual confirmation, automatically updates it into the device library. Compared with fully manual extraction of device features, the workload of connecting heterogeneous edge devices is reduced by at least half. The invention therefore overcomes the inability of current device feature libraries to expand automatically and improves the efficiency of managing edge device access and configuration.

Specifically, compared with the prior art, the invention has the following beneficial effects:

(1) After an unknown device comes online, a cloud-driven scan obtains the device's original sample information and vectorizes the feature values. With the device's multidimensional feature vector as input, the improved constrained seed K-means recognition algorithm computes vector similarity and groups unknown devices by the degree of similarity of their features, narrowing the range of candidate devices.

(2) For device library expansion, feature extraction for unknown devices, which was previously done entirely by hand, is replaced by a process in which a machine recognition algorithm first groups and defines the unknown devices and automatically generates their features and stores them in a designated area, narrowing the scope, after which a person intervenes. This achieves semi-automatic expansion of the device library and greatly reduces the workload of on-site personnel.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the device library expansion method provided by the invention;

FIG. 2 is a flow chart of the recognition algorithm used in the invention.

Detailed Description

The invention is further described below with reference to specific embodiments. It should be understood that these embodiments serve only to illustrate the invention and do not limit its scope. It should further be understood that, after reading the teachings of the invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.

Traditional device identification based on the rules of an existing feature library can recognize only a limited range of new device types and scales poorly. Taking the characteristics of IoT device information into account, and working in a vector space model, the invention adopts an improved constrained seed K-means recognition algorithm: on top of the original K-means recognition algorithm it introduces the cosine distance function as the similarity measure. Cosine distance emphasizes differences in direction and is insensitive to absolute magnitude, which compensates for measurement scales that may not be uniform across samples from different devices. Choosing cosine distance as the distance measure improves device clustering and the ability to discover new device categories.
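
The text does not spell out the internals of the improved constrained seed K-means algorithm, so the following is only a minimal sketch under one common reading of seed-constrained K-means: labeled devices (the seeds) initialise one cluster per known device type and never change cluster, while unlabeled vectors are assigned to the centre with the highest cosine similarity. The function names, the data layout, and the iteration limit are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def seeded_cosine_kmeans(seeds, unlabeled, iterations=10):
    """seeds: {device_type: [vector, ...]}; unlabeled: [vector, ...].

    Returns (centers, assignments), where assignments[i] is the device type
    chosen for unlabeled[i]. Seed vectors stay in their own cluster throughout,
    which is the 'constraint' assumed in this sketch.
    """
    centers = {t: mean_vector(vs) for t, vs in seeds.items()}
    assignments = []
    for _ in range(iterations):
        # Assign every unlabeled vector to the most similar centre.
        new_assignments = [
            max(centers, key=lambda t: cosine(v, centers[t])) for v in unlabeled
        ]
        if new_assignments == assignments:
            break  # converged: no assignment changed
        assignments = new_assignments
        # Recompute each centre from its seeds plus its assigned vectors.
        members = defaultdict(list)
        for t, vs in seeds.items():
            members[t].extend(vs)
        for v, t in zip(unlabeled, assignments):
            members[t].append(v)
        centers = {t: mean_vector(vs) for t, vs in members.items()}
    return centers, assignments
```

Because every cluster always retains its seeds, no cluster can become empty in this sketch. It does not create new clusters; the thresholding against ε that decides whether a device belongs to a new type is a separate step.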

Based on the above principle, the method for automatic device identification and capacity expansion based on a feature library provided by the invention has three key points: first, the configuration of the cloud device library; second, the identification of edge devices; third, the classified management of unknown devices by the improved K-means recognition algorithm. It specifically includes the following steps:

Step 1: Abstract the feature message of an IoT device into a word frequency vector composed of feature words; after feature engineering, convert the IoT device information into the form of a multidimensional feature vector; and, based on the accumulated business of devices already connected, build a device feature library in the cloud.

Step 2: When an IoT device comes online, obtain the original sample of the IoT device through the sample collection module, which includes the following steps:

Step 201: The sample collection module performs a port scan over the entire IP address space to obtain the IP addresses of unlabeled, unknown IoT devices, and adds the IP addresses of the labeled IoT devices of known types in the device feature library to form a device IP address set;

Step 202: The sample collection module sends a request to every IP address in the device IP address set and obtains the complete HTTP response header as the original sample of the corresponding online IoT device.
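
How the request is sent and the response captured is not specified, so the sketch below starts from an already captured response head and only shows how it might be split into header fields for later processing. The function name and the sample response (including the `ExampleCam` product string) are made up for illustration.

```python
def parse_response_head(raw_head):
    """Split a raw HTTP response head into (status_line, [(name, value), ...]).

    Header names are lower-cased because HTTP field names are case-insensitive.
    """
    lines = raw_head.replace("\r\n", "\n").split("\n")
    status_line = lines[0].strip()
    headers = []
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, sep, value = line.partition(":")
        if sep:  # skip malformed lines that contain no colon
            headers.append((name.strip().lower(), value.strip()))
    return status_line, headers


# Hypothetical response head from a fictional device.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Server: ExampleCam-httpd/1.0\r\n"
    "Content-Type: text/html\r\n"
    'WWW-Authenticate: Basic realm="ExampleCam"\r\n'
    "\r\n"
)
status, fields = parse_response_head(sample)
```

Keeping the fields as an ordered list of pairs rather than a dictionary preserves repeated header fields, which can themselves be a distinguishing trait of a device.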

Step 3: The feature extraction module extracts the sample features of the original sample:

The feature extraction module extracts the information in the HTTP response packet that characterizes the IoT device as the sample features, which includes the following steps:

Step 301: The feature extraction module counts the total number of header fields in the HTTP response packet and removes redundant information, in order to reduce computational complexity and improve device identification efficiency;

Step 302: The feature extraction module selects the most frequently occurring fields among all header fields as the device feature information, and then applies feature engineering to obtain the corresponding vectorized word vector information, which constitutes the sample features of the original sample.
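
Steps 301 and 302 can be sketched as a frequency count over header field names. Which fields count as redundant and how many top fields are kept are not stated in the text, so `REDUNDANT_FIELDS` and `top_k` below are assumptions, as is the function name.

```python
from collections import Counter

# Assumed list of fields that carry little device-specific information.
REDUNDANT_FIELDS = {"date", "content-length", "connection"}


def select_feature_fields(samples, top_k=3):
    """samples: list of header lists, each [(name, value), ...] with lower-case names.

    Counts in how many samples each non-redundant field name appears and
    returns the top_k most frequent names, most frequent first.
    """
    counts = Counter()
    for headers in samples:
        # dict.fromkeys de-duplicates names within one sample, keeping order.
        for name in dict.fromkeys(name for name, _ in headers):
            if name not in REDUNDANT_FIELDS:
                counts[name] += 1
    return [name for name, _ in counts.most_common(top_k)]
```

Counting each field once per sample, rather than once per occurrence, keeps a single chatty device from dominating the statistics; that choice is also an assumption of this sketch.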

Step 4: The data preprocessing module preprocesses the sample features extracted by the feature extraction module, converting text-type sample features into numerical sample features and thereby converting the IoT device information into a multidimensional feature vector.

Through the data preprocessing module, the invention converts text content into vectors in a multidimensional vector space, and the similarity of two vectors in that space can be used to represent the similarity of the corresponding text content. The algorithm recognition module provided by the invention therefore adopts the improved K-means recognition algorithm to identify device categories.
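
One simple way to turn text features into a numerical vector, consistent with the word frequency vector mentioned in step 1, is a term-frequency encoding over a fixed vocabulary. The tokenisation rule and the function names below are assumptions; the text does not fix a particular feature engineering scheme.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts):
    """Sorted list of every token seen in the given texts."""
    return sorted({tok for text in texts for tok in tokenize(text)})


def to_term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical header texts from two devices.
vocabulary = build_vocabulary(["Server: ExampleCam httpd", "Server: nginx"])
vector = to_term_frequency_vector("server nginx nginx", vocabulary)
```

Every device is encoded against the same vocabulary, so all vectors share one dimension and can be compared position by position.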

Step 5: The algorithm recognition module takes the multidimensional feature vector of the newly online IoT device as input and matches it against the multidimensional feature vectors of the labeled IoT devices of known types in the device feature library. If the vector matches the multidimensional feature vector of any labeled IoT device of a known type in the library, the newly online device is a known device and is thereby identified. Otherwise, the device is an unknown device, and the algorithm recognition module uses the improved constrained seed K-means recognition algorithm to compute vector similarity and to identify and classify the device.

The traditional K-means recognition algorithm usually takes the Euclidean distance between vectors as its measure. Suppose the multidimensional feature vectors of IoT device 1 and IoT device 2 are X and Y, both n-dimensional, written X = (x_1, x_2, …, x_n) and Y = (y_1, y_2, …, y_n). The Euclidean distance d(X, Y) between vector X and vector Y is calculated as:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}}$$

where x_i and y_i are the i-th elements of vector X and vector Y, respectively.

However, Euclidean distance mainly reflects absolute differences in the numerical values of individual features and is not well suited to data sets of IoT device information.

The improved constrained seed K-means recognition algorithm of the invention compares multidimensional feature vectors by cosine similarity, measuring the similarity of two vectors projected into a multidimensional space by the cosine of the angle between them. The cosine similarity cos θ of vector X and vector Y is calculated as:

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

When the angle between vector X and vector Y is 0°, cos θ is 1; when the angle is 90°, cos θ is 0; when the two vectors point in exactly opposite directions, cos θ is -1. Without normalization, cosine similarity lies in the range [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors; the closer it is to -1, the more nearly opposite their directions; a value close to 0 means the two vectors are nearly orthogonal.

Euclidean distance and cosine similarity are computed differently and measure different things. Two similar texts may be far apart in Euclidean distance simply because they contain different amounts of data, yet the angle between them is small, so their cosine similarity is high. Euclidean distance mainly reflects absolute differences in the numerical values of individual features, whereas cosine similarity emphasizes differences in vector direction and is insensitive to absolute magnitude. This property compensates for measurement scales that may not be uniform across samples from different devices.
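
The contrast between the two measures can be illustrated with a small sketch. The two vectors are made-up term counts with identical proportions but different totals; the function names are chosen for this example only.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen for this sketch; undefined mathematically
    return dot / (norm_x * norm_y)


short_doc = [1, 2, 0, 1]
long_doc = [10, 20, 0, 10]  # same proportions, ten times the counts

distance = euclidean_distance(short_doc, long_doc)    # large: the magnitudes differ
similarity = cosine_similarity(short_doc, long_doc)   # about 1: the directions coincide
```

The Euclidean distance grows with the difference in total counts, while the cosine similarity stays at essentially 1 because both vectors point in the same direction.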

Therefore, the improved constrained seed K-means recognition algorithm of the invention uses cosine similarity to analyze the similarity of the multidimensional feature vectors of two IoT devices and so identify the devices.

For example, let the multidimensional feature vectors of IoT device 1 and IoT device 2 be vector A and vector B, with A = (7, 0, 5, 3, 10, 0, 1, 0, 0) and B = (3, 0, 2, 1, 4, 0, 0, 0, 1). Then:

$$A \cdot B = 7\times3 + 0\times0 + 5\times2 + 3\times1 + \dots + 0\times1 = 74$$

$$|A| = \sqrt{7^{2}+0^{2}+5^{2}+3^{2}+10^{2}+0^{2}+1^{2}+0^{2}+0^{2}} = \sqrt{184} \approx 13.56$$

$$|B| = \sqrt{3^{2}+0^{2}+2^{2}+1^{2}+4^{2}+0^{2}+0^{2}+0^{2}+1^{2}} = \sqrt{31} \approx 5.57$$

$$\cos\theta = \frac{A \cdot B}{|A|\,|B|} = \frac{74}{\sqrt{184}\,\sqrt{31}} \approx 0.98$$

This shows that a cosine similarity analysis of the multidimensional feature vectors of IoT device 1 and IoT device 2 finds the two highly similar, which agrees with the actual situation. Cosine similarity is therefore chosen to measure how similar an unknown device is to the labeled IoT devices of known types in the device feature library, and unknown devices are classified according to cosine similarity.

(a) When the cosine similarity between the multidimensional feature vector of the unknown device and the multidimensional feature vector of the cluster center of the cluster corresponding to some type of labeled IoT device in the device feature library is greater than a given threshold ε (in this embodiment, ε = 0.9), the unknown device is assigned to that cluster, its type is the type of that cluster, and the corresponding device and communication features are generated from its multidimensional feature vector and stored in the designated area.

(b) When the cosine similarity between the multidimensional feature vector of the unknown device and the multidimensional feature vectors of the cluster centers of all clusters corresponding to all types of labeled IoT devices in the device feature library is not greater than the given threshold ε, the unknown device belongs to a new device category: a new device type is created automatically from the multidimensional feature vector of the unknown device, the unknown device is assigned to the new device type, and the corresponding device and communication features are generated from the vector and stored in the designated area;
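
Cases (a) and (b) amount to a threshold test against the cluster centres, which can be sketched as follows. The `pending` list stands in for the designated area, and the naming scheme for new types, the record layout, and the rule of picking the most similar centre when several exceed ε are all assumptions of this sketch.

```python
import math

EPSILON = 0.9  # threshold value used in the embodiment


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def classify_unknown(vector, cluster_centers, pending, epsilon=EPSILON):
    """cluster_centers: {device_type: center_vector}; pending: list acting as
    the designated area that awaits manual review.

    Returns the device type the vector was filed under.
    """
    best_type, best_sim = None, -1.0
    for device_type, center in cluster_centers.items():
        sim = cosine_similarity(vector, center)
        if sim > best_sim:
            best_type, best_sim = device_type, sim
    if best_type is not None and best_sim > epsilon:
        device_type, is_new = best_type, False  # case (a): join the existing cluster
    else:
        # case (b): open a new type; the name is a hypothetical placeholder
        device_type = f"new-type-{len(cluster_centers) + 1}"
        cluster_centers[device_type] = list(vector)
        is_new = True
    pending.append(
        {"type": device_type, "vector": list(vector),
         "new_type": is_new, "similarity": best_sim}
    )
    return device_type
```

With a cluster centre of (3, 0, 2, 1, 4, 0, 0, 0, 1) and an unknown vector of (7, 0, 5, 3, 10, 0, 1, 0, 0), the similarity is about 0.98, so the device falls under case (a) at ε = 0.9.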

Step 6: An operator reads the device and communication features stored in the designated area, obtains the new device type, manually checks the unknown device's manufacturer, device category, model, communication features, and other information, and manually confirms whether the classification of the unknown device and the new device type are correct. Once this manual confirmation is complete, the unknown device is identified, and the new device type together with the device and communication features of the identified device are automatically written to the device feature library. This achieves semi-automatic expansion of the device feature library and greatly reduces the workload of on-site personnel.
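The confirm-then-update flow of step 6 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: `PendingRecord`, `FeatureLibrary`, `review`, and the `confirm` callback (standing in for the human reviewer) are all hypothetical names, and the designated area is modeled as a plain Python list.

```python
from dataclasses import dataclass, field


@dataclass
class PendingRecord:
    """One entry in the designated area awaiting manual review."""
    vector: list
    proposed_type: str
    is_new_type: bool


@dataclass
class FeatureLibrary:
    """Hypothetical stand-in for the device feature library."""
    # device type -> list of confirmed feature vectors
    types: dict = field(default_factory=dict)

    def add(self, device_type, vector):
        self.types.setdefault(device_type, []).append(list(vector))


def review(pending, library, confirm):
    """Apply the reviewer's decision to every pending record.

    confirm(record) returns the confirmed (possibly corrected) device
    type, or None if the reviewer rejects or defers the record.
    Returns the records that were not confirmed.
    """
    remaining = []
    for record in pending:
        confirmed_type = confirm(record)
        if confirmed_type is None:
            remaining.append(record)  # stays in the designated area
        else:
            # Only confirmed records reach the library (semi-automatic).
            library.add(confirmed_type, record.vector)
    return remaining
```

The point of the design is that nothing enters the library without a reviewer decision, while the write itself is automatic once the decision is made.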

Claims (4)

1. A method for automatic device identification and capacity expansion based on a feature library, characterized in that it comprises the following steps:

Step 1: abstracting the characteristic message of an IoT device into a word-frequency vector composed of feature words; after feature engineering, converting the IoT device information into the form of a multidimensional feature vector; and establishing a device feature library in the cloud based on the accumulated service data of connected devices;

Step 2: when a new IoT device comes online, obtaining the HTTP response packet of the IoT device as an original sample through a sample collection module;

Step 3: extracting the sample features of the original sample by a feature extraction module: the feature extraction module extracts the information in the HTTP response packet that reflects the IoT device, and then uses feature engineering to obtain the corresponding vectorized word-vector information as the sample features;

Step 4: preprocessing, by a data preprocessing module, the sample features extracted by the feature extraction module, converting text-type sample features into numerical sample features, and thereby converting the IoT device information into a multidimensional feature vector;

Step 5: an algorithm recognition module takes the multidimensional feature vector of the newly online IoT device as input and matches it against the multidimensional feature vectors of the labeled IoT devices of known type in the device feature library; if the multidimensional feature vector of the newly online IoT device is identical to that of any labeled IoT device of known type in the device feature library, the newly online IoT device is a known device and is thereby identified; otherwise, the newly online IoT device is an unknown device, and the algorithm recognition module uses an improved constrained seed K-means recognition algorithm to compute vector similarity and to identify and classify the newly online IoT device, wherein the improved constrained seed K-means recognition algorithm measures similarity by the cosine similarity of two multidimensional feature vectors and clusters with the K-means recognition algorithm on the basis of cosine similarity; during clustering, when the cosine similarity between the multidimensional feature vector of the unknown device and the cluster center of a cluster of a known device type is greater than a given threshold ε, the unknown device is assigned to that cluster, the device type of the unknown device is the device type corresponding to that cluster, and the corresponding device and communication features are generated from the multidimensional feature vector of the unknown device and stored in a designated area; when the cosine similarity between the multidimensional feature vector of the unknown device and the cluster centers of all clusters is not greater than the given threshold ε, the unknown device belongs to a new device category, a new device type is automatically created from the multidimensional feature vector of the unknown device, the unknown device is assigned to the new device type, and the corresponding device and communication features are then generated from the multidimensional feature vector of the unknown device and stored in the designated area;

Step 6: manually reading the device and communication features stored in the designated area, obtaining the new device type, manually checking the manufacturer, device category, model, and communication feature information of the unknown device, and manually confirming whether the classification of the unknown device and the new device type are correct; after manual confirmation, the unknown device is identified, and the new device type and the device and communication features of the identified unknown device are automatically updated to the device feature library, thereby achieving semi-automatic expansion of the device feature library.

2. The method for automatic device identification and capacity expansion based on a feature library according to claim 1, characterized in that step 2 comprises the following steps:

Step 201: the sample collection module performs port scanning over the entire IP address space to obtain the IP addresses of unlabeled, unknown IoT devices, and adds the IP addresses of the labeled IoT devices of known type in the device feature library to form a device IP address set;

Step 202: the sample collection module sends a request to every IP address in the device IP address set and obtains the complete HTTP response packet header as the original sample of the corresponding online IoT device.

3. The method for automatic device identification and capacity expansion based on a feature library according to claim 1, characterized in that step 3 comprises the following steps:

Step 301: the feature extraction module counts the total number of header fields in the HTTP response packet and removes redundant information;

Step 302: the feature extraction module selects, from all header fields, the fields that occur most frequently as the device feature information, and then obtains the corresponding vectorized word-vector information through feature engineering; this vectorized word-vector information is the sample feature of the original sample.
4. The method for automatic device identification and capacity expansion based on a feature library according to claim 1, characterized in that, in step 5, the cosine similarity cos θ of vector X and vector Y is calculated by the following formula:

$$\cos\theta = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^{2}}\,\sqrt{\sum_{i} y_i^{2}}}$$

where x_i and y_i are the i-th elements of vector X and vector Y, respectively.
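As a worked illustration of the cosine similarity in claim 4 (the vectors below are made-up values, not data from the patent), take X = (1, 2, 0) and Y = (2, 4, 0), which point in the same direction, and Z = (1, 1, 0) compared against W = (1, 0, 1):

$$\cos\theta_{XY} = \frac{1\cdot 2 + 2\cdot 4 + 0\cdot 0}{\sqrt{1^2+2^2+0^2}\,\sqrt{2^2+4^2+0^2}} = \frac{10}{\sqrt{5}\,\sqrt{20}} = \frac{10}{10} = 1$$

$$\cos\theta_{ZW} = \frac{1\cdot 1 + 1\cdot 0 + 0\cdot 1}{\sqrt{1^2+1^2+0^2}\,\sqrt{1^2+0^2+1^2}} = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5$$

With the embodiment's threshold ε = 0.9, the first pair would count as the same device type (1 > 0.9), while the second pair would not (0.5 is not greater than 0.9). The first pair also shows that scaling a vector does not change the result, since only direction matters.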

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210695817.XA CN114996287B (en) 2022-06-20 2022-06-20 Automatic equipment identification and capacity expansion method based on feature library

Publications (2)

Publication Number Publication Date
CN114996287A (en) 2022-09-02
CN114996287B (en) 2024-04-16

Family

ID=83035107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210695817.XA Active CN114996287B (en) 2022-06-20 2022-06-20 Automatic equipment identification and capacity expansion method based on feature library

Country Status (1)

Country Link
CN (1) CN114996287B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401369B (en) * 2023-06-07 2023-08-11 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110380989A (en) * 2019-07-26 2019-10-25 东南大学 The polytypic internet of things equipment recognition methods of network flow fingerprint characteristic two-stage
CN110381126A (en) * 2019-07-02 2019-10-25 山东建筑大学 Electrical equipment recognition methods, system, equipment and medium based on edge calculations
CN112383431A (en) * 2020-11-13 2021-02-19 武汉虹旭信息技术有限责任公司 Method and device for identifying data of internet of things in internet
CN112667750A (en) * 2019-09-30 2021-04-16 中兴通讯股份有限公司 Method and device for determining and identifying message category
CN112685063A (en) * 2020-12-30 2021-04-20 北京天融信网络安全技术有限公司 Feature library updating method and device, network equipment and readable storage medium
WO2021104444A1 (en) * 2019-11-27 2021-06-03 华为技术有限公司 Data flow classification method, apparatus and system
CN113408014A (en) * 2020-03-17 2021-09-17 南宁富桂精密工业有限公司 Terminal equipment identification system and method thereof

Also Published As

Publication number Publication date
CN114996287A (en) 2022-09-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant