CN107845407A

CN107845407A - Human physiological feature selection algorithm based on the combination of filtering and improved clustering

Info

Publication number: CN107845407A
Application number: CN201710733507.1A
Authority: CN
Inventors: 陈波; 俞洁; 高秀娥; 郑庆国; 白旭飞
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2017-08-24
Filing date: 2017-08-24
Publication date: 2018-03-27

Abstract

The invention discloses a human physiological feature selection algorithm based on the combination of a filter method and improved clustering, comprising: S1: selecting an impedance model, collecting first and second feature parameter data, and constructing an initial feature set and a final optimal subset; S2: introducing a filter algorithm and computing an HSIC value for each feature in the collected data; S3: sorting the feature set by HSIC value in descending order; S4: adding the top K features to the feature set, using the filter algorithm to remove parameters unrelated to body composition, and constructing the initial data set; S5: constructing a sparse feature graph from the data set according to the clustering algorithm; S6: screening redundant features within clusters using the improved clustering algorithm. The feature selection algorithm established by this application can improve the accuracy of body composition prediction and provides a more effective means of measurement for body composition research and clinical practice.

Description

Human Physiological Feature Selection Algorithm Based on the Combination of Filtering and Improved Clustering

Technical field

The invention belongs to the field of bioinformatics and in particular relates to a human physiological feature selection algorithm based on the combination of filtering and improved clustering.

Background

A balanced body composition plays an important role in maintaining the stability of the body's internal environment and is an important factor affecting human health. When disease occurs, changes in body composition often precede the clinical symptoms. Changes in body composition can therefore be used for correlation-based prediction of diseases such as hypertension, dyslipidemia and metabolic syndrome. However, many parameters affect body composition, and these parameters are highly nonlinear, redundant, and in part irrelevant.

Existing Wrapper algorithms remove redundant features and achieve good generalization performance, but their high complexity makes them unsuitable for large data sets. Filter algorithms assign each feature a weight according to a criterion and are computationally efficient, but they do not fully account for redundancy between features, so the selected feature subset is likely to contain many redundant features. Clustering methods divide the body composition parameter data into groups or clusters so that objects within a cluster are highly similar; by judging each cluster according to its distance from the center point they effectively screen out redundant features, but they cannot effectively screen out irrelevant features. A new dimensionality reduction method is therefore needed before high-dimensional body composition data are analyzed.

Summary of the invention

To address the deficiencies of the prior art, the invention proposes a human physiological feature selection algorithm based on the combination of filtering and improved clustering. The Filter feature selection algorithm is first used to remove features unrelated to the body composition category, and M-Chameleon feature clustering is then used to remove redundant features, so that the advantages of both Filter feature selection and feature clustering are exploited as fully as possible. A body composition prediction model built in this way can improve the accuracy of body composition prediction and provides a more effective means of measurement for body composition research and clinical application.

To achieve the above object, the invention provides a human physiological feature selection algorithm based on the combination of filtering and improved clustering, comprising:

S1: Select the impedance model, collect the first and second feature parameter data to construct the initial feature set and the final optimal subset, and initialize both the initial feature set and the final optimal subset as empty sets;

Further, body composition data measured with a human body composition analyzer (INBODY) are used as the data set, denoted T = (O, F, C), where O is the data sample set, F is the selected feature set, and C is the body composition category. The parameters that have an important influence on body composition, such as weight, height, age, gender and the impedance of each body segment, are used as the first feature parameters; the reciprocal 1/R_i, the square R_i² and the product R_iR_j of the segment impedances are used as the second feature parameters. The impedance values are the 1 kHz impedance parameters from INBODY. The first feature parameters (R1, R2, R3, R4, R5, A, H, W, where A is age, H is height and W is weight) and the second feature parameters (1/R1, 1/R2, 1/R3, 1/R4, 1/R5, R1R2, R1R3, R1R4, R1R5, R2R3, R2R4, R2R5, R3R4, R3R5, R4R5, etc.) form the original feature parameter set, denoted F = {f₁, f₂, …, f_m}. The body composition category set C includes body fat mass (BFM) and total body water (TBW).
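The construction of the original feature set can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `build_feature_set`, the dictionary layout and the sample values are assumptions, and gender is left out because it does not appear in the enumerated parameter list.

```python
from itertools import combinations


def build_feature_set(r, age, height, weight):
    """Build first and second feature parameters for one subject.

    r is a sequence of the five segment impedances R1..R5 (hypothetical
    input layout); the result maps a feature name to its value.
    """
    # First feature parameters: raw impedances plus age, height, weight.
    feats = {f"R{i}": float(v) for i, v in enumerate(r, start=1)}
    feats.update({"A": float(age), "H": float(height), "W": float(weight)})
    # Second feature parameters: reciprocal and square of each impedance.
    for i, v in enumerate(r, start=1):
        feats[f"1/R{i}"] = 1.0 / float(v)
        feats[f"R{i}^2"] = float(v) ** 2
    # Second feature parameters: pairwise products RiRj for i < j.
    for (i, vi), (j, vj) in combinations(list(enumerate(r, start=1)), 2):
        feats[f"R{i}R{j}"] = float(vi) * float(vj)
    return feats


# Illustrative values only, not measured data.
example = build_feature_set(
    [300.0, 310.0, 25.0, 250.0, 255.0], age=30, height=170, weight=65
)
```

With five impedances this yields 8 first feature parameters and 20 second feature parameters (5 reciprocals, 5 squares, 10 products).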

S2: Introduce a filter algorithm (Filter) and, for each feature in the collected data, compute the HSIC value under the body composition category C; this value characterizes how strongly the physiological feature is related to the body composition category;

Further, for each feature in F = {f₁, f₂, …, f_m}, define a nonlinear feature map $\varphi: F \to \mathcal{H}_F$ that maps the feature points f₁, f₂, …, f_m into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_F$, with kernel function

$$k(f_i, f_j) = \langle \varphi(f_i), \varphi(f_j) \rangle_{\mathcal{H}_F},$$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}_F}$ is the inner product on $\mathcal{H}_F$. Similarly, define a body composition category map $\psi: C \to \mathcal{H}_C$ that maps the body composition index space C into an RKHS $\mathcal{H}_C$, with kernel function

$$l(c_i, c_j) = \langle \psi(c_i), \psi(c_j) \rangle_{\mathcal{H}_C}.$$

In addition, the cross-covariance operator between a feature and the body composition category is defined as

$$C_{fc} = E_{fc}\big[(\varphi(f) - E_f[\varphi(f)]) \otimes (\psi(c) - E_c[\psi(c)])\big],$$

where $\otimes$ denotes the tensor product and $E_f$ and $E_c$ denote expectations. For each feature in F, the HSIC value under the body composition category c is computed. (HSIC is a kernel-based independence measure: a cross-covariance operator is defined on the RKHS, and an independence criterion is obtained from an empirical estimate of the operator norm. It can be used to measure the dependence between two data distributions and is widely used in feature selection and dimensionality reduction.) This value characterizes how strongly the physiological feature is related to the body composition category:

$$\mathrm{HSIC}(f, c) = \lVert C_{fc} \rVert_{HS}^{2}.$$

For a given feature f and body composition category c, the larger the HSIC value, the stronger the dependence of c on f.
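The HSIC score of one feature against one body composition variable can be sketched with NumPy as below. The text does not state which kernel or which estimator is used, so the Gaussian kernel, the median bandwidth heuristic, and the commonly used biased empirical estimate tr(KHLH)/(n−1)² are all assumptions here; the names `gaussian_gram` and `hsic` are hypothetical.

```python
import numpy as np


def gaussian_gram(x, sigma=None):
    """Gaussian-kernel Gram matrix of a sample (assumed kernel choice)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if sigma is None:
        # Median heuristic for the bandwidth (an assumption, not from the text).
        positive = sq[sq > 0]
        sigma = np.sqrt(0.5 * np.median(positive)) if positive.size else 1.0
    return np.exp(-sq / (2.0 * sigma ** 2))


def hsic(f, c, sigma_f=None, sigma_c=None):
    """Biased empirical HSIC between feature values f and target values c."""
    n = len(f)
    K = gaussian_gram(f, sigma_f)          # kernel matrix on the feature
    L = gaussian_gram(c, sigma_c)          # kernel matrix on the target
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2


# Synthetic illustration: a dependent pair scores higher than an independent one.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dependent = 2.0 * x + 0.1 * rng.normal(size=200)
y_independent = rng.normal(size=200)
score_dependent = hsic(x, y_dependent)
score_independent = hsic(x, y_independent)
```

Centering both Gram matrices with H is what makes the statistic an estimate of the cross-covariance operator norm rather than of a raw second moment.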

S3: Sort the feature set by HSIC value in descending order;

S4: Add the top K features to the feature set, use the Filter algorithm to remove parameters unrelated to body composition, and construct the initial data set;
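Steps S3 and S4 reduce to a sort followed by a cut-off. A minimal sketch, assuming the HSIC scores have already been computed and stored in a dictionary; the function name `top_k_features` and the score values are hypothetical.

```python
def top_k_features(scores, k):
    """Return the names of the k features with the largest HSIC score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical scores for illustration only.
hsic_scores = {"W": 0.31, "H": 0.12, "A": 0.08, "R1": 0.02, "1/R5": 0.22}
selected = top_k_features(hsic_scores, k=3)
```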

S5: Construct a sparse feature graph from the data set according to the clustering algorithm (M-Chameleon), where RI characterizes the edges interconnecting features and RC the closeness between features, and initialize the expected number of clusters k;

Further, Chameleon uses agglomerative hierarchical clustering and constructs the sparse feature graph as a K-nearest-neighbor graph. Each vertex in the graph represents a data object; an edge exists between two vertices, and the edge weight reflects the similarity of the two objects. The principle of the algorithm is shown in Figure 1. The similarity of feature subclusters is evaluated on two criteria: 1) the interconnectivity of the objects in the clusters; 2) the closeness of the clusters. If two feature clusters are highly interconnected and close together, they are merged in preference to clusters that are farther apart. The similarity between two feature clusters is determined by their relative interconnectivity RI and relative closeness RC. Given the normalized feature data set F = {f₁, f₂, …, f_m} obtained after Filter screening, the cluster F is partitioned into subclusters f₁ and f₂ such that the total weight of the edges cut by the bisection is minimized. The relative interconnectivity RI(f₁, f₂) of two feature clusters f₁ and f₂ is defined as the absolute interconnectivity between f₁ and f₂ normalized by the internal interconnectivity of the two clusters, that is:

$$RI(f_1, f_2) = \frac{\lvert EC_{\{f_1, f_2\}} \rvert}{\tfrac{1}{2}\left(\lvert EC_{f_1} \rvert + \lvert EC_{f_2} \rvert\right)}$$

where $EC_{\{f_1, f_2\}}$ is the edge cut of the cluster containing f₁ and f₂, and $EC_{f_1}$ (or $EC_{f_2}$) is the minimum sum of the cut edges that partition f₁ (or f₂) into two roughly equal parts.

The relative closeness RC(f₁, f₂) of two feature clusters f₁ and f₂ is defined as the absolute closeness between f₁ and f₂ normalized by the internal closeness of the two clusters, that is:

$$RC(f_1, f_2) = \frac{\bar{S}_{EC_{\{f_1, f_2\}}}}{\dfrac{\lvert f_1 \rvert}{\lvert f_1 \rvert + \lvert f_2 \rvert}\,\bar{S}_{EC_{f_1}} + \dfrac{\lvert f_2 \rvert}{\lvert f_1 \rvert + \lvert f_2 \rvert}\,\bar{S}_{EC_{f_2}}}$$

where $\bar{S}_{EC_{\{f_1, f_2\}}}$ is the average weight of the edges connecting vertices in f₁ to vertices in f₂, and $\bar{S}_{EC_{f_1}}$ (or $\bar{S}_{EC_{f_2}}$) is the average weight of the edges in the minimum bisection of cluster f₁ (or f₂). The similarity between two subclusters is determined by the relative interconnectivity and relative closeness of the feature subclusters f₁ and f₂.
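The graph construction and the two ratios can be sketched as follows. The text does not specify how feature-to-feature similarity is computed, so the absolute Pearson correlation used here is an assumption, as are the function names. The edge-cut quantities are taken as given inputs because computing a minimum bisection requires a separate graph partitioner that is outside this sketch.

```python
import numpy as np


def knn_feature_graph(X, k):
    """Symmetric k-nearest-neighbor graph over the columns (features) of X.

    X has shape (n_samples, m_features). Edge weights are absolute Pearson
    correlations (assumed similarity measure); non-neighbors get weight 0.
    """
    sim = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(sim, 0.0)
    W = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(sim[i])[::-1][:k]   # k most similar features
        W[i, nbrs] = sim[i, nbrs]
    return np.maximum(W, W.T)                 # keep an edge if either end chose it


def relative_interconnectivity(ec_between, ec_a, ec_b):
    """RI: edge cut between two clusters over their mean internal edge cut."""
    return abs(ec_between) / ((abs(ec_a) + abs(ec_b)) / 2.0)


def relative_closeness(s_between, s_a, s_b, size_a, size_b):
    """RC: mean connecting-edge weight over the size-weighted internal means."""
    total = size_a + size_b
    return s_between / (size_a / total * s_a + size_b / total * s_b)


# Synthetic illustration: features 0/1 and 2/3 are near-duplicates.
rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([a, a + 0.05 * rng.normal(size=100),
                     b, b + 0.05 * rng.normal(size=100)])
W = knn_feature_graph(X, k=1)
```

Values of RI and RC near 1 indicate that the connection between two clusters is about as strong as the connections inside them, which is the situation in which a merge is favored.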

S6: Use the improved clustering algorithm to screen redundant features within clusters;

S61: Compute the distances between clusters and sort them, and check whether the number of sample subclusters h equals the initialized expected number of clusters k;

S62: If they are not equal, merge the two subclusters with the largest similarity function value; if they are equal, stop;

S63: Recompute the relative closeness RC of the new subcluster, traverse all subclusters, and check whether a merge has been attempted for every pair of subclusters;

S64: If a merge has been attempted for all subclusters, return to S61; otherwise merge the two subclusters with the smallest similarity function value and return to S63;

S65: Select the features with the largest HSIC value and combine them.
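The core of S61 and S62 is an agglomerative loop that keeps merging the most similar pair until k clusters remain. The sketch below is a simplification under stated assumptions: the average inter-cluster edge weight stands in for the similarity function based on RI and RC, the bookkeeping of S63 and S64 is omitted, and the name `merge_until_k` is hypothetical.

```python
import numpy as np


def merge_until_k(W, k):
    """Greedily merge clusters of features until only k clusters remain.

    W is a symmetric (m, m) matrix of edge weights between features.
    Cluster similarity is the mean weight of the edges between two clusters
    (an assumed stand-in for the RI/RC-based similarity function).
    """
    clusters = [[i] for i in range(W.shape[0])]
    while len(clusters) > k:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = W[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return clusters


# Illustrative weights: features 0/1 and 2/3 are strongly connected.
W_demo = np.full((4, 4), 0.1)
W_demo[0, 1] = W_demo[1, 0] = 0.9
W_demo[2, 3] = W_demo[3, 2] = 0.8
np.fill_diagonal(W_demo, 0.0)
clusters_demo = merge_until_k(W_demo, k=2)
```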

S7: From each feature cluster, select the one feature with the largest HSIC value and combine the selected features into the optimal feature set.
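Step S7 can be sketched as a per-cluster maximum. The cluster contents and the HSIC scores below are hypothetical values chosen for illustration, and `select_representatives` is an assumed name.

```python
def select_representatives(clusters, hsic_scores):
    """Pick, from each cluster, the feature with the largest HSIC score."""
    return [max(cluster, key=lambda name: hsic_scores[name])
            for cluster in clusters]


# Hypothetical clusters and scores for illustration only.
clusters = [["1/R4", "1/R5"], ["A", "H", "W"], ["R2R3", "R1R3"]]
scores = {"1/R4": 0.10, "1/R5": 0.18, "A": 0.05, "H": 0.12,
          "W": 0.30, "R2R3": 0.21, "R1R3": 0.07}
optimal_set = select_representatives(clusters, scores)
```

Because every feature in a cluster is treated as carrying similar information, keeping only the highest-scoring member removes redundancy while retaining the member most related to the body composition category.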

By adopting the above technical solution, the invention achieves the following technical effects. Based on the characteristics of human physiological feature parameters, a feature parameter selection algorithm combining Filter and clustering is proposed. Feature filtering with the Hilbert-Schmidt independence criterion removes features unrelated to the category, and the improved Chameleon clustering, adapted and optimized for feature selection, removes redundant features. The optimal feature parameter set for constructing the body composition model is thereby selected effectively, which solves the problem that human physiological feature parameters are numerous and redundant and provides a more effective means of measurement for body composition research and clinical application.

Description of the drawings

Figure 1 is a schematic diagram of the Chameleon clustering algorithm;

Figure 2 is a schematic diagram of the improved Chameleon algorithm;

Figure 3 shows the selection process for human feature parameters;

Figure 4 shows the relevance between the feature parameters and BFM obtained with the filter algorithm in the 1 kHz band;

Figure 5 shows the relevance between the feature parameters and BFM obtained with the filter algorithm in the 250 kHz band;

Figure 6 shows the relevance between the feature parameters and BFM obtained with the filter algorithm in the 500 kHz band;

Figure 7 shows the analysis of the number of parameter clusters after the filter algorithm is applied;

Figure 8 shows the distances between the feature parameters and the BFM index when different sample sizes are grouped into four clusters;

Figure 9 compares the values predicted by the BFM models with the actual values;

Figure 10 compares the relative errors of the values predicted by the BFM models.

Detailed description

To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.

Body composition data measured with INBODY are used as the data set, denoted T = (O, F, C). The parameters that have an important influence on body composition, such as weight, height, age, gender and the impedance of each body segment, are used as the first feature parameters; the reciprocal 1/R_i, the square R_i² and the product R_iR_j of the segment impedances are used as the second feature parameters. INBODY measures in three frequency bands, 1 kHz, 250 kHz and 500 kHz, and the relationship between body composition and the feature parameters is studied here in each of these three bands and for different sample sizes. The first feature parameters (R₁, R₂, R₃, R₄, R₅, A, H, W) and the second feature parameters 1/R₁, 1/R₂, 1/R₃, 1/R₄, 1/R₅, R₁R₂, R₁R₃, R₁R₄, R₁R₅, R₂R₃, R₂R₄, R₂R₅, R₃R₄, R₃R₅, R₄R₅ form the original feature parameter set, denoted F = {f₁, f₂, …, f_m}. The body composition category set C includes body fat mass (BFM) and total body water (TBW). The table below lists part of the sample data set.

However, many parameters affect body composition, and these parameters are highly nonlinear, redundant, and in part irrelevant. A dimensionality reduction method is therefore needed to deal with redundant and irrelevant feature parameters. Clustering divides the body composition parameter data into groups or clusters so that objects within a cluster are highly similar, and by judging each cluster according to its distance from the center point it effectively screens out redundant features. In addition, before the high-dimensional body composition data are analyzed, attributes unrelated to the target are eliminated by a step that reduces the number of features;

The filter algorithm is therefore applied to the data first. Given the original feature set F = {f₁, f₂, …, f_m}, the data sample set O = {o₁, o₂, …, o_n} and the body composition variables BFM and TBW, the filter algorithm is run on the first 100 subjects in the three frequency bands of 1 kHz, 250 kHz and 500 kHz. The figures below show the relevance of the filtered feature parameters obtained after running the algorithm for the body composition variable BFM.

Each feature is mapped by a nonlinear feature map $\varphi: F \to \mathcal{H}_F$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_F$ with kernel function $k(f_i, f_j) = \langle \varphi(f_i), \varphi(f_j) \rangle_{\mathcal{H}_F}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}_F}$ is the inner product on $\mathcal{H}_F$. Similarly, define a body composition category map $\psi: C \to \mathcal{H}_C$ that maps the body composition index space C into an RKHS $\mathcal{H}_C$; the corresponding kernel function is:

$$l(c_i, c_j) = \langle \psi(c_i), \psi(c_j) \rangle_{\mathcal{H}_C}$$

A kernel function computes the inner product between the feature-space images of two points without explicitly computing the mappings $\varphi$ and $\psi$, and therefore avoids the computational cost implied by the dimensionality of the feature space. The cross-covariance operator between a feature and the body composition category can thus be defined as:

$$C_{fc} = E_{fc}\big[(\varphi(f) - E_f[\varphi(f)]) \otimes (\psi(c) - E_c[\psi(c)])\big]$$

In the expression above, $\otimes$ denotes the tensor product and $E_f$ and $E_c$ denote expectations [16]. The squared norm of this cross-covariance operator is called HSIC; its expression is [14]:

$$\mathrm{HSIC}(f, c) = \lVert C_{fc} \rVert_{HS}^{2}$$

Running the filter algorithm for the body composition variable BFM at the different impedance frequency bands yields the relevance results shown in Figures 4, 5 and 6. The three figures show that as the impedance frequency band increases, the impedance values decrease and the amount of BFM information contained in each feature parameter gradually decreases. Feature parameters are selected using an 80% confidence interval as the screening criterion; the features obtained after running the filter algorithm in the different frequency bands are summarized in Table 2:

Table 2: Features after running the filter algorithm in different frequency bands

As Table 2 shows, the algorithm greatly reduces the size of the original feature set, and the 250 kHz band retains the most features. The features filtered in the intermediate impedance band of 250 kHz are therefore selected for cluster analysis to screen out redundant information.

Before clustering, the number of clusters must be determined. The amount of information contained under different numbers of clusters is computed for the filtered feature parameters, as shown in Figure 7; the analysis shows that dividing the feature parameters and body composition into 4 clusters represents the selected feature information well. For sample sizes of 20, 40, 60, 80 and 100 subjects, Figure 8 shows that the clustering changes little: 1/R₄ and 1/R₅ form one cluster; A, H, W, R₅ and R₄ form one cluster; R₄R₅, R₁R₂, R₂², R₁² and R₅² form one cluster; and R₂R₃ and R₁R₃ form one cluster. After the feature parameters obtained from the Filter algorithm are grouped into 4 clusters, 1/R₄, R₄, R₁² and R₁R₃, which lie farther from the cluster center BFM, can be removed. Table 3 lists the feature parameters selected after the Filter and clustering algorithms.

Table 3: Feature parameters after the Filter and clustering algorithms

Table 4 lists the candidate feature sets for BFM prediction obtained with the three feature selection methods, together with their time complexity;

Table 4: Comparison of optimal feature sets and complexity

Table 4 shows that, for data sets of the same dimensionality, both the number of candidate features obtained with the algorithm of the invention and its time complexity are smaller than those of the combined Filter and Wrapper algorithm and the mRMR feature selection algorithm;

To evaluate the performance of the proposed feature selection algorithm, feature selection for body composition (BFM) is carried out with mRMR, with the combined Filter and Wrapper feature selection algorithm, and with the proposed algorithm. To measure accurately how good each candidate feature set is for the given body composition variable BFM, the first 80 samples are used as the training set, denoted T₁ = {(x₁, y₁), (x₂, y₂), …, (x₈₀, y₈₀)}, and the last 20 as the test set, denoted T₂ = {(x₈₁, y₈₁), (x₈₂, y₈₂), …, (x₁₀₀, y₁₀₀)}, where x_i ∈ R^l is the vector of input feature parameter values (the independent variables) and y_i ∈ R is the actual body composition value (the dependent variable). Multiple linear regression in SPSS is used to train on T₁. Table 5 summarizes the models obtained by regression modeling of BFM with the above feature sets:

Table 5: Model summary

a. Predictors: (constant), W, S, A, R₃, 1/R₂, 1/R₁, 1/R₃, R₄², R₄R₅, R₅²

b. Predictors: (constant), 1/R₃, W, S, R₂², R₄², R₄R₅, R₅², 1/R₁, R₅

c. Predictors: (constant), A, H, W, R₅, R₁R₂, R₂R₃, R₄R₅, 1/R₅, R₂², R₅²

According to Table 5, the correlations between BFM and the physiological feature sets of models 1, 2 and 3 are 0.927, 0.906 and 0.978 respectively; the feature set obtained with the proposed algorithm (model 3) therefore has the strongest correlation with body composition;

The prediction equations follow from the regression coefficients obtained for each model:

$$\mathrm{BFM}_1 = 0.041W + 0.126S + 0.523A - 0.212R_3 + \frac{0.171}{R_1} + \frac{0.126}{R_2} + \frac{0.179}{R_3} + 0.132R_4^2 + 0.13R_4R_5 + 0.127R_5^2 - 8.56 \qquad (1)$$

$$\mathrm{BFM}_2 = 0.313W - 0.044S - \frac{0.125}{R_3} + \frac{0.108}{R_1} + 0.016R_4^2 - 0.01R_2^2 + 0.071R_5^2 + 0.072R_4R_5 - 0.526R_5 + 5.674 \qquad (2)$$

$$\mathrm{BFM}_3 = -0.464A - 0.15H + 0.122W - 0.143R_5 + 0.129R_1R_2 + 0.122R_2R_3 - 0.134R_4R_5 + \frac{0.145}{R_5} + 0.129R_2^2 - 0.141R_5^2 \qquad (3)$$

The resulting prediction models are applied to the test set T₂ and compared with the actual values, giving the comparison of predicted and actual BFM values in Figure 9 and the error analysis in Figure 10. Figure 10 shows that the prediction model built from the features obtained with the proposed feature selection algorithm is more accurate, with a relative prediction error below 0.12. The results indicate that the feature set obtained by the feature selection algorithm combining filtering and clustering correlates well with body composition, improves the fitting accuracy of the body composition prediction model, and reduces the prediction error.
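The evaluation protocol (fit on the first 80 samples, test on the last 20, report the relative error) can be sketched with NumPy least squares in place of SPSS. The data below are synthetic and the function names are assumptions; nothing here reproduces the coefficients or errors reported in the text.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [weights..., intercept]."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to new feature rows."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def relative_error(y_true, y_pred):
    """Element-wise |prediction - actual| / |actual|."""
    return np.abs(y_pred - y_true) / np.abs(y_true)


# Synthetic stand-in for 100 subjects with 3 selected features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 20.0 + X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

coef = fit_linear(X[:80], y[:80])                   # training set T1
errors = relative_error(y[80:], predict(coef, X[80:]))  # test set T2
```

Splitting by position rather than at random mirrors the protocol described in the text, where the first 80 samples form the training set and the last 20 the test set.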

Compared with the prior art, the invention provides a human physiological feature selection algorithm based on the combination of Filter and clustering. Feature filtering with the Hilbert-Schmidt independence criterion removes features unrelated to the category, and the improved Chameleon clustering, adapted and optimized for feature selection, removes redundant features. The optimal feature parameter set for constructing the body composition model is thereby selected effectively, which solves the problem that human physiological feature parameters are numerous and redundant. A body composition prediction model built in this way can improve the accuracy of body composition prediction and provides a more effective means of measurement for body composition research and clinical application.

The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solution and inventive concept, falls within the scope of protection of the invention.

Claims

1. A human physiological feature selection algorithm based on the combination of filtering and improved clustering, characterized by comprising:

S1: Select the impedance model, collect the first characteristic parameter and the second characteristic parameter data to construct the initial feature set and the final optimal subset, and initialize the initial feature set and the final optimal subset as empty sets;

S2: Introduce a filtering algorithm to calculate the HSIC value under the body composition category for each feature in the collected data, which represents the correlation between physiological characteristics and body composition categories;

S3: Sort the feature set according to the value of HSIC from large to small;

S4: Add the top K features to the feature set, use the filtering algorithm to filter out parameters that are not related to the body composition, and construct the initial data set;

S5: Construct a sparse feature graph from the data set according to the clustering algorithm, where RI characterizes the edges interconnecting features and RC the closeness between features, and initialize the expected number of clusters k;

S6: Use the improved clustering algorithm to screen redundant features in the cluster;

S7: Select a feature with the largest HSIC value from each feature cluster to form an optimal feature set.

2. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 1, characterized in that body composition data measured by a human body composition analyzer are used as the data set, denoted T = (O, F, C), where O is the data sample set, F is the selected feature set, and C is the body composition category; the set of parameters that have an important influence on body composition is used as the first feature parameters, and the reciprocal 1/R_i, the square R_i² and the product R_iR_j of the segment impedances are used as the second feature parameters; wherein the impedance values are the 1 kHz impedance parameters of the human body composition analyzer, and the first feature parameters R1, R2, R3, R4, R5, A, H, W and the second feature parameters 1/R1, 1/R2, 1/R3, 1/R4, 1/R5, R1R2, R1R3, R1R4, R1R5, R2R3, R2R4, R2R5, R3R4, R3R5, R4R5 form the original feature parameter set, denoted F = {f₁, f₂, …, f_m}, where f₁, f₂, …, f_m are feature points, A is age, H is height and W is weight; the body composition category C includes body fat mass and total body water.

3. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 1 or 2, characterized in that, for each feature in the collected data, the HSIC value under the body composition category C is calculated, specifically:

For each feature f₁, f₂, ···, f_m ∈ F, a nonlinear feature map is defined that maps the feature points f₁, f₂, ···, f_m into a reproducing kernel Hilbert space; the body composition index C is likewise mapped into a reproducing kernel Hilbert space, each space having an associated kernel function; for each feature f₁, f₂, ···, f_m ∈ F, the HSIC value under the body composition category C is then calculated.
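One common way to compute such a value from samples is the biased empirical HSIC estimator, trace(KHLH)/(n−1)², sketched below with NumPy. The Gaussian kernel, the bandwidth `sigma`, and the synthetic data are assumptions of this sketch; the claim does not state which kernel or estimator is used.

```python
import numpy as np


def gaussian_kernel(x: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gram matrix of a Gaussian kernel for a 1-D sample vector x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    sq_dist = (x - x.T) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def empirical_hsic(f: np.ndarray, c: np.ndarray, sigma: float = 1.0) -> float:
    """Biased empirical HSIC between feature samples f and target samples c."""
    n = len(f)
    K = gaussian_kernel(f, sigma)        # kernel on the feature
    L = gaussian_kernel(c, sigma)        # kernel on the body composition value
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
c = rng.normal(size=200)
dependent = c ** 2 + 0.1 * rng.normal(size=200)  # nonlinear function of c
independent = rng.normal(size=200)               # unrelated to c
hsic_dep = empirical_hsic(dependent, c)
hsic_ind = empirical_hsic(independent, c)
print(hsic_dep > hsic_ind)
```

Because the kernels are nonlinear, the estimator responds to the quadratic relationship in `dependent`, which a plain correlation coefficient would largely miss.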

4. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 3, characterized in that a cross-covariance operator between the feature and the body composition category is defined in terms of the tensor product ⊗ and the expectation E; the HSIC value characterizes the correlation between a physiological feature and the body composition category:

For a certain feature f and body composition category c, the larger the value of HSIC, the stronger the dependence of c on f.
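For reference, the cross-covariance operator and HSIC are usually written as follows. The symbols φ, ψ, μ_f, and μ_c follow the standard HSIC formulation and are assumptions here; they are not taken from the claim text.

```latex
C_{fc} = \mathbb{E}_{fc}\!\left[\big(\phi(f) - \mu_f\big) \otimes \big(\psi(c) - \mu_c\big)\right],
\qquad \mu_f = \mathbb{E}_f[\phi(f)], \quad \mu_c = \mathbb{E}_c[\psi(c)]

\mathrm{HSIC}(f, c) = \left\lVert C_{fc} \right\rVert_{\mathrm{HS}}^{2}
```

Here φ and ψ are the feature maps into the two reproducing kernel Hilbert spaces, and the norm is the Hilbert-Schmidt norm.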

5. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 1, characterized in that the Chameleon agglomerative hierarchical clustering method is used and the sparse feature graph is constructed as a K-nearest-neighbor graph, in which each vertex represents a data object and vertices are joined by edges whose weights reflect the similarity between the objects; the similarity of feature sub-clusters is evaluated on two criteria: 1) the interconnectivity of the clusters; 2) the proximity of the clusters; the similarity between two feature clusters is determined from their relative interconnectivity RI and relative proximity RC.
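A minimal sketch of a K-nearest-neighbor feature graph built with NumPy. The similarity measure (absolute Pearson correlation between feature columns), the function name `knn_feature_graph`, and the synthetic data are assumptions; the claim does not specify how edge weights are computed.

```python
import numpy as np


def knn_feature_graph(X: np.ndarray, n_neighbors: int) -> np.ndarray:
    """Symmetric k-nearest-neighbor graph over the columns (features) of X.

    Returns an (m, m) weight matrix with zeros where there is no edge.
    """
    sim = np.abs(np.corrcoef(X, rowvar=False))  # feature-to-feature similarity
    np.fill_diagonal(sim, 0.0)                  # no self-loops
    W = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nearest = np.argsort(sim[i])[::-1][:n_neighbors]  # most similar first
        W[i, nearest] = sim[i, nearest]
    return np.maximum(W, W.T)  # keep an edge if either endpoint selected it


# Synthetic data, for illustration only: two pairs of near-duplicate features.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, a + 0.1 * rng.normal(size=100),
                     b, b + 0.1 * rng.normal(size=100)])
W = knn_feature_graph(X, n_neighbors=1)
print(np.count_nonzero(W))
```

Keeping only the strongest neighbors of each vertex is what makes the graph sparse; redundant features end up joined by heavy edges.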

6. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 4, characterized in that, given the normalized feature data set F = {f₁, f₂, ···, f_m} filtered by the filtering algorithm, the cluster F is divided into sub-clusters f₁ and f₂ such that the total weight of the edges cut by the division is minimal; the relative interconnectivity RI(f₁, f₂) of two feature clusters f₁ and f₂, which is larger the more strongly f₁ and f₂ are connected, is defined as the interconnectivity between f₁ and f₂ normalized with respect to the internal interconnectivity of the two clusters, namely:

$$RI(f_1, f_2) = \frac{\left|EC_{\{f_1, f_2\}}\right|}{\frac{1}{2}\left(\left|EC_{f_1}\right| + \left|EC_{f_2}\right|\right)}$$

where $EC_{\{f_1, f_2\}}$ is the edge cut of the cluster containing both f₁ and f₂; similarly, $EC_{f_1}$ (or $EC_{f_2}$) is the minimum sum of the cut edges that partition f₁ (or f₂) into two roughly equal parts.

7. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 5, characterized in that

the relative proximity RC(f₁, f₂) of two feature clusters f₁ and f₂ is defined as the proximity between f₁ and f₂ normalized with respect to the internal proximity of the two feature clusters, namely:

$$RC(f_1, f_2) = \frac{\bar{S}_{EC_{\{f_1, f_2\}}}}{\frac{|f_1|}{|f_1| + |f_2|}\,\bar{S}_{EC_{f_1}} + \frac{|f_2|}{|f_1| + |f_2|}\,\bar{S}_{EC_{f_2}}}$$

where $\bar{S}_{EC_{\{f_1, f_2\}}}$ is the average weight of the edges connecting vertices in f₁ to vertices in f₂, and $\bar{S}_{EC_{f_1}}$ (or $\bar{S}_{EC_{f_2}}$) is the average weight of the edges belonging to the minimum-cut bisection of cluster f₁ (or f₂);

the similarity between two sub-clusters is determined by the relative interconnectivity and relative proximity of the feature sub-clusters f₁ and f₂.
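The two measures can be evaluated as in the sketch below once the edge-cut quantities are known. The function names and input numbers are hypothetical. The combined score RI · RC^α is the form used in the original Chameleon algorithm and is an assumption here, since the claim does not state how RI and RC are combined.

```python
def relative_interconnectivity(ec_between: float, ec_1: float, ec_2: float) -> float:
    """RI: edge-cut weight between two clusters over the mean internal edge cut."""
    return ec_between / (0.5 * (ec_1 + ec_2))


def relative_proximity(s_between: float, s_1: float, s_2: float,
                       size_1: int, size_2: int) -> float:
    """RC: average connecting-edge weight over the size-weighted internal averages."""
    total = size_1 + size_2
    return s_between / (size_1 / total * s_1 + size_2 / total * s_2)


def similarity(ri: float, rc: float, alpha: float = 1.0) -> float:
    """Combined score RI * RC**alpha (Chameleon convention, assumed here)."""
    return ri * rc ** alpha


# Hypothetical edge-cut statistics, for illustration only.
ri = relative_interconnectivity(ec_between=3.0, ec_1=2.0, ec_2=4.0)
rc = relative_proximity(s_between=0.6, s_1=0.8, s_2=0.4, size_1=3, size_2=1)
print(ri, rc, similarity(ri, rc))
```

A value of α above 1 gives more weight to proximity and a value below 1 gives more weight to interconnectivity.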

8. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 1, characterized in that the improved clustering algorithm traverses all feature sub-clusters and attempts to merge and replace them: after sub-clusters are merged, the quality of the feature selection is evaluated and further merging of the existing sub-clusters is attempted; the specific steps are:

S61: Calculate the distances between clusters and sort them, and determine whether the number of sample sub-clusters h equals the initialized expected number of clusters k;

S62: If they are not equal, select the two sub-clusters with the largest similarity function value and merge them; if they are equal, end;

S63: Recalculate the relative proximity RC of the new sub-cluster, traverse all sub-clusters, and check whether every pair of sub-clusters has been tried for merging;

S64: If all pairs of sub-clusters have been tried, return to S61; otherwise, merge the two sub-clusters with the smallest similarity function value and return to S63;

S65: Select the feature with the largest HSIC value from each cluster and combine the selected features.
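The merge loop and the final selection can be sketched as follows. This is a simplified illustration under stated assumptions: the similarity between two sub-clusters is taken as the mean edge weight between them rather than the full RI/RC score, the pairwise retry logic of S63 and S64 is collapsed into a greedy loop, and the names `merge_until_k` and `pick_representatives` are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence

import numpy as np


def merge_until_k(W: np.ndarray, k: int) -> List[List[int]]:
    """Greedily merge the most similar sub-clusters until k clusters remain."""
    m = W.shape[0]
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and the number of features")
    clusters = [[i] for i in range(m)]
    while len(clusters) > k:  # S61: stop once h equals k
        best_pair, best_sim = None, -np.inf
        for a, b in combinations(range(len(clusters)), 2):
            # Simplified similarity: mean edge weight between the two clusters.
            sim = W[np.ix_(clusters[a], clusters[b])].mean()
            if sim > best_sim:
                best_pair, best_sim = (a, b), sim
        a, b = best_pair  # S62: merge the most similar pair (a < b)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pick_representatives(clusters: List[List[int]], hsic: Sequence[float]) -> List[int]:
    """S65: keep the feature with the largest HSIC value in each cluster."""
    return [max(cluster, key=lambda i: hsic[i]) for cluster in clusters]


# Hypothetical edge weights and HSIC values, for illustration only.
W = np.full((4, 4), 0.1)
np.fill_diagonal(W, 0.0)
W[0, 1] = W[1, 0] = 0.9
W[2, 3] = W[3, 2] = 0.8
clusters = merge_until_k(W, k=2)
representatives = pick_representatives(clusters, hsic=[0.3, 0.5, 0.2, 0.1])
print(clusters, representatives)
```

Each remaining cluster groups mutually redundant features, so keeping one representative per cluster removes redundancy while retaining the feature most dependent on the body composition category.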