CN108154189A

CN108154189A - Grey relational cluster method based on LDTW distances

Info

Publication number: CN108154189A
Application number: CN201810022935.8A
Authority: CN
Inventors: 代劲; 何雨虹; 宋娟; 吴朝文
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2018-06-12

Abstract

The present invention relates to the field of data mining, and specifically to a grey relational clustering method based on the LDTW distance. The method comprises: processing the original data set to obtain preprocessed sequences; forming a reference sequence from the maximum value of each dimension of the preprocessed sequences; calculating the LDTW distance between each preprocessed sequence and the reference sequence, together with its warping path length; calculating the grey relational degree between each preprocessed sequence and the reference sequence based on the LDTW distance; and determining a critical value according to the grey relational degrees and dividing the grey relational degree range into a plurality of critical intervals, such that two sequences whose grey relational degrees fall in the same critical interval are clustered into one class. The invention reduces the error of the similarity measure between two sequences and can help biologists study the function of proteins.

Description

Grey Relational Clustering Method Based on LDTW Distance

Technical Field

The invention belongs to the field of data mining, and in particular relates to a grey relational clustering method for biological feature data based on the dynamic time warping distance under limited warping path length (DTW under limited warping path length, LDTW).

Background

With the rapid growth of biological databases, computer programs are increasingly used to classify data automatically. Bioinformatics holds great promise and also brings opportunities and challenges to data analysts. Tens of billions of data records are pouring into public databases, and analyzing them by experimental methods alone is time-consuming and expensive. It is therefore necessary to find effective and fast computational methods to analyze these data automatically. Huge biological databases pose many challenging problems for data mining technology and also offer broad opportunities. Cluster analysis is an important and commonly used data mining technique; applying it to biological data can help study the properties and functions of genes and proteins.

Cluster analysis has become a very active research topic in data mining. It has been applied successfully in many areas of the life sciences, for example partitioning sets of gene sequences, identifying functional genes, inferring phylogenetic trees of species, and clustering proteins by their physical and chemical properties in order to predict their functions. It has become an important tool for functional gene research in the post-genome era, and because of its wide applicability a large amount of cluster analysis software has appeared, which further facilitates its adoption.

Grey system theory, proposed by Professor Deng Julong in 1982 (reference: Liu Sifeng, Yang Yingjie, Wu Lifeng. Grey System Theory and Its Applications [M]. Science Press, 2014.), has become an emerging theoretical system whose foundations include grey degree, interval grey numbers, grey equations, and grey matrices. The core of grey relational clustering is the calculation of the grey relational degree between sequences, so choosing an appropriate grey relational analysis method is crucial to the clustering result. Grey relational analysis, a very active branch of grey system theory, converts the discrete observations of system factors into piecewise continuous polylines and then builds a model that measures the degree of association from the geometric features of those polylines. The grey relational degree is a measure of the association between factors; Deng's grey relational degree, proposed by Deng Julong, is widely used (reference: Deng Julong. A Course in Grey System Theory [M]. Huazhong University of Science and Technology Press, 1990.). In bioengineering, the data collected for each organism involve many uncertain factors, so the data can be regarded as a grey system.

The traditional grey relational degree model cannot handle uncertain phenomena such as missing data, warping, and drift effectively. Sequences are usually brought to equal length by deleting data from the longer sequence, filling with mean values, or predicting with a GM(1,1) model, which further amplifies the uncertainty and causes unnecessary information loss. A dynamic relational analysis model for uncertain sequences is therefore urgently needed. The LDTW distance is a dynamic time warping (DTW) distance computed under an added limit on the total number of links between the sequences. It requires no padding of the sequence data, and it still judges similarity by the shortest path through the distance matrix between the data sequences. Applying an LDTW-based grey relational degree model to biological clustering can therefore cluster biological sequences effectively, thereby mining their latent information and yielding new biological knowledge.

At present, the main source of error in the traditional Deng's grey relational degree is that it uses the absolute difference to represent the distance between two sequences. Researchers have improved this distance calculation by replacing the absolute difference with the dynamic time warping (DTW) distance; see Dai J, Hu F, Liu X. Research on grey incidence measurement method based on dynamic time warping distance [J]. Journal of Grey System, 2015, 27(1): 117-126. However, DTW has long suffered from pathological alignments, and most existing remedies impose rigid constraints on the warping path, so they are likely to miss the correct alignment. Existing methods therefore remain flawed, which biases the latent information mined with them.

Summary of the Invention

To address the problems of the prior art, the present invention provides a grey relational clustering method based on the LDTW distance. Applied in data mining, the invention clusters data by quantitatively analyzing the differences between the attributes of the objects being clustered. It can group proteins with similar functions into one class, thereby helping biologists study protein function.

As shown in Figure 1, the grey relational clustering method based on the LDTW distance of the present invention comprises:

S1. processing the original data set to obtain preprocessed sequences;

S2. forming a reference sequence from the maximum value of each dimension of the preprocessed sequences;

S3. calculating the LDTW distance between each preprocessed sequence and the reference sequence, together with its warping path length λ;

S4. calculating the grey relational degree between each preprocessed sequence and the reference sequence based on the LDTW distance;

S5. determining a critical value ε according to the grey relational degrees and dividing the grey relational degree range into a plurality of critical intervals according to ε; if the grey relational degrees of two sequences fall in the same critical interval, the two sequences are clustered into one class.

Preferably, step S1 processes the original data set to obtain preprocessed sequences, converting the data into dimensionless values of roughly similar magnitude.

Processing the original data set to obtain the preprocessed sequences comprises: preprocessing each original sequence so that it is converted into dimensionless data, the dimensionless data being the preprocessed sequence. Specifically:

Original sequence: X′_i = (x′_i(1), x′_i(2), ..., x′_i(n));

Preprocessed sequence: X_i = (x_i(1), x_i(2), ..., x_i(n));

where X′_i denotes the i-th original sequence and x′_i(j) denotes its data value in the j-th dimension; X_i denotes the i-th preprocessed sequence and x_i(j) denotes its value in the j-th dimension;

where x′_max(j) denotes the maximum value over all original sequences in the j-th dimension, x′_min(j) denotes the minimum value over all original sequences in the j-th dimension, i ∈ {1, 2, ..., I}, j ∈ {1, 2, ..., n}, I is the number of sequences, and n is the number of dimensions.
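The preprocessing formula itself is not reproduced in this text. The definitions of x′_max(j) and x′_min(j) are consistent with per-dimension min-max scaling, x_i(j) = (x′_i(j) − x′_min(j)) / (x′_max(j) − x′_min(j)), so the following Python sketch assumes that form. The function name `min_max_normalize` and the handling of constant dimensions are illustrative choices, not taken from the patent.

```python
def min_max_normalize(sequences):
    """Scale every dimension (column) of the raw sequences to [0, 1].

    Assumes all sequences have the same length and contain no missing values.
    """
    n = len(sequences[0])
    col_min = [min(seq[j] for seq in sequences) for j in range(n)]
    col_max = [max(seq[j] for seq in sequences) for j in range(n)]
    normalized = []
    for seq in sequences:
        row = []
        for j in range(n):
            span = col_max[j] - col_min[j]
            # Assumption: a constant dimension is mapped to 0.0 to avoid
            # dividing by zero.
            row.append((seq[j] - col_min[j]) / span if span else 0.0)
        normalized.append(row)
    return normalized


raw = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
print(min_max_normalize(raw))
```

After this scaling every dimension is dimensionless and lies in the same range, which is what step S1 requires.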

Forming the reference sequence from the maximum value of each dimension of the preprocessed sequences comprises:

X₀ = (x₀(1), x₀(2), ..., x₀(n)), with x₀(j) = x_max(j),

where x₀(j) denotes the value of the j-th dimension of the reference sequence; x_max(j) denotes the maximum value of the j-th dimension over the preprocessed sequences; j ∈ {1, 2, ..., n}, and n is the number of dimensions.
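Step S2 can be sketched in a few lines of Python. The function name `reference_sequence` is hypothetical, and the sketch assumes equal-length sequences without missing values.

```python
def reference_sequence(sequences):
    """Build the reference sequence from the per-dimension maxima."""
    n = len(sequences[0])
    return [max(seq[j] for seq in sequences) for j in range(n)]


preprocessed = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.5]]
print(reference_sequence(preprocessed))
```

Because each component is the best value observed in that dimension, the reference sequence acts as a common yardstick against which every preprocessed sequence is compared.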

Preferably, calculating the LDTW distance and the path length λ between a preprocessed sequence and the reference sequence comprises the following.

The preprocessed sequence is X_i = (x_i(1), x_i(2), ..., x_i(n)); the reference sequence is X₀ = {x₀(1), x₀(2), ..., x₀(n)}.

Further, before the LDTW distance, the path length, and the grey relational degree are calculated, the dimensions with missing data are removed from the preprocessed sequence X_i = (x_i(1), x_i(2), ..., x_i(n)), giving a new preprocessed sequence X_i = (x_i(1), x_i(2), ..., x_i(m)).

The LDTW distance is then calculated between the new preprocessed sequence and the reference sequence. The distance matrix between the preprocessed sequence X_i and the reference sequence X₀ is the m × n matrix whose entries are the pairwise distances between the components of X_i and the components of X₀,

where dis(x_i(m), x₀(n)) = |x_i(m) − x₀(n)| is the distance between the m-th component x_i(m) of the i-th sequence and the n-th component x₀(n) of the reference sequence.
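The distance matrix can be built directly from the definition of dis. In this sketch the function name `distance_matrix` is hypothetical; rows follow the components of X_i and columns follow the components of X₀.

```python
def distance_matrix(x_i, x_0):
    """Return the len(x_i) x len(x_0) matrix of absolute differences."""
    return [[abs(a - b) for b in x_0] for a in x_i]


# A two-component sequence against a three-component reference gives a 2 x 3 matrix.
for row in distance_matrix([1.0, 3.0], [1.0, 2.0, 4.0]):
    print(row)
```

The two sequences do not need the same length, which is why no padding of missing data is required.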

The LDTW distance between the new preprocessed sequence X_i and the reference sequence X₀ is defined recursively over the number of steps s, where:

D(X_i(m), X₀(n), s) denotes the LDTW distance between X₀ and X_i; min denotes taking the minimum; D(x_i(m), x₀(n−1), s−1) denotes the cumulative distance up to the pair (x_i(m), x₀(n−1)) after s−1 steps; D(x_i(m−1), x₀(n), s−1) denotes the cumulative distance up to the pair (x_i(m−1), x₀(n)) after s−1 steps; D(x_i(m−1), x₀(n−1), s−1) denotes the cumulative distance up to the pair (x_i(m−1), x₀(n−1)) after s−1 steps; i ∈ {1, 2, ..., I}; s is the number of steps; I is the number of sequences; m and n are numbers of dimensions, with m ≤ n.

Further, regarding the path length λ: when the LDTW distance between the preprocessed sequence and the reference sequence is calculated, a constraint is imposed on the warping path length, and the current length of the warping path is taken into account as an additional factor. The path length is the number of matrix cells the path passes through.

LDTW works by limiting the length of the warping path. To describe the warping path length more intuitively, it is expressed as a number of steps, denoted s, where number of steps = path length - 1.
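The recursion formula itself is missing from this text, but the three predecessor terms listed above suggest the form D(a, b, s) = dis(a, b) + min{D(a, b−1, s−1), D(a−1, b, s−1), D(a−1, b−1, s−1)}. The Python sketch below implements that assumed form. The function name `ldtw`, the parameter `max_steps` (the upper bound on the number of steps), and the final minimisation over all admissible step counts are assumptions made for illustration, not details confirmed by the patent.

```python
import math


def ldtw(x_i, x_0, max_steps):
    """Return (distance, path_length) of the cheapest warping path that
    uses at most max_steps steps, with path_length = steps + 1."""
    m, n = len(x_i), len(x_0)
    inf = math.inf
    # cost[a][b][s]: cheapest cumulative distance of a path that reaches
    # cell (a, b) from cell (0, 0) in exactly s steps (0-based indices).
    cost = [[[inf] * (max_steps + 1) for _ in range(n)] for _ in range(m)]
    cost[0][0][0] = abs(x_i[0] - x_0[0])
    for a in range(m):
        for b in range(n):
            if a == 0 and b == 0:
                continue
            d = abs(x_i[a] - x_0[b])
            for s in range(1, max_steps + 1):
                best = inf
                if a > 0:
                    best = min(best, cost[a - 1][b][s - 1])
                if b > 0:
                    best = min(best, cost[a][b - 1][s - 1])
                if a > 0 and b > 0:
                    best = min(best, cost[a - 1][b - 1][s - 1])
                cost[a][b][s] = d + best
    final = cost[m - 1][n - 1]
    best_s = min(range(max_steps + 1), key=lambda s: final[s])
    if final[best_s] == inf:
        raise ValueError("max_steps is too small to reach the last cell")
    return final[best_s], best_s + 1


print(ldtw([1.0, 3.0], [1.0, 2.0, 3.0], max_steps=2))
```

Tracking the step count as a third index is what distinguishes this from ordinary DTW: a path can no longer lower its cost by taking arbitrarily many extra steps, which is the mechanism that suppresses pathological alignments.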

Preferably, step S4, calculating the grey relational degree between the preprocessed sequence and the reference sequence based on the LDTW distance, comprises:

Given the sequences X₀ = {x₀(1), x₀(2), ..., x₀(n)} and X_i = (x_i(1), x_i(2), ..., x_i(n)), the dimensions with missing data are preferably removed from the preprocessed sequence X_i. For example, if x_i(2) has no data, the value of x_i(3) becomes the new x_i(2), and so on. This gives the new preprocessed sequence X_i = (x_i(1), x_i(2), ..., x_i(m)), with m ≤ n, and the grey relational degree is calculated between the new preprocessed sequence and the reference sequence.
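Removing the missing dimensions amounts to filtering the sequence while keeping the order of the remaining values. In this sketch missing entries are assumed to be encoded as `None` or NaN, and the function name `drop_missing` is hypothetical.

```python
import math


def drop_missing(sequence):
    """Remove missing entries (None or NaN) and keep the remaining order."""
    return [
        v for v in sequence
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]


# X2 from Table 4, where the 2002 value is missing.
print(drop_missing([49.5, None, 62.4, 73.9, 87]))
```

The shortened sequence is passed on unchanged; nothing is imputed, so no artificial values enter the distance calculation.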

The grey relational degree between the two sequences is then defined as follows:

where min_{t₀} min_{t_i} |x₀(t₀) − x_i(t_i)| denotes the two-level minimum difference between the element values of X₀ and the element values of X_i; max_{t₀} max_{t_i} |x₀(t₀) − x_i(t_i)| denotes the two-level maximum difference between the element values of X₀ and the element values of X_i; ξ is the resolution coefficient, with ξ ∈ [0, 1]; LDTW(X₀, X_i) denotes the dynamic time warping distance between X₀ and X_i under limited path length; λ is the path length corresponding to LDTW(X₀, X_i); i ∈ {1, 2, ..., I}, where I is the number of sequences; t₀ denotes the t₀-th dimension and t_i the t_i-th dimension, with 1 ≤ t₀ ≤ n and 1 ≤ t_i ≤ m; m and n are numbers of dimensions, with m ≤ n.
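The defining formula is not reproduced in this text. The symbol definitions above, together with the quantity Δ(X₀, X_i) = LDTW(X₀, X_i)/λ used in the normality argument, are consistent with Deng's form in which the pointwise difference is replaced by the average distance along the warping path. Under that assumption the definition would read:

```latex
\gamma(X_0, X_i) =
\frac{\displaystyle \min_{t_0}\min_{t_i} \lvert x_0(t_0) - x_i(t_i) \rvert
      + \xi \max_{t_0}\max_{t_i} \lvert x_0(t_0) - x_i(t_i) \rvert}
     {\displaystyle \frac{\mathrm{LDTW}(X_0, X_i)}{\lambda}
      + \xi \max_{t_0}\max_{t_i} \lvert x_0(t_0) - x_i(t_i) \rvert}
```

This is a reconstruction, not the patent's verbatim formula; it should be checked against the original publication before being relied on.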

To verify the validity of the LDTW-based grey relational degree, it can be checked against the grey relational axioms, which are:

(1) Normality:

Let Δ(X₀, X_i) = LDTW(X₀, X_i)/λ. Because Δ(X₀, X_i), as the average distance along the warping path, lies between the two-level minimum difference and the two-level maximum difference, it follows that:

0 < γ(X₀, X_i) ≤ 1;

(2) Wholeness:

Since the dynamic-time-warping grey relational degree only measures the degree of association between X₀ and X_i and takes no other factors into account, the question of wholeness does not arise here;

(3) Pairwise symmetry:

By the properties of the LDTW distance, γ(X₀, X_i) = γ(X_i, X₀);

(4) Closeness:

LDTW(X₀, X_i) reflects the distance between X₀ and X_i; the smaller the distance, the larger γ(X₀, X_i), and hence the closer X₀ and X_i are.

It can thus be shown that the LDTW-based grey relational degree used in the present invention satisfies all four grey relational axioms.

Preferably, a suitable critical value ε is chosen according to the grey relational degrees, and the interval [0, 1] is divided by ε into a plurality of critical intervals, namely [0, ε], [ε, 2ε], ..., [1−ε, 1]. The grey relational degrees of the sequences are calculated, and if the grey relational degrees of two sequences fall in the same critical interval, the two sequences are clustered into one class. Specifically, let any two of the preprocessed sequences be:

X_p = {x_p(1), x_p(2), ..., x_p(n)};

X_q = {x_q(1), x_q(2), ..., x_q(n)},

where X_p denotes the p-th sequence and X_q the q-th sequence; x_p(n) denotes the n-th value of the p-th sequence and x_q(n) the n-th value of the q-th sequence; p ∈ {1, 2, ..., I} and q ∈ {1, 2, ..., I}. The critical value ε divides [0, 1] into the critical intervals [0, ε], [ε, 2ε], ..., [1−ε, 1]; when the grey relational degrees of X_p and X_q lie in the same interval, X_p and X_q are clustered into one class.
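Step S5 can be sketched as a simple binning of the grey relational degrees. The function name `cluster_by_interval` is hypothetical, and the sketch assumes half-open intervals [kε, (k+1)ε) with the last interval closed at 1, since the text does not say how values that fall exactly on a boundary are assigned. The value ε = 0.1 is chosen only for illustration; the three degrees are the ones reported in Example 2.

```python
import math


def cluster_by_interval(gammas, epsilon):
    """Group sequence names by the critical interval their degree falls in.

    gammas maps a sequence name to its grey relational degree in [0, 1].
    Returns a dict: interval index k -> list of names, where interval k
    covers [k * epsilon, (k + 1) * epsilon).
    """
    last = math.ceil(1 / epsilon) - 1  # index of the interval ending at 1
    clusters = {}
    for name, gamma in gammas.items():
        k = min(int(gamma / epsilon), last)
        clusters.setdefault(k, []).append(name)
    return clusters


degrees = {"X1": 0.8407, "X2": 0.8863, "X3": 0.9058}
print(cluster_by_interval(degrees, epsilon=0.1))
```

A smaller ε yields more, finer clusters; with ε = 0.05 each of the three sequences above would fall into its own interval.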

The present invention addresses the following problem: with the traditional Deng's grey relational degree, when data are missing, the longer sequence must be truncated or the missing data must be filled in by methods such as GM(1,1) model prediction, which corrupts the original data and makes the resulting grey relational degree inaccurate. The invention proposes a new method for calculating the grey relational degree that reduces the error of the similarity measure between two sequences and helps biologists study protein function.

Brief Description of the Drawings

Figure 1 is a flow chart of a preferred embodiment of the grey relational clustering method based on the LDTW distance of the present invention;

Figure 2 is an example of singularities produced by DTW;

Figure 3 compares the purity of the present invention with that of the traditional grey relational degree (no missing data);

Figure 4 compares the entropy of the present invention with that of the traditional grey relational degree (no missing data);

Figure 5 compares the purity of the present invention with that of the existing DTW grey relational degree (no missing data);

Figure 6 compares the entropy of the present invention with that of the existing DTW grey relational degree (no missing data);

Figure 7 compares the purity of the present invention with that of the traditional grey relational degree (missing data);

Figure 8 compares the entropy of the present invention with that of the traditional grey relational degree (missing data);

Figure 9 compares the purity of the present invention with that of the existing DTW grey relational degree (missing data);

Figure 10 compares the entropy of the present invention with that of the existing DTW grey relational degree (missing data).

Detailed Description

The grey relational clustering method based on the LDTW distance of the present invention is further described below with reference to specific embodiments and experimental data sets. As shown in Figure 1, the method comprises the following steps:

In one possible implementation, the preprocessed data of the data set can be obtained as follows.

Let the initial value image of each sequence be:

X_p′ = {x_p(1)/x_p(1), x_p(2)/x_p(1), …, x_p(n)/x_p(1)} (0 ≤ p ≤ I, x_p(1) ≠ 0)

where I is the number of sequences and n is the number of dimensions. X_p′ gives the initial value image of the p-th sequence in dimensions 1, 2, ..., n; for q ∈ {1, 2, ..., n}, x_p(q)/x_p(1) is the q-th component of the initial value image of the p-th sequence.
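The initial value image divides every value of a sequence by its first value. The sketch below uses the hypothetical function name `initial_value_image` and applies it to the GDP sequence X₀ listed in Table 1 of this document.

```python
def initial_value_image(sequence):
    """Divide each value by the first value of the sequence."""
    first = sequence[0]
    if first == 0:
        raise ValueError("the first value must be non-zero")
    return [v / first for v in sequence]


gdp = [109.7, 120.3, 135.8, 159.9, 183.1]  # X0 from Table 1
print([round(v, 4) for v in initial_value_image(gdp)])
# → [1.0, 1.0966, 1.2379, 1.4576, 1.6691]
```

Every image starts at 1, so sequences measured on very different scales, such as total GDP and a single industry's output, become directly comparable.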

Zero-starting-point image: after the initial value images are obtained, the differences between corresponding values of the comparison sequence and the reference sequence are calculated in turn, provided that the two sequences are expressed in the same units.

X_p′⁰ = {|x_p′⁰(1) − x₀′⁰(1)|, |x_p′⁰(2) − x₀′⁰(2)|, …, |x_p′⁰(n) − x₀′⁰(n)|} (1 ≤ p ≤ I)

where I is the number of sequences and n is the number of dimensions. X_p′⁰ gives the zero-starting-point image of the p-th sequence in dimensions 1, 2, ..., n; for q ∈ {1, 2, ..., n}, x_p′⁰(q) denotes the q-th component of the zero-starting-point image of the p-th sequence. Computing the initial value image and the zero-starting-point image are both part of preprocessing the data set.
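Read literally, the formula above takes the absolute difference between a comparison sequence and the reference sequence, component by component. The sketch below follows that reading; the function name `difference_sequence` is hypothetical, and the inputs are the initial value images of X₀ and X₃ as listed in Example 1.

```python
def difference_sequence(image_p, image_0):
    """Component-wise absolute difference between two equal-length images."""
    if len(image_p) != len(image_0):
        raise ValueError("both images must have the same length")
    return [abs(a - b) for a, b in zip(image_p, image_0)]


x0_image = [1, 1.0966, 1.2379, 1.4576, 1.6691]
x3_image = [1, 1.1256, 1.2915, 1.4574, 1.6368]
print([round(d, 4) for d in difference_sequence(x3_image, x0_image)])
# → [0, 0.029, 0.0536, 0.0002, 0.0323]
```

This pointwise comparison only works for sequences of equal length, which is the limitation the LDTW distance is meant to remove.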

Example 1

Table 1 gives the data for a worked example of calculating the grey relational degree based on the LDTW distance: China's gross domestic product X₀ and the output values of the primary industry X₁, the secondary industry X₂, and the tertiary industry X₃ for 2001 to 2005.

The initial value images of the sequences X₀, X₁, X₂, and X₃ are, respectively:

X′₀ = (1, 1.0966, 1.2379, 1.4576, 1.6691)

X′₁ = (1, 1.0452, 1.1032, 1.3548, 1.4903)

X′₂ = (1, 1.0444, 1.2606, 1.4929, 1.7576)

X′₃ = (1, 1.1256, 1.2915, 1.4574, 1.6368)

The zero-starting-point images of the sequences X₀, X₁, X₂, and X₃ are, respectively:

Table 1. China's gross domestic product and output value of each industry, 2001-2005 (unit: 100 billion yuan)

Sequence   2001    2002    2003    2004    2005
X₀         109.7   120.3   135.8   159.9   183.1
X₁         15.5    16.2    17.1    21      23.1
X₂         49.5    53.9    62.4    73.9    87
X₃         44.6    50.2    56.3    65      73

In one possible implementation, the LDTW distance matrix between sequences is obtained from the distance matrix of their zero-starting-point images as follows:

Table 2. Distance matrices between sequences of equal length

Let DM(X₀, X_i) be the distance matrix of the sequences X₀ and X_i, LDTW(X₀, X_i) the LDTW distance between X₀ and X_i, and λ the length of the warping path. When the LDTW distance between the preprocessed sequence and the reference sequence is calculated, a constraint is imposed on the warping path length, and the current length of the warping path is taken into account as an additional factor. The path length is the number of matrix cells the path passes through; number of steps = path length - 1, and s denotes the number of steps.

In one possible implementation, the grey relational degree model based on the LDTW distance is applied to the distance matrices between sequences as follows:

γ(X₀, X_i) is the grey relational degree of the sequences X₀ and X_i, where i takes the values 1, 2, 3. By definition, the grey relational degree of the LDTW-based model depends on the distance between the two sequences X₀ and X_i. Analyzing the error sources of existing grey relational degree calculations shows that computing this distance appropriately is therefore important for how well the model captures the similarity between sequences.

Here LDTW(X₀, X_i) is the LDTW distance between X₀ and X_i.

As shown in Table 3, the method of the present invention is compared with several traditional methods. Deng's relational degree, the relative relational degree, the DTW-distance grey relational degree, and the LDTW-distance grey relational degree all give consistent judgments of the degree of association between the sequences, in agreement with the qualitative analysis (the tertiary industry has the highest relational degree, the secondary industry the next highest, and the primary industry the lowest). The absolute relational degree is clearly wrong. Thus, when the sequences have equal length, the LDTW-GIM relational degree is reliable and its results are valid.

Table 3. Comparison with the results of other grey relational degrees for sequences of equal length

Example 2

To examine the performance of the LDTW-based grey relational degree model when sequence data are missing (that is, when the sequence lengths differ), some of the data in Table 1 were removed, as shown in Table 4. The initial value images of the sequences are:

X′₁ = (1, 1.0452, 1.1032, 1.3548, 1.4903)

X′₂ = (1, 1.2606, 1.4929, 1.7576)

X′₃ = (1, 1.1256, 1.4574, 1.6368)

The zero-starting-point images are:

Table 4. China's gross domestic product and output value of each industry, 2001-2005 (partly missing; unit: 100 billion yuan)

Sequence   2001    2002    2003    2004    2005
X₀         109.7   120.3   135.8   159.9   183.1
X₁         15.5    16.2    17.1    21      23.1
X₂         49.5    --      62.4    73.9    87
X₃         44.6    50.2    --      65      73

In one possible implementation, the LDTW distance matrix between sequences is obtained from the distance matrix of their zero-starting-point images as follows:

Table 5. LDTW distance matrices

Let DM(X₀, X_i) be the distance matrix of the sequences X₀ and X_i, LDTW(X₀, X_i) the LDTW distance between X₀ and X_i, and λ the length of the warping path.

In one possible implementation, the grey relational degree model based on the LDTW distance is applied to the distance matrices between sequences as follows:

γ(X₀, X₁) = 0.8407; γ(X₀, X₂) = 0.8863; γ(X₀, X₃) = 0.9058

As Table 6 shows, when the sequence lengths differ, manual padding has a considerable effect on the final conclusion. Deng's relational degree, the absolute relational degree, and the relative relational degree all produce misjudgments, whereas the LDTW-distance grey relational degree agrees with the qualitative analysis, which indicates that it adapts well to such data.

Table 6. Comparison of relational degree results for sequences of unequal length

Traditional grey relational degree calculation (reference: Liu Sifeng, Yang Yingjie, Wu Lifeng. Grey System Theory and Its Applications [M]. Science Press, 2014.):

where the two terms are the two-level minimum difference and the two-level maximum difference, respectively. Using the absolute difference directly as the distance between two sequences does not reflect their true distance. i denotes the i-th dimension and k the k-th dimension, with 1 ≤ i ≤ m and 1 ≤ k ≤ n.

现有的灰关联度计算方法(参考文献：Dai J,Hu F,Liu X.Research on greyincidence measurement method based on dynamic time warping distance[J].Journal of Grey System,2015,27(1):117-126.)：Existing calculation methods of gray relational degree (references: Dai J, Hu F, Liu X. Research on greyincidence measurement method based on dynamic time warping distance [J]. Journal of Gray System, 2015, 27(1): 117- 126.):

DTW(X₀,X_i)表示序列X₀和序列X_i的DTW距离；ξ为分辨系数，λ表示路径长度；DTW(X ₀ ,X _i ) represents the DTW distance between sequence X ₀ and sequence X _i ; ξ is the resolution coefficient, and λ represents the path length;

本发明的灰关联度计算方法：Gray relational calculation method of the present invention:

其中，表示序列X₀的元素的值和序列X_i中的元素的值的两级最小差，表示序列X₀的元素的值和序列X_i中的元素的值的两级最大差，ξ为分辨系数，且ξ∈[0,1]，LDTW(X₀,X_i)表示X₀和X_i的在有限长度下的动态时间弯曲距离，λ是LDTW(X₀,X_i)所对应的路径长度，i∈{1,2,...,I}，I为序列个数，t₀表示第t₀个维度，t_i表示第t_i个维度；1<t₀≤n，1<t_i≤m，n为维度个数，m为序列的个数。in, Denotes the two-level minimum difference between the value of an element of sequence X ₀ and the value of an element in sequence X _i , Indicates the two-level maximum difference between the value of the element of the sequence X ₀ and the value of the element in the sequence _Xi , ξ is the resolution coefficient, and ξ∈[0,1], LDTW(X ₀ ,X _i ) represents X ₀ and X The dynamic time warping distance of _i under finite length, λ is the path length corresponding to LDTW(X ₀ ,X _i ), i∈{1,2,...,I}, I is the number of sequences, t ₀ Indicates the t _0th dimension, t _i indicates the t _i th dimension; 1<t ₀ ≤n, 1<t _i ≤m, n is the number of dimensions, and m is the number of sequences.
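The formula image for the invention's grey relational degree does not survive in this text, so the following Python sketch is only one plausible reading of the symbol definitions above. It assumes Deng's form with the pointwise difference replaced by the LDTW distance averaged over the path length λ. The function name, its parameters, and the default ξ = 0.5 are hypothetical, and the LDTW distance and path length are taken as precomputed inputs.

```python
def ldtw_grey_relational_degree(x0, xi, ldtw_distance, path_length, xi_coef=0.5):
    """Grey relational degree between x0 and xi (ASSUMED form, not the patent's verbatim formula).

    x0, xi        : lists of numbers, possibly of different lengths
    ldtw_distance : precomputed LDTW(X0, Xi)
    path_length   : lambda, the number of cells on the LDTW warping path
    xi_coef       : resolution coefficient in [0, 1] (0.5 is an assumed default)
    """
    # Two-level minimum and maximum differences over all element pairs (t0, ti).
    diffs = [abs(a - b) for a in x0 for b in xi]
    d_min, d_max = min(diffs), max(diffs)

    # Assumption: the average cost per path cell stands in for the pointwise difference.
    mean_warp_cost = ldtw_distance / path_length

    denominator = mean_warp_cost + xi_coef * d_max
    if denominator == 0:
        # Identical constant sequences: treat them as perfectly related.
        return 1.0
    return (d_min + xi_coef * d_max) / denominator


if __name__ == "__main__":
    gamma = ldtw_grey_relational_degree([1.0, 1.0, 1.0], [0.5, 0.5, 1.0],
                                        ldtw_distance=1.0, path_length=3)
    print(round(gamma, 4))
```

Under this assumed form the result never exceeds 1, because every cell on the warping path costs at least the two-level minimum difference.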

The present invention improves formula (2) to obtain formula (3). DTW measures the similarity between two time series by warping the time axis to find the minimum distance, which lets it match sequences of unequal length. In addition, different time series may differ only by a shift along the time axis; that is, once the shift is undone, the two series coincide. In such cases the traditional Euclidean distance cannot effectively measure the distance (or similarity) between two time series. DTW, however, has the drawback of producing pathological alignments. Figure 2 shows a typical pathological alignment produced by DTW, in which several singular points can be observed (the red triangles). A singular point is a data point in one time series that is linked to a large portion of the other time series. Such an alignment is clearly not "correct", and it greatly degrades the accuracy of the similarity measure.

LDTW addresses this by limiting the total number of links between the two time series. The optimization still lets DTW decide how many links to assign to each data point and where to place them, rather than imposing rigid limits such as window constraints, but the total number of links is too small to form singular points. LDTW therefore allows more flexibility, avoids the risk of missing the correct alignment, judges the similarity between sequences more accurately, and facilitates the subsequent cluster analysis.

As one possible implementation, the grey relational degree model based on the LDTW distance is evaluated from the LDTW distance between the sequences and the warping path length λ, as follows:

Given the sequences X_i = (x_i(1), x_i(2), ..., x_i(n)) and X₀ = {x₀(1), x₀(2), ..., x₀(n)}, the distance matrix between X_i and X₀ is:

LDTW works by limiting the length of the warping path, so the warping path length is treated as an additional factor. For a more intuitive description, the number of steps is defined as path length - 1 and is denoted by s.
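The step-constrained recursion can be sketched in Python as a three-dimensional dynamic program over the two indices and the step count s. This is a minimal illustration, not the patent's implementation: the function name `ldtw`, the `max_steps` parameter, and the choice to return the smallest cost over all admissible step counts are assumptions.

```python
import math


def ldtw(x, y, max_steps):
    """Return (distance, path_length) for DTW restricted to at most max_steps steps.

    A step is a move to the next cell (right, down, or diagonal), so
    path_length = steps + 1, matching "steps = path length - 1" in the text.
    """
    m, n = len(x), len(y)
    min_steps = max(m, n) - 1            # an all-diagonal path plus the leftover moves
    if max_steps < min_steps:
        raise ValueError("max_steps is too small to align the two sequences")
    max_steps = min(max_steps, m + n - 2)  # no path can use more steps than this

    inf = math.inf
    # cost[i][j][s]: minimal cumulative cost of reaching cell (i, j) in exactly s steps.
    cost = [[[inf] * (max_steps + 1) for _ in range(n)] for _ in range(m)]
    cost[0][0][0] = abs(x[0] - y[0])

    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            local = abs(x[i] - y[j])
            for s in range(1, max_steps + 1):
                best = inf
                if j > 0:
                    best = min(best, cost[i][j - 1][s - 1])
                if i > 0:
                    best = min(best, cost[i - 1][j][s - 1])
                if i > 0 and j > 0:
                    best = min(best, cost[i - 1][j - 1][s - 1])
                if best < inf:
                    cost[i][j][s] = local + best

    # Pick the admissible step count with the smallest total cost.
    final = cost[m - 1][n - 1]
    best_s = min(range(max_steps + 1), key=lambda s: final[s])
    return final[best_s], best_s + 1


if __name__ == "__main__":
    print(ldtw([1, 2, 3], [1, 3], max_steps=3))
```

The extra step dimension is what distinguishes this from plain DTW: a tight `max_steps` rules out long detours in which one point is linked to many points of the other sequence.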

Example 3

The grey relational degree model based on the LDTW distance of the present invention can be used in bioinformatics to cluster biological data sequences and to better analyze the similarity of biological sequences. For example, finding the commonalities between similar proteins is of great significance for inferring their biological functions. Proteins with similar structures have similar functions, so proteins with similar functions are clustered into one class, which helps biologists study protein function.

The biological clustering method using the LDTW-distance grey relational degree model of the present invention includes clustering of yeast protein localization sites and clustering of abalone age. The yeast attributes are: mcg: McGeoch's method for signal sequence recognition; gvh: von Heijne's method for signal sequence recognition; alm: score of the ALOM membrane-spanning region prediction program; mit: discriminant analysis score of amino acid content; erl: presence of the "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen); pox: peroxisomal targeting signal at the C-terminus; vac: discriminant analysis score of amino acid content; nuc: discriminant analysis score of nuclear localization signals. For the clustering of abalone age: the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope, a tedious and time-consuming task. The abalone attributes are sex, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings (integer).

Based on the meaning of the biological attributes, the data are clustered according to their existing attribute sequences, specifically as follows:

To verify the effect of the present invention, it is compared experimentally with the prior art below.

The techniques involved in the comparative experiments are:

The LDTW-distance grey relational degree is the biological clustering scheme of the present invention.

Deng's grey relational degree, the absolute grey relational degree, and the relative grey relational degree are biological clustering schemes based on the traditional grey relational degree, which directly compute the distances between corresponding points of the sequences.

The DTW-distance grey relational degree is an existing biological clustering scheme that improves the traditional grey relational degree by replacing the direct computation of the distance between sequences with the dynamic time warping distance.

Following the evaluation metrics used in earlier clustering work, two metrics commonly used in the prior art are compared here, entropy and purity, defined as follows:

Entropy: for a cluster i, first compute P_ij, the probability that a member of cluster i belongs to class j: P_ij = m_ij / m_i, where m_i is the number of members in cluster i and m_ij is the number of members of cluster i that belong to class j. The entropy of each cluster is e_i = -Σ_{j=1..L} P_ij log₂ P_ij, where L is the number of classes. The entropy of the whole clustering is e = Σ_{i=1..K} (m_i / m) e_i, where K is the number of clusters and m is the total number of members in the clustering.

Purity: using the definition of P_ij above, the purity of cluster i is P_i = max_j(P_ij), and the purity of the whole clustering is purity = Σ_{i=1..K} (m_i / m) P_i, where K is the number of clusters and m is the total number of members in the clustering. Entropy reflects how mixed the classes are within the clusters, and purity reflects how concentrated each cluster is; the lower the entropy and the higher the purity, the better the clustering.
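The two metrics can be computed directly from a list of cluster assignments and a list of true class labels. The sketch below follows the definitions above, assuming a base-2 logarithm for the entropy; the function name and argument layout are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy_and_purity(cluster_labels, class_labels):
    """Return (entropy, purity) of a clustering against known class labels.

    cluster_labels[k] is the cluster assigned to member k,
    class_labels[k] is its true class.
    """
    m = len(cluster_labels)

    # Group the true classes of the members by assigned cluster.
    members = {}
    for cluster, klass in zip(cluster_labels, class_labels):
        members.setdefault(cluster, []).append(klass)

    entropy = 0.0
    purity = 0.0
    for classes in members.values():
        m_i = len(classes)
        # P_ij = m_ij / m_i for every class j present in cluster i.
        probs = [count / m_i for count in Counter(classes).values()]
        e_i = -sum(p * math.log2(p) for p in probs)
        entropy += (m_i / m) * e_i
        purity += (m_i / m) * max(probs)
    return entropy, purity


if __name__ == "__main__":
    print(cluster_entropy_and_purity([0, 0, 1, 1], ["a", "b", "a", "a"]))
```

Classes absent from a cluster contribute nothing to its entropy, so only the classes that actually occur are enumerated.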

This experiment uses the yeast data set from UCI. The yeast attributes are: mcg: McGeoch's method for signal sequence recognition; gvh: von Heijne's method for signal sequence recognition; alm: score of the ALOM membrane-spanning region prediction program; mit: discriminant analysis score of amino acid content; erl: presence of the "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen); pox: peroxisomal targeting signal at the C-terminus; vac: discriminant analysis score of amino acid content; nuc: discriminant analysis score of nuclear localization signals. The data are clustered, and the quality of the clustering is then judged against the protein localization sites, as shown in Table 7.

Table 7. Attribute sequences of yeast

Example 4

To further verify the clustering effect of the present invention, the abalone data set from UCI is also used to cluster abalone age. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope, a tedious and time-consuming task. The abalone attributes are sex, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings (integer), as shown in Table 8.

Table 8. Attribute sequences of abalone

The present invention was compared with the prior art using VS2012 and a SQL Server 2012 database. For the two data sets without missing data, the purity comparison with the traditional grey relational degrees is shown in Figure 3. Here, "without missing data" means that in the preprocessed sequence X_i = {x_i(1), x_i(2), ..., x_i(n)} every dimension has a value, i.e. every x_i(j) corresponds to a data value. Correspondingly, "missing data" means that one or more dimensions of the preprocessed sequence X_i = {x_i(1), x_i(2), ..., x_i(n)} have no data value. Both terms carry the same meaning in the figures that follow and are not explained again.

The entropy comparison with the traditional grey relational degrees is shown in Figure 4, the purity comparison with the DTW grey relational degree in Figure 5, and the entropy comparison with the DTW grey relational degree in Figure 6. In Figures 3 and 4 the abscissa (GID) lists the grey relational degrees being compared and the ordinate gives the purity or entropy value; Yeast and Abalone denote the yeast and abalone data sets, here and in the later figures. In Figures 5 and 6 the abscissa gives the purity or entropy value and the ordinate the data set. On average, purity improved by 14.81% and entropy decreased by 42.73%. On both data sets the LDTW-distance grey relational degree has the highest purity and the lowest entropy, indicating a better clustering result. Because LDTW is itself an improvement on DTW, the two naturally yield identical distances for some sequence pairs.

To verify the performance of the LDTW-distance grey relational degree when data are missing, a missing rate of 0.1 was applied to both data sets and the clustering was repeated. For the two data sets with missing data, the purity comparison with the traditional grey relational degrees is shown in Figure 7, the entropy comparison with the traditional grey relational degrees in Figure 8, the purity comparison with the DTW grey relational degree in Figure 9, and the entropy comparison with the DTW grey relational degree in Figure 10. In Figures 7 and 8 the abscissa lists the grey relational degrees being compared and the ordinate gives the purity or entropy value; in Figures 9 and 10 the abscissa gives the purity or entropy value and the ordinate the data set. On average, purity improved by 19.14% and entropy decreased by 62.19%. When the data are incomplete, the purity of the absolute, relative, and Deng's grey relational degrees drops quickly and their entropy rises markedly, which clearly harms the final clustering. The LDTW grey relational degree does not pad the missing data, so no interfering information is introduced; it handles sequences of unequal length effectively, and the final clustering is essentially unaffected. The LDTW-distance grey relational degree still has the highest purity and the lowest entropy, indicating a better clustering result.

In data mining, the LDTW-distance grey relational degree method provided by the present invention computes the similarity of biological sequences more accurately and yields more accurate clustering results, which helps in studying the properties and functions of genes and proteins and in exploring the mysteries of biology.

The embodiments above describe the purpose, technical solutions, and advantages of the present invention in further detail. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A grey relational clustering method based on LDTW distance, characterized in that the method comprises:

S1, processing the original data set to obtain preprocessed sequences;

S2, forming a reference sequence from the maximum value of each dimension of the preprocessed sequences;

S3, calculating the LDTW distance between each preprocessed sequence and the reference sequence, and the warping path length λ of the LDTW distance;

S4, calculating, based on the LDTW distance, the grey relational degree between each preprocessed sequence and the reference sequence;

S5, determining a critical value ε according to the grey relational degree results, dividing the grey relational degree interval into a plurality of critical intervals according to the critical value ε, and, if the grey relational degrees of two sequences fall into the same critical interval, clustering the two sequences into one class.

2. The grey relational clustering method based on LDTW distance of claim 1, wherein processing the original data set to obtain a preprocessed sequence comprises: preprocessing the original sequence to convert it into dimensionless data, the dimensionless data being the preprocessed sequence; specifically:

Original sequence: X'_i = (x'_i(1), x'_i(2), ..., x'_i(n));

Preprocessed sequence: X_i = {x_i(1), x_i(2), ..., x_i(n)};

where X'_i denotes the i-th original sequence and x'_i(j) the data value of the j-th dimension of the i-th original sequence; X_i denotes the i-th preprocessed sequence and x_i(j) the value of the j-th dimension of the i-th preprocessed sequence;

where x'_max(j) denotes the maximum value of the j-th dimension over the original sequences and x'_min(j) the minimum value of the j-th dimension over the original sequences; i ∈ {1, 2, ..., I}; j ∈ {1, 2, ..., n}; I is the number of sequences and n is the number of dimensions.

3. The grey relational clustering method based on LDTW distance of claim 1, wherein forming the reference sequence from the maximum value of each dimension of the preprocessed sequences comprises:

where X₀ denotes the reference sequence and x₀(j) the value of the j-th dimension of the reference sequence; x_max(j) denotes the maximum value of the j-th dimension over the preprocessed sequences; j ∈ {1, 2, ..., n}, and n is the number of dimensions.
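As an illustration of the preprocessing and reference-sequence steps, the following Python sketch assumes that the lost normalization formula is the usual min-max scaling built from x'_max(j) and x'_min(j), and that a missing value is stored as None. Both are assumptions, and the function names are hypothetical.

```python
def min_max_normalize(raw):
    """Scale each dimension (column) of raw to [0, 1]; None marks a missing value.

    ASSUMPTION: x_i(j) = (x'_i(j) - x'_min(j)) / (x'_max(j) - x'_min(j)).
    Every column is assumed to hold at least one non-missing value.
    """
    n = len(raw[0])
    out = [[None] * n for _ in raw]
    for j in range(n):
        column = [row[j] for row in raw if row[j] is not None]
        lo, hi = min(column), max(column)
        span = hi - lo
        for r, row in enumerate(raw):
            if row[j] is not None:
                # A constant column carries no information; map it to 0.0.
                out[r][j] = (row[j] - lo) / span if span else 0.0
    return out


def reference_sequence(sequences):
    """Reference sequence X0: the per-dimension maximum over all sequences."""
    n = len(sequences[0])
    return [max(row[j] for row in sequences if row[j] is not None)
            for j in range(n)]


if __name__ == "__main__":
    normalized = min_max_normalize([[1.0, 10.0], [3.0, None], [2.0, 30.0]])
    print(normalized)
    print(reference_sequence(normalized))
```

With min-max scaling the reference sequence is 1.0 in every non-constant dimension, so each sequence is compared against an ideal profile.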

4. The grey relational clustering method based on LDTW distance of claim 1, wherein calculating the LDTW distance between the preprocessed sequence and the reference sequence comprises:

removing the dimensions with missing data from the preprocessed sequence to obtain a new preprocessed sequence X_i = (x_i(1), x_i(2), ..., x_i(m)), and calculating the LDTW distance between the new preprocessed sequence and the reference sequence; specifically:

The reference sequence is: X₀ = {x₀(1), x₀(2), ..., x₀(n)};

The distance matrix between the new preprocessed sequence X_i and the reference sequence X₀ is:

where dis(x_i(m), x₀(n)) = |x_i(m) - x₀(n)| is the distance between x_i(m) and x₀(n); x_i(m) denotes the value of the m-th dimension of the i-th preprocessed sequence; x₀(n) denotes the value of the n-th dimension of the reference sequence;

The LDTW distance between the new preprocessed sequence X_i and the reference sequence X₀ is:

where D(x_i(m), x₀(n), s) denotes the LDTW distance between X₀ and X_i; min denotes taking the minimum; D(x_i(m), x₀(n-1), s-1) denotes the distance from x₀(n-1) to x_i(m); D(x_i(m-1), x₀(n), s-1) denotes the distance from x₀(n) to x_i(m-1); D(x_i(m-1), x₀(n-1), s-1) denotes the distance from x₀(n-1) to x_i(m-1); i ∈ {1, 2, ..., I}; s denotes the number of steps; I is the number of sequences; m and n are the numbers of dimensions, with m ≤ n.

5. The grey relational clustering method based on LDTW distance of claim 1, wherein calculating the warping path length λ between the preprocessed sequence and the reference sequence comprises:

when the LDTW distance between the preprocessed sequence and the reference sequence is calculated, imposing a constraint on the warping path length and treating the current length of the warping path as an additional factor; the number of matrix cells traversed by the path is the path length λ.

6. The grey relational clustering method based on LDTW distance of claim 5, wherein the path length λ further satisfies:

λ = s + 1

where s denotes the number of steps, i.e. the path length equals the number of steps plus one.

7. The grey relational clustering method based on LDTW distance of claim 1, wherein calculating, based on the LDTW distance, the grey relational degree between the preprocessed sequence and the reference sequence comprises:

removing the dimensions with missing data from the preprocessed sequence to obtain a new preprocessed sequence X_i = (x_i(1), x_i(2), ..., x_i(m)), and calculating the grey relational degree between the new preprocessed sequence and the reference sequence; specifically:

The reference sequence is: X₀ = {x₀(1), x₀(2), ..., x₀(n)};

The new preprocessed sequence is: X_i = (x_i(1), x_i(2), ..., x_i(m));

The grey relational degree between the reference sequence and the new preprocessed sequence is defined as follows:

where the two-level minimum difference and the two-level maximum difference are taken between the element values of sequence X₀ and the element values of sequence X_i; ξ is the resolution coefficient, with ξ ∈ [0,1]; LDTW(X₀,X_i) denotes the dynamic time warping distance between X₀ and X_i under limited warping path length; λ is the path length corresponding to LDTW(X₀,X_i); i ∈ {1, 2, ..., I}, where I is the number of sequences; t₀ denotes the t₀-th dimension and t_i the t_i-th dimension; 1<t₀≤n, 1<t_i≤m; m and n are the numbers of dimensions, with m ≤ n.

8. The grey relational clustering method based on LDTW distance of claim 1, wherein determining the critical value ε according to the grey relational degree results, dividing the grey relational degree interval into a plurality of critical intervals according to the critical value ε, and clustering two sequences into one class if their grey relational degrees fall into the same critical interval comprises: calculating the grey relational degrees of the two sequences and, if they fall into the same critical interval, clustering the two sequences into one class; specifically, among the preprocessed sequences X_i, let any two sequences be denoted as:

X_p = {x_p(1), x_p(2), ..., x_p(n)};

X_q = {x_q(1), x_q(2), ..., x_q(n)};

where X_p denotes the p-th sequence and X_q the q-th sequence; x_p(n) denotes the n-th datum of the p-th sequence and x_q(n) the n-th datum of the q-th sequence; p ∈ {1, 2, ..., I} and q ∈ {1, 2, ..., I}; the critical value ε is used to divide [0,1] into a plurality of critical intervals, namely [0,ε], [ε,2ε], ..., [1-ε,1]; when the grey relational degrees of the two sequences X_p and X_q lie in the same interval, X_p and X_q are clustered into one class.
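The interval-based grouping of claim 8 can be sketched in a few lines of Python. The function name is hypothetical, and because the claim writes the intervals with closed ends on both sides, the handling of boundary values is an assumption: here each interval is treated as half-open, [kε, (k+1)ε), with 1.0 placed in the last interval.

```python
import math


def cluster_by_critical_interval(degrees, epsilon):
    """Group sequence indices by the critical interval their grey relational degree falls in.

    degrees : grey relational degrees in [0, 1], one per sequence
    epsilon : critical value; [0, 1] is split into intervals of width epsilon
    Returns a dict mapping interval index k to the list of sequence indices in it.
    """
    n_intervals = math.ceil(1 / epsilon)
    clusters = {}
    for index, gamma in enumerate(degrees):
        # ASSUMPTION: half-open intervals; a degree of exactly 1.0 joins the last one.
        k = min(int(gamma / epsilon), n_intervals - 1)
        clusters.setdefault(k, []).append(index)
    return clusters


if __name__ == "__main__":
    # Sample inputs only: the three degrees quoted in the description.
    print(cluster_by_critical_interval([0.8407, 0.8863, 0.9058], epsilon=0.1))
```

A smaller ε produces more, finer clusters: with ε = 0.05 the same three degrees land in three different intervals.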