
CN107463952B - An object material classification method based on multimodal fusion deep learning - Google Patents

An object material classification method based on multimodal fusion deep learning

Info

Publication number
CN107463952B
CN107463952B (application CN201710599106.1A; published as CN107463952A)
Authority
CN
China
Prior art keywords
tactile
matrix
modality
scale
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710599106.1A
Other languages
Chinese (zh)
Other versions
CN107463952A (en)
Inventor
Liu Huaping (刘华平)
Fang Jing (方静)
Liu Xiaonan (刘晓楠)
Sun Fuchun (孙富春)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201710599106.1A priority Critical patent/CN107463952B/en
Publication of CN107463952A publication Critical patent/CN107463952A/en
Application granted granted Critical
Publication of CN107463952B publication Critical patent/CN107463952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention relates to an object material classification method based on multimodal fusion deep learning, and belongs to the technical fields of computer vision, artificial intelligence and material classification. The method is a multimodal fusion method built on an extreme learning machine with multi-scale local receptive fields. It fuses the perceptual information of the different modalities of an object's material (visual images, tactile acceleration signals and tactile sound signals) and finally classifies the object's material correctly. The method not only uses multi-scale local receptive fields to extract highly representative features of real, complex materials, but also fuses the information of the individual modalities effectively so that the modalities complement one another. The method improves the robustness and accuracy of complex material classification and therefore has greater applicability and generality.

Description

An object material classification method based on multimodal fusion deep learning

Technical Field

The invention relates to an object material classification method based on multimodal fusion deep learning, and belongs to the technical fields of computer vision, artificial intelligence and material classification.

Background Art

The world contains a great variety of materials, which can be divided into plastics, metals, ceramics, glass, wood, textiles, stone, paper, rubber, foam and other classes. Recently, object material classification has attracted great attention from environmental protection, industry and academia. For example, material classification can be used effectively for material recycling. The four pillars of packaging materials are paper, plastic, metal and glass, and different market demands call for packaging of different materials: for long-distance transport with no special requirements on transport quality, paper, paperboard and corrugated box board are generally used; food packaging must meet hygiene standards, so packaging of ready-to-eat foods such as pastries should use carton board, light-proof and moisture-proof goods such as table salt should be canned, and fast-food boxes can be made from natural plant fibres; the rational use of decorative materials is the key to successful interior decoration. Given these needs, it is very necessary to develop a method that can automatically classify the material of objects.

The mainstream approach to object material classification uses visual images, which contain rich information, but two objects with extremely similar appearances cannot be distinguished from visual images alone. Suppose there are two objects, a piece of rough red paper and a red plastic foil: visual images have little power to tell them apart. Faced with such cases, the human brain instinctively fuses the perceptual features of the same object from different modalities in order to classify the object's material. Inspired by this, a computer can likewise be made to classify an object's material automatically by using information from the object's different modalities at the same time.

There are also published techniques for object material classification, such as Chinese patent application CN105005787A, a material classification method based on joint sparse coding of the tactile information of a dexterous hand. That invention uses only tactile sequences for material classification and does not combine the multiple modalities of a material. It has been observed that classifying object materials from visual images alone cannot robustly capture material properties such as hardness or roughness. One can assume that when a rigid tool is dragged or moved over the surfaces of different objects, it produces vibrations and sounds of different frequencies, so tactile information complementary to vision can be used to classify object materials. However, how to combine the visual modality with the tactile modalities effectively remains a challenging problem.

Summary of the Invention

The purpose of the present invention is to propose an object material classification method based on multimodal fusion deep learning, which performs multimodal information fusion for object material classification on the basis of an extreme learning machine with multi-scale local receptive fields, so as to improve the robustness and accuracy of the classification and to fuse the multiple modalities of an object's material effectively for material classification.

The object material classification method based on multimodal fusion deep learning proposed by the present invention comprises the following steps:

(1) Let the number of training samples be N1 and the number of training-sample material classes be M1, with a label recorded for each material class of training samples, where 1 ≤ M1 ≤ N1. Collect the visual images I1, tactile accelerations A1 and tactile sounds S1 of all N1 training samples, and build a dataset D1 containing I1, A1 and S1; the image size of I1 is 320×480.

Let the number of objects to be classified be N2 and the number of material classes of the objects to be classified be M2, with a label recorded for each class of objects to be classified, where 1 ≤ M2 ≤ M1. Collect the visual images I2, tactile accelerations A2 and tactile sounds S2 of all N2 objects to be classified, and build a dataset D2 containing I2, A2 and S2; the image size of I2 is 320×480;

(2) Perform visual-image preprocessing on the visual images, tactile-acceleration preprocessing on the tactile acceleration signals and tactile-sound preprocessing on the tactile sound signals of datasets D1 and D2, obtaining visual images, tactile acceleration spectrograms and tactile sound spectrograms respectively, as follows:

(2-1) Down-sample the 320×480 images I1 and I2 to obtain visual images of size 32×32×3 for I1 and I2;

(2-2) Use the short-time Fourier transform to convert the tactile accelerations A1 and A2 to the frequency domain; the Hamming window length in the short-time Fourier transform is 500, the window offset is 100 and the sampling frequency is 10 kHz, which yields the spectrograms of A1 and A2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image and down-sample it, obtaining tactile acceleration spectrum images of size 32×32×3 for A1 and A2;

(2-3) Use the short-time Fourier transform in the same way to convert the tactile sounds S1 and S2 to the frequency domain (Hamming window length 500, window offset 100, sampling frequency 10 kHz), which yields the spectrograms of S1 and S2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image and down-sample it, obtaining sound spectrum images of size 32×32×3 for S1 and S2;
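For illustration, the preprocessing of step (2) can be sketched as follows. This is a minimal sketch assuming NumPy/SciPy; the function names, the resizing strategy and the replication to three channels are illustrative assumptions, not part of the invention.

```python
# Minimal sketch of the preprocessing in step (2), assuming NumPy/SciPy.
import numpy as np
from scipy.signal import stft
from scipy.ndimage import zoom

def spectrogram_image(signal_1d, fs=10_000, win_len=500, hop=100, out_hw=(32, 32)):
    """STFT with a Hamming window (length 500, offset 100), keep at most the
    first 500 low-frequency channels, then down-sample the magnitude spectrogram."""
    _, _, Z = stft(signal_1d, fs=fs, window="hamming",
                   nperseg=win_len, noverlap=win_len - hop)
    mag = np.abs(Z)[:500, :]                         # low-frequency channels only
    small = zoom(mag, (out_hw[0] / mag.shape[0], out_hw[1] / mag.shape[1]))
    return np.repeat(small[:, :, None], 3, axis=2)   # replicate to 32x32x3 (assumption)

def downsample_visual(img, out_hw=(32, 32)):
    """Down-sample a 320x480x3 visual image to 32x32x3."""
    return zoom(img, (out_hw[0] / img.shape[0], out_hw[1] / img.shape[1], 1))
```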

(3) Obtain the convolutional features of the visual modality, the tactile acceleration modality and the tactile sound modality through multi-scale feature mapping, as follows:

(3-1) Input the 32×32×3 visual images of I1 and I2, the 32×32×3 tactile acceleration spectrum images of A1 and A2 and the 32×32×3 sound spectrum images of S1 and S2 obtained in step (2) into the first layer of the neural network, i.e. the input layer; the size of the input image is d×d. The local receptive fields in this network have Ψ scale channels, of sizes r1, r2, …, rΨ; each scale channel generates K different input weights, so that Ψ×K feature maps are generated at random. Denote the randomly generated initial weights of the Φth scale channel for the visual image, the tactile acceleration spectrogram and the sound spectrogram as Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ, formed column by column from â^{I,init}_{Φ,ζ}, â^{A,init}_{Φ,ζ} and â^{S,init}_{Φ,ζ}, where the superscript I denotes the visual modality of the training samples and of the objects to be classified, A denotes their tactile acceleration modality and S their tactile sound modality, Â^{init} denotes an initial weight matrix and â^{init}_ζ the initial weight generating the ζth feature map, with 1 ≤ Φ ≤ Ψ and 1 ≤ ζ ≤ K. The local receptive field of the Φth scale has size rΦ×rΦ, so that

Â^{init}_Φ ∈ R^{rΦ²×K},  â^{init}_{Φ,ζ} ∈ R^{rΦ²},

and all K feature maps of the Φth scale channel have size (d−rΦ+1)×(d−rΦ+1);

(3-2) Use singular value decomposition to orthogonalize the initial weight matrices Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ of the Φth scale channel, obtaining the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ. Each column â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} of Â^I_Φ, Â^A_Φ and Â^S_Φ is an orthogonal basis vector of Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ respectively. The input weights a^I_{Φ,ζ}, a^A_{Φ,ζ} and a^S_{Φ,ζ} of the ζth feature map of the Φth scale channel are the rΦ×rΦ square matrices formed from â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} respectively.

The convolutional features of node (i, j) in the ζth feature map of the Φth scale channel are computed for the visual, tactile acceleration and tactile sound modalities as

c^I_{i,j,ζ,Φ}(x) = Σ_{m=1}^{rΦ} Σ_{n=1}^{rΦ} x_{i+m−1,j+n−1} · a^I_{Φ,ζ}(m,n),  and analogously for the A and S modalities,

Φ = 1, 2, 3, …, Ψ,
i, j = 1, …, (d−rΦ+1),
ζ = 1, 2, 3, …, K,

where c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} denote the convolutional features of node (i, j) of the ζth feature map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities, and x is the matrix corresponding to node (i, j);
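A minimal sketch of steps (3-1)/(3-2) for one modality and one scale channel is given below, assuming NumPy. It generates random initial weights, orthogonalizes them (one common way of realising the SVD step), and performs the valid convolution on a single-channel d×d input; the function names and the single-channel simplification are assumptions.

```python
# Minimal sketch of steps (3-1)/(3-2): random weights, SVD orthogonalization,
# valid convolution of a single-channel d x d input. NumPy only; illustrative.
import numpy as np

def orthogonal_input_weights(r, K, rng=None):
    """Random r*r x K initial weights, replaced by orthonormal columns spanning
    the same space (assumes r*r >= K)."""
    rng = rng or np.random.default_rng(0)
    A_init = rng.standard_normal((r * r, K))
    U, _, Vt = np.linalg.svd(A_init, full_matrices=False)
    return U @ Vt

def convolutional_feature_maps(X, A_orth, r):
    """One (d-r+1) x (d-r+1) feature map per column of A_orth (valid convolution)."""
    d = X.shape[0]
    K = A_orth.shape[1]
    C = np.empty((d - r + 1, d - r + 1, K))
    for zeta in range(K):
        a = A_orth[:, zeta].reshape(r, r)          # square input weight of map zeta
        for i in range(d - r + 1):
            for j in range(d - r + 1):
                C[i, j, zeta] = np.sum(X[i:i + r, j:j + r] * a)
    return C
```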

(4) Perform multi-scale square-root pooling on the convolutional features of the visual, tactile acceleration and tactile sound modalities. There are Ψ pooling scales, of sizes e1, e2, …, eΨ; at the Φth scale the pooling size eΦ is the distance between the pooling centre and the edge. The pooling map has the same size as the feature map, (d−rΦ+1)×(d−rΦ+1). From the convolutional features obtained in step (3), the pooling features are computed as

h^I_{p,q,ζ,Φ} = sqrt( Σ_{i=p−eΦ}^{p+eΦ} Σ_{j=q−eΦ}^{q+eΦ} (c^I_{i,j,ζ,Φ})² ),  and analogously for the A and S modalities,

p, q = 1, …, (d−rΦ+1);

if node (i, j) lies outside the range 1, …, (d−rΦ+1), then c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} are all taken to be zero,

Φ = 1, 2, 3, …, Ψ,
ζ = 1, 2, 3, …, K,

where h^I_{p,q,ζ,Φ}, h^A_{p,q,ζ,Φ} and h^S_{p,q,ζ,Φ} denote the pooling features of node (p, q) of the ζth pooling map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities respectively;
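The square-root pooling of step (4) can be sketched as follows, assuming NumPy; out-of-range convolutional values are treated as zero, as in the text above.

```python
# Minimal sketch of the square-root pooling in step (4), assuming NumPy.
import numpy as np

def sqrt_pooling(C, e):
    """C: (d-r+1, d-r+1, K) convolutional features; e: pooling size (centre-to-edge
    distance). Returns pooling maps of the same size as the feature maps."""
    size = C.shape[0]
    H = np.zeros_like(C)
    for zeta in range(C.shape[2]):
        for p in range(size):
            for q in range(size):
                i0, i1 = max(0, p - e), min(size, p + e + 1)
                j0, j1 = max(0, q - e), min(size, q + e + 1)
                H[p, q, zeta] = np.sqrt(np.sum(C[i0:i1, j0:j1, zeta] ** 2))
    return H
```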

(5) From the pooling features above, obtain the fully connected feature vectors of the three modalities, as follows:

(5-1) Concatenate all pooling features of the pooling maps of the visual image modality, the tactile acceleration modality and the tactile sound modality of the ωth training sample from step (4) into the row vectors h^I_ω, h^A_ω and h^S_ω respectively, where 1 ≤ ω ≤ N1;

(5-2) Traverse the N1 training samples, repeating step (5-1), to obtain the row-vector combinations of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples, denoted

H^I = [h^I_1; …; h^I_{N1}],  H^A = [h^A_1; …; h^A_{N1}],  H^S = [h^S_1; …; h^S_{N1}],

where H^I is the combined feature-vector matrix of the visual modality, H^A the feature matrix of the tactile acceleration modality and H^S the feature-vector matrix of the tactile sound modality;

(6) Perform multimodal fusion of the fully connected feature vectors of the three modalities to obtain the fused mixing matrix, as follows:

(6-1) Input the row vectors of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples from step (5) into the mixing layer and combine them, obtaining a mixing matrix H = [H^I, H^A, H^S];

(6-2) Rearrange the mixed row vector of each sample in the mixing matrix H of step (6-1) to generate a two-dimensional fused mixing matrix of size d′×d″, where d′ is the length of the two-dimensional matrix and is chosen so that d′×d″ equals the length of each sample's mixed row vector;
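Steps (5) and (6) can be sketched together as follows, assuming NumPy: each modality's pooling maps are flattened into one row per sample, the three modalities are concatenated, and each fused row is reshaped into a d′×d″ hybrid matrix. The shapes and names are assumptions for illustration.

```python
# Minimal sketch of steps (5)-(6): full connection and multimodal fusion. NumPy only.
import numpy as np

def fully_connect(pool_maps_per_sample):
    """pool_maps_per_sample: list (over samples) of lists of (h, w, K) pooling maps
    for one modality. Returns an N1 x L matrix, one row vector per sample."""
    return np.stack([np.concatenate([P.ravel() for P in maps])
                     for maps in pool_maps_per_sample])

def fuse_and_reshape(H_I, H_A, H_S, d_prime):
    """H = [H^I, H^A, H^S]; reshape each fused row into a d' x d'' matrix,
    with d' * d'' equal to the fused row length."""
    H = np.concatenate([H_I, H_A, H_S], axis=1)
    if H.shape[1] % d_prime:
        raise ValueError("d' must divide the length of the fused row vector")
    return H.reshape(H.shape[0], d_prime, H.shape[1] // d_prime)
```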

(7) Input the fused mixing matrix obtained in step (6) into the hybrid network layer of the neural network and obtain the multimodal hybrid convolutional features through multi-scale feature mapping, as follows:

(7-1) Input the fused mixing matrix of step (6-2), of size d′×d″, into the hybrid network. The hybrid network has Ψ′ scale channels, of sizes r1, r2, …, rΨ′; each scale channel generates K′ different input weights, so that Ψ′×K′ hybrid feature maps are generated at random. Denote the randomly generated hybrid initial weight of the Φ′th scale channel as Â^{hybrid,init}_{Φ′}, formed column by column from â^{hybrid,init}_{Φ′,ζ′}, where the superscript hybrid denotes the fusion of the three modalities, Â^{hybrid,init} denotes the initial weight of the hybrid network and â^{hybrid,init}_{ζ′} the initial weight generating the ζ′th hybrid feature map, with 1 ≤ Φ′ ≤ Ψ′ and 1 ≤ ζ′ ≤ K′. The local receptive field of the Φ′th scale channel has size rΦ′×rΦ′, so that Â^{hybrid,init}_{Φ′} ∈ R^{rΦ′²×K′} and â^{hybrid,init}_{Φ′,ζ′} ∈ R^{rΦ′²};

the ζ′th feature map of the Φ′th scale channel then has size (d′−rΦ′+1)×(d″−rΦ′+1);

(7-2) Use singular value decomposition to orthogonalize the initial weight matrix Â^{hybrid,init}_{Φ′} of the Φ′th scale channel, obtaining the orthogonal matrix Â^{hybrid}_{Φ′}; each column â^{hybrid}_{Φ′,ζ′} of Â^{hybrid}_{Φ′} is an orthogonal basis vector of Â^{hybrid,init}_{Φ′}. The input weight a^{hybrid}_{Φ′,ζ′} of the ζ′th feature map of the Φ′th scale channel is the rΦ′×rΦ′ square matrix formed from â^{hybrid}_{Φ′,ζ′}.

The hybrid convolutional feature of convolution node (i′, j′) in the ζ′th feature map of the Φ′th scale channel is computed as

c^{hybrid}_{i′,j′,ζ′,Φ′}(x′) = Σ_{m=1}^{rΦ′} Σ_{n=1}^{rΦ′} x′_{i′+m−1,j′+n−1} · a^{hybrid}_{Φ′,ζ′}(m,n),

Φ′ = 1, 2, 3, …, Ψ′,
i′ = 1, …, (d′−rΦ′+1), j′ = 1, …, (d″−rΦ′+1),
ζ′ = 1, 2, 3, …, K′,

where c^{hybrid}_{i′,j′,ζ′,Φ′} is the hybrid convolutional feature of node (i′, j′) in the ζ′th feature map of the Φ′th scale channel and x′ is the matrix corresponding to node (i′, j′);

(8) Perform hybrid multi-scale square-root pooling on the hybrid convolutional features above. There are Ψ′ pooling scales, of sizes e1, e2, …, eΨ′; at the Φ′th scale the pooling map has the same size as the feature map, (d′−rΦ′+1)×(d″−rΦ′+1). From the hybrid convolutional features obtained in step (7), the hybrid pooling features are computed as

h^{hybrid}_{p′,q′,ζ′,Φ′} = sqrt( Σ_{i′=p′−eΦ′}^{p′+eΦ′} Σ_{j′=q′−eΦ′}^{q′+eΦ′} (c^{hybrid}_{i′,j′,ζ′,Φ′})² ),

p′ = 1, …, (d′−rΦ′+1), q′ = 1, …, (d″−rΦ′+1);

if node (i′, j′) lies outside the feature map, then c^{hybrid}_{i′,j′,ζ′,Φ′} is taken to be zero,

Φ′ = 1, 2, 3, …, Ψ′,
ζ′ = 1, 2, 3, …, K′,

where h^{hybrid}_{p′,q′,ζ′,Φ′} denotes the hybrid pooling feature of the combined node (p′, q′) of the ζ′th pooling map of the Φ′th scale channel;

(9) From the hybrid pooling features above, repeat step (5) to fully connect the hybrid pooling feature vectors of the different scales, obtaining the combined feature matrix H^{hybrid} of the hybrid network, where K′ is the number of different feature maps generated by each scale channel;

(10) From the combined feature matrix H^{hybrid} of the hybrid network obtained in step (9), compute the output weights β of the neural network for the training samples according to the number of training samples N1:

if N1 is not larger than the dimension of the combined feature vector,  β = (H^{hybrid})ᵀ (I/C + H^{hybrid}(H^{hybrid})ᵀ)⁻¹ T;

otherwise,  β = (I/C + (H^{hybrid})ᵀ H^{hybrid})⁻¹ (H^{hybrid})ᵀ T,

where T is the matrix of expected (target) values of the training-sample labels, C is the regularization coefficient, which may take any value (in one embodiment of the invention C = 5), I is the identity matrix, and the superscript T denotes matrix transposition;
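A minimal sketch of step (10) is given below, assuming NumPy; it uses the two standard regularized closed forms of the extreme learning machine, selected by comparing N1 with the combined feature dimension, and assumes one-hot targets in T (an assumption consistent with ELM classifiers, not stated explicitly in the text).

```python
# Minimal sketch of step (10): regularized ELM output weights. NumPy only.
import numpy as np

def elm_output_weights(H_hybrid, T, C=5.0):
    """H_hybrid: N1 x L combined feature matrix; T: N1 x M1 target matrix."""
    N1, L = H_hybrid.shape
    if N1 <= L:
        # beta = H^T (I/C + H H^T)^(-1) T
        return H_hybrid.T @ np.linalg.solve(np.eye(N1) / C + H_hybrid @ H_hybrid.T, T)
    # beta = (I/C + H^T H)^(-1) H^T T
    return np.linalg.solve(np.eye(L) / C + H_hybrid.T @ H_hybrid, H_hybrid.T @ T)
```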

(11) Using the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ obtained by orthogonalizing the initial weights of the three modalities in step (3), apply the procedure of steps (3)–(9) to the preprocessed dataset D2 to be classified, obtaining the three-modality hybrid feature vector H_test of the samples to be classified;

(12) From the training-sample output weights β of step (10) and the three-modality hybrid feature vector H_test of step (11), compute the predicted labels με of the N2 samples to be classified as

με = H_test β,  1 ≤ ε ≤ M,

thereby realizing object material classification based on multimodal fusion deep learning.
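Steps (11)–(12) can be sketched as follows, assuming NumPy; taking the argmax over the output columns to obtain the class index is an assumption consistent with one-hot ELM targets, not an explicit statement of the text.

```python
# Minimal sketch of steps (11)-(12): score and classify the test samples. NumPy only.
import numpy as np

def predict_material(H_test, beta):
    """H_test: N2 x L hybrid feature matrix of the objects to be classified."""
    mu = H_test @ beta              # mu = H_test * beta, one row of scores per object
    return np.argmax(mu, axis=1)    # index of the predicted material class
```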

The object material classification method based on multimodal fusion deep learning proposed by the present invention has the following characteristics and advantages:

1. The extreme learning machine method based on multi-scale local receptive fields proposed by the present invention can perceive a material with local receptive fields of several scales, extract diverse features, and thereby classify complex object materials.

2. The deep learning method of the present invention, an extreme learning machine with multi-scale local receptive fields, integrates feature learning and image classification instead of relying on hand-designed feature extractors, so the algorithm is applicable to the classification of objects of most different materials.

3. The method of the present invention is a multimodal fusion deep learning method built on an extreme learning machine with multi-scale local receptive fields; it can effectively fuse the information of the three modalities of an object's material so that the modalities complement one another, improving the robustness and accuracy of material classification.

Brief Description of the Drawings

Figure 1 is a flow chart of the method of the present invention.

Figure 2 is a flow chart of the extreme learning machine based on multi-scale local receptive fields used in the method of the present invention.

Figure 3 is a flow chart of the fusion of the different modalities in the multi-scale local receptive field extreme learning machine of the present invention.

Detailed Description of the Embodiments

The flow chart of the object material classification method based on multimodal fusion deep learning proposed by the present invention is shown in Figure 1. The method consists of four main parts: the visual image modality, the tactile acceleration modality, the tactile sound modality and the hybrid network. It comprises the following steps:

(1) Let the number of training samples be N1 and the number of training-sample material classes be M1, with a label recorded for each material class of training samples, where 1 ≤ M1 ≤ N1. Collect the visual images I1, tactile accelerations A1 and tactile sounds S1 of all N1 training samples, and build a dataset D1 containing I1, A1 and S1; the image size of I1 is 320×480.

Let the number of objects to be classified be N2 and the number of material classes of the objects to be classified be M2, with a label recorded for each class of objects to be classified, where 1 ≤ M2 ≤ M1. Collect the visual images I2, tactile accelerations A2 and tactile sounds S2 of all N2 objects to be classified, and build a dataset D2 containing I2, A2 and S2; the image size of I2 is 320×480. The tactile accelerations A1 and A2 are one-dimensional signals collected with a sensor while a rigid object slides over the material surface; the tactile sounds S1 and S2 are likewise one-dimensional signals recorded with a microphone while a rigid object slides over the material surface;

(2) Perform visual-image preprocessing on the visual images, tactile-acceleration preprocessing on the tactile acceleration signals and tactile-sound preprocessing on the tactile sound signals of datasets D1 and D2, obtaining visual images, tactile acceleration spectrograms and tactile sound spectrograms respectively, as follows:

(2-1) Down-sample the 320×480 images I1 and I2 to obtain visual images of size 32×32×3 for I1 and I2;

(2-2) Use the short-time Fourier transform to convert the tactile accelerations A1 and A2 to the frequency domain; the Hamming window length in the short-time Fourier transform is 500, the window offset is 100 and the sampling frequency is 10 kHz, which yields the spectrograms of A1 and A2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image; this spectrum image retains most of the energy of the tactile signal. Down-sample it to obtain tactile acceleration spectrum images of size 32×32×3 for A1 and A2;

(2-3) Use the short-time Fourier transform in the same way to convert the tactile sounds S1 and S2 to the frequency domain (Hamming window length 500, window offset 100, sampling frequency 10 kHz), which yields the spectrograms of S1 and S2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image, which retains most of the energy of the tactile signal, and down-sample it to obtain sound spectrum images of size 32×32×3 for S1 and S2;

(3) Obtain the convolutional features of the visual modality, the tactile acceleration modality and the tactile sound modality through multi-scale feature mapping, as follows:

(3-1) Input the 32×32×3 visual images of I1 and I2, the 32×32×3 tactile acceleration spectrum images of A1 and A2 and the 32×32×3 sound spectrum images of S1 and S2 obtained in step (2) into the first layer of the neural network, i.e. the input layer; the size of the input image is d×d. The local receptive fields in this network have Ψ scale channels, of sizes r1, r2, …, rΨ; each scale channel generates K different input weights, so that Ψ×K feature maps are generated at random. Denote the randomly generated initial weights of the Φth scale channel for the visual image, the tactile acceleration spectrogram and the sound spectrogram as Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ, formed column by column from â^{I,init}_{Φ,ζ}, â^{A,init}_{Φ,ζ} and â^{S,init}_{Φ,ζ}, where the superscript I denotes the visual modality of the training samples and of the objects to be classified, A denotes their tactile acceleration modality and S their tactile sound modality, Â^{init} denotes an initial weight matrix and â^{init}_ζ the initial weight generating the ζth feature map, with 1 ≤ Φ ≤ Ψ and 1 ≤ ζ ≤ K. The local receptive field of the Φth scale has size rΦ×rΦ, so that

Â^{init}_Φ ∈ R^{rΦ²×K},  â^{init}_{Φ,ζ} ∈ R^{rΦ²},

and all K feature maps of the Φth scale channel have size (d−rΦ+1)×(d−rΦ+1);

(3-2) Use singular value decomposition to orthogonalize the initial weight matrices Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ of the Φth scale channel, obtaining the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ; orthogonalized input weights can extract more complete features. Each column â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} of Â^I_Φ, Â^A_Φ and Â^S_Φ is an orthogonal basis vector of Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ respectively. The input weights a^I_{Φ,ζ}, a^A_{Φ,ζ} and a^S_{Φ,ζ} of the ζth feature map of the Φth scale channel are the rΦ×rΦ square matrices formed from â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} respectively.

The convolutional features of node (i, j) in the ζth feature map of the Φth scale channel are computed for the visual, tactile acceleration and tactile sound modalities as

c^I_{i,j,ζ,Φ}(x) = Σ_{m=1}^{rΦ} Σ_{n=1}^{rΦ} x_{i+m−1,j+n−1} · a^I_{Φ,ζ}(m,n),  and analogously for the A and S modalities,

Φ = 1, 2, 3, …, Ψ,
i, j = 1, …, (d−rΦ+1),
ζ = 1, 2, 3, …, K,

where c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} denote the convolutional features of node (i, j) of the ζth feature map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities, and x is the matrix corresponding to node (i, j);

(4) Perform multi-scale square-root pooling on the convolutional features of the visual, tactile acceleration and tactile sound modalities. There are Ψ pooling scales, of sizes e1, e2, …, eΨ; at the Φth scale the pooling size eΦ is the distance between the pooling centre and the edge, as shown in Figure 2. The pooling map has the same size as the feature map, (d−rΦ+1)×(d−rΦ+1). From the convolutional features obtained in step (3), the pooling features are computed as

h^I_{p,q,ζ,Φ} = sqrt( Σ_{i=p−eΦ}^{p+eΦ} Σ_{j=q−eΦ}^{q+eΦ} (c^I_{i,j,ζ,Φ})² ),  and analogously for the A and S modalities,

p, q = 1, …, (d−rΦ+1);

if node (i, j) lies outside the range 1, …, (d−rΦ+1), then c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} are all taken to be zero,

Φ = 1, 2, 3, …, Ψ,
ζ = 1, 2, 3, …, K,

where h^I_{p,q,ζ,Φ}, h^A_{p,q,ζ,Φ} and h^S_{p,q,ζ,Φ} denote the pooling features of node (p, q) of the ζth pooling map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities respectively;

(5) From the pooling features above, obtain the fully connected feature vectors of the three modalities, as follows:

(5-1) Concatenate all pooling features of the pooling maps of the visual image modality, the tactile acceleration modality and the tactile sound modality of the ωth training sample from step (4) into the row vectors h^I_ω, h^A_ω and h^S_ω respectively, where 1 ≤ ω ≤ N1;

(5-2) Traverse the N1 training samples, repeating step (5-1), to obtain the row-vector combinations of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples, denoted

H^I = [h^I_1; …; h^I_{N1}],  H^A = [h^A_1; …; h^A_{N1}],  H^S = [h^S_1; …; h^S_{N1}],

where H^I is the combined feature-vector matrix of the visual modality, H^A the feature matrix of the tactile acceleration modality and H^S the feature-vector matrix of the tactile sound modality;

(6) Perform multimodal fusion of the fully connected feature vectors of the three modalities to obtain the fused mixing matrix, as follows:

(6-1) Input the row vectors of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples from step (5) into the mixing layer and combine them, obtaining a mixing matrix H = [H^I, H^A, H^S];

(6-2) Rearrange the mixed row vector of each sample in the mixing matrix H of step (6-1) to generate a two-dimensional fused mixing matrix of size d′×d″, as shown in Figure 3, where d′ is the length of the two-dimensional matrix and is chosen so that d′×d″ equals the length of each sample's mixed row vector;

(7) Input the fused mixing matrix obtained in step (6) into the hybrid network layer of the neural network and obtain the multimodal hybrid convolutional features through multi-scale feature mapping, as follows:

(7-1) Input the fused mixing matrix of step (6-2), of size d′×d″, into the hybrid network. The hybrid network has Ψ′ scale channels, of sizes r1, r2, …, rΨ′; each scale channel generates K′ different input weights, so that Ψ′×K′ hybrid feature maps are generated at random. Denote the randomly generated hybrid initial weight of the Φ′th scale channel as Â^{hybrid,init}_{Φ′}, formed column by column from â^{hybrid,init}_{Φ′,ζ′}, where the superscript hybrid denotes the fusion of the three modalities, Â^{hybrid,init} denotes the initial weight of the hybrid network and â^{hybrid,init}_{ζ′} the initial weight generating the ζ′th hybrid feature map, with 1 ≤ Φ′ ≤ Ψ′ and 1 ≤ ζ′ ≤ K′. The local receptive field of the Φ′th scale channel has size rΦ′×rΦ′, so that Â^{hybrid,init}_{Φ′} ∈ R^{rΦ′²×K′} and â^{hybrid,init}_{Φ′,ζ′} ∈ R^{rΦ′²};

the ζ′th feature map of the Φ′th scale channel then has size (d′−rΦ′+1)×(d″−rΦ′+1);

(7-2) Use singular value decomposition to orthogonalize the initial weight matrix Â^{hybrid,init}_{Φ′} of the Φ′th scale channel, obtaining the orthogonal matrix Â^{hybrid}_{Φ′}; orthogonalized input weights can extract more complete features. Each column â^{hybrid}_{Φ′,ζ′} of Â^{hybrid}_{Φ′} is an orthogonal basis vector of Â^{hybrid,init}_{Φ′}. The input weight a^{hybrid}_{Φ′,ζ′} of the ζ′th feature map of the Φ′th scale channel is the rΦ′×rΦ′ square matrix formed from â^{hybrid}_{Φ′,ζ′}.

The hybrid convolutional feature of convolution node (i′, j′) in the ζ′th feature map of the Φ′th scale channel is computed as

c^{hybrid}_{i′,j′,ζ′,Φ′}(x′) = Σ_{m=1}^{rΦ′} Σ_{n=1}^{rΦ′} x′_{i′+m−1,j′+n−1} · a^{hybrid}_{Φ′,ζ′}(m,n),

Φ′ = 1, 2, 3, …, Ψ′,
i′ = 1, …, (d′−rΦ′+1), j′ = 1, …, (d″−rΦ′+1),
ζ′ = 1, 2, 3, …, K′,

where c^{hybrid}_{i′,j′,ζ′,Φ′} is the hybrid convolutional feature of node (i′, j′) in the ζ′th feature map of the Φ′th scale channel and x′ is the matrix corresponding to node (i′, j′);

(8) Perform hybrid multi-scale square-root pooling on the hybrid convolutional features above. There are Ψ′ pooling scales, of sizes e1, e2, …, eΨ′; at the Φ′th scale the pooling map has the same size as the feature map, (d′−rΦ′+1)×(d″−rΦ′+1). From the hybrid convolutional features obtained in step (7), the hybrid pooling features are computed as

h^{hybrid}_{p′,q′,ζ′,Φ′} = sqrt( Σ_{i′=p′−eΦ′}^{p′+eΦ′} Σ_{j′=q′−eΦ′}^{q′+eΦ′} (c^{hybrid}_{i′,j′,ζ′,Φ′})² ),

p′ = 1, …, (d′−rΦ′+1), q′ = 1, …, (d″−rΦ′+1);

if node (i′, j′) lies outside the feature map, then c^{hybrid}_{i′,j′,ζ′,Φ′} is taken to be zero,

Φ′ = 1, 2, 3, …, Ψ′,
ζ′ = 1, 2, 3, …, K′,

where h^{hybrid}_{p′,q′,ζ′,Φ′} denotes the hybrid pooling feature of the combined node (p′, q′) of the ζ′th pooling map of the Φ′th scale channel;

(9) From the hybrid pooling features above, repeat step (5) to fully connect the hybrid pooling feature vectors of the different scales, obtaining the combined feature matrix H^{hybrid} of the hybrid network, where K′ is the number of different feature maps generated by each scale channel;

(10) From the combined feature matrix H^{hybrid} of the hybrid network obtained in step (9), compute the output weights β of the neural network for the training samples according to the number of training samples N1:

if N1 is not larger than the dimension of the combined feature vector,  β = (H^{hybrid})ᵀ (I/C + H^{hybrid}(H^{hybrid})ᵀ)⁻¹ T;

otherwise,  β = (I/C + (H^{hybrid})ᵀ H^{hybrid})⁻¹ (H^{hybrid})ᵀ T,

where T is the matrix of expected (target) values of the training-sample labels, C is the regularization coefficient, which may take any value (in one embodiment of the invention C = 5), I is the identity matrix, and the superscript T denotes matrix transposition;

(11) Using the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ obtained by orthogonalizing the initial weights of the three modalities in step (3), obtain the three-modality hybrid feature vector H_test of the samples to be classified from the preprocessed dataset D2: with step (3), the convolutional feature vectors of the three modalities of the objects to be classified are obtained; with step (4), their pooling feature vectors of the three modalities; with step (5), their fully connected feature vectors of the three modalities; with step (6), their fused mixing matrices after multimodal fusion; with step (7), their multimodal hybrid convolutional features; with step (8), their multimodal hybrid pooling features; and with step (9), the three-modality hybrid feature vector H_test of the objects to be classified.

(12) From the training-sample output weights β of step (10) and the three-modality hybrid feature vector H_test of step (11), compute the predicted labels με of the N2 samples to be classified as

με = H_test β,  1 ≤ ε ≤ M,

thereby realizing object material classification based on multimodal fusion deep learning.

Claims (1)

1. an object material classification method based on multi-mode fusion deep learning is characterized by comprising the following steps:
(1) let the number of training samples be N1The training sample material type is M1Each class of material training sample is marked with a label of
Figure FDA0002241220340000011
Wherein 1 is less than or equal to M1≤N1Separately collecting all N1Visual image I of a training sample1Tactile acceleration A1And a tactile sound S1Establishing an inclusion I1、A1And S1Data set D of1,I1The image size of (2) is 320 × 480;
setting the number of objects to be classified as N2The kind of the material of the object to be classified is M2Each class of object to be classified is labeled as
Figure FDA0002241220340000012
Wherein 1 is less than or equal to M2≤M1Separately collecting all N2Visual image I of an object to be classified2Tactile acceleration A2And a tactile sound S2Establishing an inclusion I2、A2And S2Data set D of2,I2The image size of (2) is 320 × 480;
(2) for the above data set D1And a data set D2The method comprises the following steps of carrying out visual image preprocessing on a visual image, carrying out tactile acceleration preprocessing on a tactile acceleration signal and carrying out tactile sound preprocessing on a tactile sound signal to respectively obtain a visual image, a tactile acceleration spectrogram and a tactile sound spectrogram, and comprises the following steps:
(2-1) image I with image size of 320X 480 by using down-sampling method1And image I2Down-sampling to obtain I1And I2A visual image of size 32 × 32 × 3;
(2-2) separately converting the tactile acceleration A into a plurality of tactile accelerations A by short-time Fourier transform1And tactile acceleration A2Converting to frequency domain, the window length of Hamming window in short-time Fourier transform is 500, the window offset is 100, the sampling frequency is 10kHz, and the tactile acceleration A is obtained respectively1And tactile acceleration A2The first 500 low-frequency channels are selected from the spectrogram to be used as spectrum images, and the spectrum images are subjected to down-sampling to obtain A1And A2A haptic acceleration spectrum image of size 32 × 32 × 3;
(2-3) separately converting the tactile sounds S by short-time Fourier transform1And a tactile sound S2Conversion to frequency domain, short timeThe window length of Hamming window in Fourier transform is 500, window offset is 100, sampling frequency is 10kHz, and tactile sound S is obtained respectively1And a tactile sound S2The first 500 low-frequency channels are selected from the spectrogram to be used as spectrum images, and the spectrum images are subjected to down-sampling to obtain S1And S2A sound spectrum image of size 32 × 32 × 3;
(3) obtaining convolution characteristics of a visual modality, a tactile acceleration modality and a tactile sound modality through multi-scale feature mapping, and comprising the following steps:
(3-1) subjecting the I obtained in the step (2) to1And I2A 32X 3 visual image, A1And A2Magnitude of 32 × 32 × 3 and S1And S2The size of the input image is d × d × 3, the local receptive field in the neural network has Ψ scale channels, and the Ψ scale channels have r sizes respectively1,r2,…,rΨGenerating K different input weights for each scale channel so as to randomly generate psi multiplied by K feature maps, and recording initial weights of a visual image, a tactile acceleration frequency spectrogram and a sound frequency spectrogram of a phi scale channel randomly generated by a neural network as
Figure FDA0002241220340000021
And
Figure FDA0002241220340000022
Figure FDA0002241220340000023
and
Figure FDA0002241220340000024
are respectively composed of
Figure FDA0002241220340000025
And
Figure FDA0002241220340000026
the method comprises the steps of composing column by column, wherein an upper corner mark I represents visual modals of a training sample and an object to be classified, an upper corner mark A represents a tactile acceleration modals of the training sample and the object to be classified, S represents a tactile sound modals of the training sample and the object to be classified,
Figure FDA0002241220340000027
it is shown that the initial weight is,
Figure FDA0002241220340000028
representing the initial weight for generating the zeta-th feature map, phi is more than or equal to 1 and less than or equal to psi, zeta is more than or equal to 1 and less than or equal to K, and the size of the phi-th scale local receptive field is rΦ×rΦ
Figure FDA0002241220340000029
Figure FDA00022412203400000210
And obtaining the size (d-r) of all K characteristic maps of the phi-th scale channelΦ+1)×(d-rΦ+1);
(3-2) using the singular value decomposition method to orthogonalize the initial weight matrices $\hat{A}_{init}^{I,\Phi}$, $\hat{A}_{init}^{A,\Phi}$ and $\hat{A}_{init}^{S,\Phi}$ of the Φ-th scale channel, obtaining the orthogonal matrices $\hat{A}^{I,\Phi}$, $\hat{A}^{A,\Phi}$ and $\hat{A}^{S,\Phi}$; each column $\hat{a}_{\zeta}^{I,\Phi}$, $\hat{a}_{\zeta}^{A,\Phi}$ and $\hat{a}_{\zeta}^{S,\Phi}$ of these matrices is an orthogonal basis of $\hat{A}_{init}^{I,\Phi}$, $\hat{A}_{init}^{A,\Phi}$ and $\hat{A}_{init}^{S,\Phi}$ respectively; the input weights $a_{\zeta}^{I,\Phi}$, $a_{\zeta}^{A,\Phi}$ and $a_{\zeta}^{S,\Phi}$ of the ζ-th feature map of the Φ-th scale channel are the $r_\Phi\times r_\Phi$ square matrices formed column by column from $\hat{a}_{\zeta}^{I,\Phi}$, $\hat{a}_{\zeta}^{A,\Phi}$ and $\hat{a}_{\zeta}^{S,\Phi}$ respectively;
calculating the convolution feature of node (i, j) in the ζ-th feature map of the Φ-th scale channel of the visual, tactile acceleration and tactile sound modalities respectively by using the following formula:

$$c_{i,j,\zeta}^{\,m,\Phi}(x)=\sum_{u=1}^{r_\Phi}\sum_{v=1}^{r_\Phi} x_{i+u-1,\;j+v-1}\cdot a_{u,v,\zeta}^{\,m,\Phi},\qquad m\in\{I,A,S\},\quad i,j=1,\dots,(d-r_\Phi+1),$$

wherein $c_{i,j,\zeta}^{I,\Phi}(x)$, $c_{i,j,\zeta}^{A,\Phi}(x)$ and $c_{i,j,\zeta}^{S,\Phi}(x)$ respectively represent the convolution feature of node (i, j) of the ζ-th feature map in the Φ-th scale channel of the visual modality, the tactile acceleration modality and the tactile sound modality, and x is the matrix corresponding to node (i, j);
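A minimal sketch of step (3) follows, assuming NumPy and SciPy; `make_orthogonal_weights` and `convolution_features` are hypothetical helper names, and the shapes follow the usual local-receptive-field extreme learning machine construction (K is assumed not to exceed $r_\Phi^2$).

```python
# Hedged sketch of steps (3-1)/(3-2): random multi-scale input weights, their SVD
# orthogonalization, and the "valid" convolution producing (d - r_phi + 1)^2 maps.
import numpy as np
from scipy.signal import correlate2d

def make_orthogonal_weights(r_phi, K, rng=None):
    """Return K orthogonalized receptive fields of size r_phi x r_phi."""
    rng = np.random.default_rng() if rng is None else rng
    A_init = rng.standard_normal((r_phi * r_phi, K))     # one column per feature map
    U, _, _ = np.linalg.svd(A_init, full_matrices=False)  # orthogonal basis of the columns
    return [U[:, z].reshape(r_phi, r_phi) for z in range(U.shape[1])]

def convolution_features(img2d, kernels):
    """img2d: one d x d input channel; kernels: list of r_phi x r_phi input weights."""
    return [correlate2d(img2d, k, mode='valid') for k in kernels]
```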
(4) performing multi-scale square-root pooling on the convolution features of the visual modality, the tactile acceleration modality and the tactile sound modality; the pooling sizes have Ψ scales, whose magnitudes are e_1, e_2, …, e_Ψ respectively; the pooling size e_Φ at the Φ-th scale indicates the distance between the pooling centre and the edge; the pooling map has the same size as the feature map, namely $(d-r_\Phi+1)\times(d-r_\Phi+1)$; the pooling feature is calculated from the convolution feature obtained in step (3) using the following formula:

$$h_{p,q,\zeta}^{\,m,\Phi}=\sqrt{\sum_{i=p-e_\Phi}^{p+e_\Phi}\ \sum_{j=q-e_\Phi}^{q+e_\Phi}\left(c_{i,j,\zeta}^{\,m,\Phi}\right)^{2}},\qquad m\in\{I,A,S\},\quad p,q=1,\dots,(d-r_\Phi+1),$$

if node i or node j lies outside the range (0, (d − r_Φ + 1)), then $c_{i,j,\zeta}^{I,\Phi}$, $c_{i,j,\zeta}^{A,\Phi}$ and $c_{i,j,\zeta}^{S,\Phi}$ are all taken as zero, Φ = 1, 2, 3, …, Ψ, ζ = 1, 2, 3, …, K, wherein $h_{p,q,\zeta}^{I,\Phi}$, $h_{p,q,\zeta}^{A,\Phi}$ and $h_{p,q,\zeta}^{S,\Phi}$ respectively represent the pooling feature of node (p, q) of the ζ-th pooling map in the Φ-th scale channel of the visual modality, the tactile acceleration modality and the tactile sound modality;
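A hedged sketch of the square-root pooling of step (4), treating out-of-range convolution nodes as zero as stated above; `sqrt_pool` is a hypothetical helper name.

```python
# Hedged sketch of step (4): square-root pooling of one convolution feature map.
import numpy as np

def sqrt_pool(c_map, e):
    """c_map: convolution map of one feature map; e: pooling radius of this scale."""
    h, w = c_map.shape
    padded = np.zeros((h + 2 * e, w + 2 * e))     # zero border for out-of-range nodes
    padded[e:e + h, e:e + w] = c_map ** 2
    pooled = np.empty_like(c_map, dtype=float)
    for p in range(h):
        for q in range(w):
            pooled[p, q] = np.sqrt(padded[p:p + 2 * e + 1, q:q + 2 * e + 1].sum())
    return pooled
```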
(5) obtaining the fully connected feature vectors of the three modalities according to the pooling features, which comprises the following steps:
(5-1) connecting all the pooling features of the pooling maps of the visual image modality, the tactile acceleration modality and the tactile sound modality of the ω-th training sample in step (4) into row vectors $h_{\omega}^{I}$, $h_{\omega}^{A}$ and $h_{\omega}^{S}$ respectively, wherein 1 ≤ ω ≤ N1;
(5-2) traversing the N1 training samples and repeating step (5-1) to obtain the row-vector combinations of the visual image modality, the tactile acceleration modality and the tactile sound modality of the N1 training samples, recorded as

$$H^{I}=\begin{bmatrix}h_{1}^{I}\\ \vdots\\ h_{N_{1}}^{I}\end{bmatrix},\qquad H^{A}=\begin{bmatrix}h_{1}^{A}\\ \vdots\\ h_{N_{1}}^{A}\end{bmatrix},\qquad H^{S}=\begin{bmatrix}h_{1}^{S}\\ \vdots\\ h_{N_{1}}^{S}\end{bmatrix},$$

wherein $H^{I}$ is the combined feature-vector matrix of the visual modality, $H^{A}$ is the combined feature-vector matrix of the tactile acceleration modality, and $H^{S}$ is the combined feature-vector matrix of the tactile sound modality;
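A hedged sketch of step (5), flattening the pooling maps of one modality into a row vector per sample and stacking the rows; the helper names are assumptions.

```python
# Hedged sketch of step (5): full connection of the pooling features of one modality.
import numpy as np

def sample_row_vector(pool_maps):
    """pool_maps: list of pooling maps (all scales, all feature maps) of one sample."""
    return np.concatenate([m.ravel() for m in pool_maps])

def modality_matrix(per_sample_pool_maps):
    """per_sample_pool_maps: list over the N1 samples -> N1 x L modality matrix."""
    return np.vstack([sample_row_vector(maps) for maps in per_sample_pool_maps])
```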
(6) performing multi-modal fusion on the fully connected feature vectors of the three modalities to obtain a multi-modal fused mixing matrix, which comprises the following steps:
(6-1) inputting the row vectors of the visual image modality, the tactile acceleration modality and the tactile sound modality of the N1 training samples obtained in step (5) into the mixing layer, and combining them to obtain the mixing matrix $H=[H^{I},H^{A},H^{S}]$;
(6-2) rearranging the mixed row vector of each sample in the mixing matrix H of step (6-1) to generate a multi-modal fused two-dimensional mixing matrix of size d′ × d″, wherein d′ is the length (number of rows) of the two-dimensional matrix, its value range being such that d′ × d″ covers the length of the mixed row vector of one sample;
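A hedged sketch of step (6); the near-square choice of d′ and the zero-padding used when d′ × d″ exceeds the row-vector length are assumptions, since the text above only fixes that each mixed row vector is rearranged into a d′ × d″ matrix.

```python
# Hedged sketch of step (6): concatenating the three modality matrices and
# reshaping each sample's mixed row vector into a d' x d'' two-dimensional matrix.
import numpy as np

def mixing_matrices(H_I, H_A, H_S):
    H = np.hstack([H_I, H_A, H_S])        # N1 x (sum of the three modality lengths)
    length = H.shape[1]
    d1 = int(np.floor(np.sqrt(length)))   # assumed near-square layout
    d2 = int(np.ceil(length / d1))
    out = []
    for row in H:
        padded = np.zeros(d1 * d2)
        padded[:length] = row             # zero-pad the tail if d1 * d2 > length
        out.append(padded.reshape(d1, d2))
    return out                            # one d' x d'' matrix per sample
```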
(7) inputting the multi-modal fused mixing matrix obtained in step (6) into the mixing network layer of the neural network, and obtaining the multi-modal mixed convolution features through multi-scale feature mapping, which comprises the following steps:
(7-1) inputting the multi-modal fused mixing matrix obtained in step (6-2), of size d′ × d″, into the mixing network; the mixing network has Ψ′ scale channels whose sizes are r_1, r_2, …, r_{Ψ′} respectively, and each scale channel generates K′ different input weights, so that Ψ′ × K′ mixed feature maps are randomly generated; the mixed initial weight of the Φ′-th scale channel randomly generated by the mixing network is recorded as $\hat{A}_{init}^{hybrid,\Phi'}$, which is composed column by column of $\hat{a}_{init,\zeta'}^{hybrid,\Phi'}$, wherein the superscript hybrid denotes the tri-modal fusion, $\hat{A}_{init}^{hybrid,\Phi'}$ denotes an initial weight of the mixing network, and $\hat{a}_{init,\zeta'}^{hybrid,\Phi'}$ denotes the initial weight generating the ζ′-th mixed feature map, with 1 ≤ Φ′ ≤ Ψ′ and 1 ≤ ζ′ ≤ K′; the size of the Φ′-th scale channel local receptive field is $r_{\Phi'}\times r_{\Phi'}$, so that

$$\hat{A}_{init}^{hybrid,\Phi'}\in\mathbb{R}^{r_{\Phi'}^{2}\times K'},\qquad \hat{a}_{init,\zeta'}^{hybrid,\Phi'}\in\mathbb{R}^{r_{\Phi'}^{2}},$$

and the size of the ζ′-th feature map of the Φ′-th scale channel is $(d'-r_{\Phi'}+1)\times(d''-r_{\Phi'}+1)$;
(7-2) using the singular value decomposition method to orthogonalize the initial weight matrix $\hat{A}_{init}^{hybrid,\Phi'}$ of the Φ′-th scale channel, obtaining the orthogonal matrix $\hat{A}^{hybrid,\Phi'}$; each column $\hat{a}_{\zeta'}^{hybrid,\Phi'}$ of $\hat{A}^{hybrid,\Phi'}$ is an orthogonal basis of $\hat{A}_{init}^{hybrid,\Phi'}$; the input weight $a_{\zeta'}^{hybrid,\Phi'}$ of the ζ′-th feature map of the Φ′-th scale channel is the $r_{\Phi'}\times r_{\Phi'}$ square matrix formed from $\hat{a}_{\zeta'}^{hybrid,\Phi'}$;
calculating the mixed convolution feature of convolution node (i′, j′) in the ζ′-th feature map of the Φ′-th scale channel by using the following formula:

$$c_{i',j',\zeta'}^{\,hybrid,\Phi'}(x')=\sum_{u=1}^{r_{\Phi'}}\sum_{v=1}^{r_{\Phi'}} x'_{i'+u-1,\;j'+v-1}\cdot a_{u,v,\zeta'}^{\,hybrid,\Phi'},$$

wherein $c_{i',j',\zeta'}^{hybrid,\Phi'}(x')$ is the mixed convolution feature of convolution node (i′, j′) in the ζ′-th feature map of the Φ′-th scale channel, and x′ is the matrix corresponding to node (i′, j′);
(8) performing mixed multi-scale square-root pooling on the mixed convolution features; the pooling sizes have Ψ′ scales, whose magnitudes are e_1, e_2, …, e_{Ψ′} respectively; the pooling map at the Φ′-th scale has the same size as the feature map, namely $(d'-r_{\Phi'}+1)\times(d''-r_{\Phi'}+1)$; the mixed pooling feature is calculated from the mixed convolution feature obtained in step (7) using the following formula:

$$h_{p',q',\zeta'}^{\,hybrid,\Phi'}=\sqrt{\sum_{i'=p'-e_{\Phi'}}^{p'+e_{\Phi'}}\ \sum_{j'=q'-e_{\Phi'}}^{q'+e_{\Phi'}}\left(c_{i',j',\zeta'}^{\,hybrid,\Phi'}\right)^{2}},$$

if node i′ lies outside the range (0, (d′ − r_{Φ′} + 1)) or node j′ lies outside the range (0, (d″ − r_{Φ′} + 1)), then $c_{i',j',\zeta'}^{hybrid,\Phi'}$ is taken as zero, Φ′ = 1, 2, 3, …, Ψ′, ζ′ = 1, 2, 3, …, K′; wherein $h_{p',q',\zeta'}^{hybrid,\Phi'}$ represents the mixed pooling feature of the combined node (p′, q′) of the ζ′-th pooling map of the Φ′-th scale channel;
(9) according to the mixed pooling features, fully connecting the mixed pooling feature vectors of different scales by the method of step (5) to obtain the combined feature matrix of the mixing network

$$H^{hybrid}\in\mathbb{R}^{\,N_{1}\times K'\sum_{\Phi'=1}^{\Psi'}(d'-r_{\Phi'}+1)(d''-r_{\Phi'}+1)},$$

wherein K′ represents the number of different feature maps generated by each scale channel;
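Since the mixing network of steps (7) to (9) repeats the convolution, pooling and flattening of steps (3) to (5) on the d′ × d″ mixed matrix of each sample, a hedged per-sample sketch can reuse the helpers from the earlier sketches; all function names are assumptions.

```python
# Hedged sketch of steps (7)-(9): hybrid-network feature extraction for one sample.
# Reuses make_orthogonal_weights, convolution_features, sqrt_pool and
# sample_row_vector from the earlier sketches.
def hybrid_features(mixed_matrix, scales, K_prime, pool_radii):
    maps = []
    for r_phi, e_phi in zip(scales, pool_radii):
        kernels = make_orthogonal_weights(r_phi, K_prime)
        conv = convolution_features(mixed_matrix, kernels)
        maps.extend(sqrt_pool(c, e_phi) for c in conv)
    return sample_row_vector(maps)   # one row of the H^hybrid matrix
```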
(10) according to the combined feature matrix $H^{hybrid}$ of the mixing network obtained in step (9) and the number of training samples N1, calculating the training-sample output weight β of the neural network using the following formulas, where L denotes the number of columns of $H^{hybrid}$:

if $N_{1}\le L$, then

$$\beta=\left(H^{hybrid}\right)^{T}\left(\frac{I}{C}+H^{hybrid}\left(H^{hybrid}\right)^{T}\right)^{-1}T,$$

if $N_{1}>L$, then

$$\beta=\left(\frac{I}{C}+\left(H^{hybrid}\right)^{T}H^{hybrid}\right)^{-1}\left(H^{hybrid}\right)^{T}T,$$

where T is the label matrix of the training samples, C is a regularization coefficient which may take an arbitrary positive value, and the superscript T denotes matrix transposition;
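A hedged sketch of the regularized output-weight computation of step (10), assuming T is a one-hot label matrix of size N1 × number of classes; the helper name is an assumption.

```python
# Hedged sketch of step (10): regularized least-squares output weights of the
# extreme-learning-machine classifier.
import numpy as np

def output_weights(H_hybrid, T, C=1.0):
    N1, L = H_hybrid.shape
    if N1 <= L:
        G = np.eye(N1) / C + H_hybrid @ H_hybrid.T
        return H_hybrid.T @ np.linalg.solve(G, T)
    G = np.eye(L) / C + H_hybrid.T @ H_hybrid
    return np.linalg.solve(G, H_hybrid.T @ T)
```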
(11) using the orthogonal matrices $\hat{A}^{I,\Phi}$, $\hat{A}^{A,\Phi}$ and $\hat{A}^{S,\Phi}$ obtained after the initial weight orthogonalization of the three modalities in step (3), applying the methods of steps (3) to (9) to the preprocessed data set D2 to be classified, to obtain the combined feature matrix $H^{test}$ of the mixing network of the samples to be classified;
(12) according to the training-sample output weight β of step (10) and the combined feature matrix $H^{test}$ of the mixing network of the samples to be classified of step (11), calculating the prediction labels $\mu_{\varepsilon}$ of the N2 samples to be classified by the following formula, thereby realizing object material classification based on multi-modal fusion deep learning:

$$\mu_{\varepsilon}=H^{test}\beta,\qquad 1\le\varepsilon\le M_{2}.$$
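A hedged sketch of step (12); taking the class as the index of the largest entry of each prediction vector is an assumption about how the label μ is read off.

```python
# Hedged sketch of step (12): predicting material classes for the samples to be classified.
import numpy as np

def predict(H_test, beta):
    mu = H_test @ beta            # one prediction vector per sample
    return np.argmax(mu, axis=1)  # predicted material class indices
```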