CN107463952B - An object material classification method based on multimodal fusion deep learning - Google Patents
- Publication number
- CN107463952B (application CN201710599106.1A)
- Authority
- CN
- China
- Prior art keywords
- tactile
- matrix
- modality
- scale
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F18/00—Pattern recognition
        - G06F18/20—Analysing
          - G06F18/24—Classification techniques
            - G06F18/243—Classification techniques relating to the number of classes
              - G06F18/2431—Multiple classes
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N20/00—Machine learning
Abstract
The invention relates to an object material classification method based on multimodal fusion deep learning, belonging to the technical fields of computer vision, artificial intelligence, and material classification. The method is a multimodal fusion approach built on an extreme learning machine with multi-scale local receptive fields. It fuses the perceptual information of the different modalities of an object's material (visual images, tactile acceleration signals, and tactile sound signals) to arrive at a correct classification of the material. The method not only uses multi-scale local receptive fields to extract highly representative features of real-world complex materials, but also fuses the information of the individual modalities effectively so that the modalities complement one another. The method improves the robustness and accuracy of complex material classification, giving it greater applicability and generality.
Description
Technical Field
The invention relates to an object material classification method based on multimodal fusion deep learning, and belongs to the technical fields of computer vision, artificial intelligence, and material classification.
Background
The world contains a great variety of materials, which can be divided into categories such as plastic, metal, ceramic, glass, wood, textile, stone, paper, rubber, and foam. Recently, object material classification has attracted considerable attention from environmental protection efforts, industry, and academia. For example, material classification can be used effectively in material recycling. The four pillars of packaging materials are paper, plastic, metal, and glass, and different market demands call for packaging of different materials: for long-distance transportation with no special quality requirements, paper, cardboard, and packaging-box board are generally used; food packaging must meet hygiene standards, so ready-to-eat foods such as pastries should be packaged in paperboard cartons, light- and moisture-sensitive goods such as salt should be canned, and fast-food boxes can be made from natural plant fiber; the rational use of decorative materials is likewise key to successful interior decoration. Given these needs, it is necessary to develop a method that can automatically classify the material of objects.
The mainstream approach to object material classification uses visual images, which contain rich information, but visual images alone cannot distinguish two objects with extremely similar appearances. Suppose there are two objects, a piece of rough red paper and a red plastic foil; a visual image has little power to discriminate between them. The human brain, however, instinctively fuses the perceptual features of an object's different modalities and thereby classifies its material. Inspired by this, a computer can likewise use information from an object's different modalities simultaneously to classify its material automatically.
Some techniques for object material classification have been disclosed, such as Chinese patent application CN105005787A, a material classification method based on joint sparse coding of dexterous-hand tactile information. That invention uses only tactile sequences for material classification and does not combine the multiple modalities of a material. It has been observed that classifying object materials from visual images alone cannot robustly capture material properties such as hardness or roughness. When a rigid tool is dragged or moved over the surfaces of different objects, the tool produces vibrations and sounds of different frequencies, so tactile information complementary to vision can be used to classify the material of objects. However, how to combine the visual modality with the tactile modalities effectively remains a challenging problem.
Summary of the Invention
The purpose of the present invention is to propose an object material classification method based on multimodal fusion deep learning, which performs multimodal information fusion on top of an extreme learning machine with multi-scale local receptive fields, so as to improve the robustness and accuracy of classification and to fuse the multiple modalities of an object's material effectively.
The object material classification method based on multimodal fusion deep learning proposed by the present invention comprises the following steps:
(1) Let the number of training samples be N1 and the number of material classes among them be M1, with a label recorded for each material class, where 1 ≤ M1 ≤ N1. Collect the visual images I1, tactile accelerations A1, and tactile sounds S1 of all N1 training samples, and build a dataset D1 comprising I1, A1, and S1; the image size of I1 is 320×480.
Let the number of objects to be classified be N2 and the number of their material classes be M2, with a label recorded for each class of object to be classified, where 1 ≤ M2 ≤ M1. Collect the visual images I2, tactile accelerations A2, and tactile sounds S2 of all N2 objects to be classified, and build a dataset D2 comprising I2, A2, and S2; the image size of I2 is 320×480;
(2) Preprocess the visual images, tactile acceleration signals, and tactile sound signals of datasets D1 and D2 to obtain visual images, tactile acceleration spectrograms, and tactile sound spectrograms respectively, as follows:
(2-1) Downsample the 320×480 images I1 and I2 to obtain visual images of size 32×32×3;
(2-2) Using the short-time Fourier transform (STFT), convert the tactile accelerations A1 and A2 to the frequency domain, with a Hamming window of length 500, a window offset of 100, and a sampling frequency of 10 kHz, obtaining the spectrograms of A1 and A2. From each spectrogram, select the first 500 low-frequency channels as the spectrum image, and downsample it to obtain 32×32×3 tactile acceleration spectrum images for A1 and A2;
(2-3) Likewise using the STFT, convert the tactile sounds S1 and S2 to the frequency domain, with a Hamming window of length 500, a window offset of 100, and a sampling frequency of 10 kHz, obtaining the spectrograms of S1 and S2. From each spectrogram, select the first 500 low-frequency channels as the spectrum image, and downsample it to obtain 32×32×3 sound spectrum images for S1 and S2;
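A minimal sketch of the spectrogram preprocessing in steps (2-2)/(2-3), assuming SciPy; the function name, the magnitude normalization, and the replication to three channels are illustrative assumptions, not specified in the patent:

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import zoom

def tactile_spectrogram_image(signal, fs=10_000, win_len=500, hop=100,
                              n_low=500, out_hw=(32, 32)):
    """Turn a 1-D tactile signal into a 32x32x3 spectrogram image."""
    # Hamming window of length 500 with window offset 100 (overlap 400).
    _, _, Z = stft(signal, fs=fs, window='hamming',
                   nperseg=win_len, noverlap=win_len - hop)
    mag = np.abs(Z)[:n_low, :]          # lowest-frequency channels (clipped to the available bins)
    # Downsample to the 32x32 input size used by the network.
    mag = zoom(mag, (out_hw[0] / mag.shape[0], out_hw[1] / mag.shape[1]), order=1)
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)   # scale to [0, 1]
    return np.repeat(mag[:, :, None], 3, axis=2)                # 32x32x3

acc = np.random.randn(50_000)                  # stand-in tactile acceleration signal
print(tactile_spectrogram_image(acc).shape)    # (32, 32, 3)
```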
(3) Obtain the convolution features of the visual, tactile acceleration, and tactile sound modalities through multi-scale feature mapping, as follows:
(3-1) Input the 32×32×3 visual images of I1 and I2, the 32×32×3 tactile acceleration spectrum images of A1 and A2, and the 32×32×3 sound spectrum images of S1 and S2 obtained in step (2) into the first layer of the neural network, the input layer. The input image size is d×d. The local receptive fields of this network have Ψ scale channels, of sizes r1, r2, …, rΨ, and each scale channel generates K different input weights, so Ψ×K feature maps are generated at random. Denote the randomly generated initial weight matrices of the Φ-th scale channel for the visual image, tactile acceleration spectrogram, and sound spectrogram as $\hat{A}^{I,\mathrm{init}}_{\Phi}$, $\hat{A}^{A,\mathrm{init}}_{\Phi}$, and $\hat{A}^{S,\mathrm{init}}_{\Phi}$, each composed column by column of the vectors $\hat{a}^{\,\cdot,\mathrm{init}}_{\Phi,\zeta}$ that generate the ζ-th feature map. Here the superscript I denotes the visual modality of the training samples and the objects to be classified, A their tactile acceleration modality, and S their tactile sound modality; 1 ≤ Φ ≤ Ψ and 1 ≤ ζ ≤ K. The Φ-th scale local receptive field has size rΦ×rΦ, so $\hat{A}^{\,\cdot,\mathrm{init}}_{\Phi} \in \mathbb{R}^{r_\Phi^2 \times K}$, and the size of all K feature maps of the Φ-th scale channel is (d−rΦ+1)×(d−rΦ+1);
(3-2) Using singular value decomposition, orthogonalize the initial weight matrices $\hat{A}^{I,\mathrm{init}}_{\Phi}$, $\hat{A}^{A,\mathrm{init}}_{\Phi}$, and $\hat{A}^{S,\mathrm{init}}_{\Phi}$ of the Φ-th scale channel to obtain orthogonal matrices $\hat{A}^{I}_{\Phi}$, $\hat{A}^{A}_{\Phi}$, and $\hat{A}^{S}_{\Phi}$, whose columns are orthonormal bases of the corresponding initial matrices. The input weight of the ζ-th feature map of the Φ-th scale channel for each modality, $a^{I}_{\Phi,\zeta}$, $a^{A}_{\Phi,\zeta}$, or $a^{S}_{\Phi,\zeta}$, is the rΦ×rΦ square matrix formed from the ζ-th column of the corresponding orthogonal matrix.

The convolution feature of node (i, j) in the ζ-th feature map of the Φ-th scale channel is computed for each modality m ∈ {I, A, S} as

$$c^{\,m}_{i,j,\zeta,\Phi}(x) = \sum_{u=1}^{r_\Phi}\sum_{v=1}^{r_\Phi} x_{i+u-1,\; j+v-1}\; a^{m}_{\Phi,\zeta}(u,v),$$

Φ = 1, 2, …, Ψ;  i, j = 1, …, (d−rΦ+1);  ζ = 1, 2, …, K,

where $c^{I}_{i,j,\zeta,\Phi}$, $c^{A}_{i,j,\zeta,\Phi}$, and $c^{S}_{i,j,\zeta,\Phi}$ are the convolution features of node (i, j) in the ζ-th feature map of the Φ-th scale channel of the visual, tactile acceleration, and tactile sound modalities respectively, and x is the input patch corresponding to node (i, j);
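A minimal sketch of steps (3-1)/(3-2), assuming NumPy and a single-channel d×d input (the 3-channel case would sum over channels, an assumption); the helper names are illustrative, and the SVD-based orthogonalization follows the standard local-receptive-field ELM construction:

```python
import numpy as np

def random_orthogonal_weights(r, K, rng):
    """K receptive-field filters of size r x r whose flattened columns are orthonormal."""
    A_init = rng.standard_normal((r * r, K))         # random initial weights
    U, _, Vt = np.linalg.svd(A_init, full_matrices=False)
    return (U @ Vt).reshape(r, r, K)                 # orthonormal columns (r*r >= K)

def conv_features(img, filters):
    """Valid convolution of a d x d input with r x r filters -> (d-r+1, d-r+1, K)."""
    d = img.shape[0]
    r, _, K = filters.shape
    out = np.empty((d - r + 1, d - r + 1, K))
    for i in range(d - r + 1):
        for j in range(d - r + 1):
            patch = img[i:i + r, j:j + r]
            out[i, j, :] = np.tensordot(patch, filters, axes=([0, 1], [0, 1]))
    return out
```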
(4) Apply multi-scale square-root pooling to the convolution features of the visual, tactile acceleration, and tactile sound modalities. There are Ψ pooling scales, of sizes e1, e2, …, eΨ; the pooling size eΦ at the Φ-th scale is the distance between the pooling center and its edge. The pooling map has the same size as the feature map, (d−rΦ+1)×(d−rΦ+1). From the convolution features of step (3), the pooling features are computed for each modality m ∈ {I, A, S} as

$$h^{\,m}_{p,q,\zeta,\Phi} = \sqrt{\sum_{i=p-e_\Phi}^{p+e_\Phi}\;\sum_{j=q-e_\Phi}^{q+e_\Phi}\left(c^{\,m}_{i,j,\zeta,\Phi}\right)^{2}},$$

p, q = 1, …, (d−rΦ+1);  Φ = 1, 2, …, Ψ;  ζ = 1, 2, …, K,

where $c^{\,m}_{i,j,\zeta,\Phi}$ is taken to be zero whenever node (i, j) falls outside the range 1, …, (d−rΦ+1), and $h^{I}_{p,q,\zeta,\Phi}$, $h^{A}_{p,q,\zeta,\Phi}$, and $h^{S}_{p,q,\zeta,\Phi}$ are the pooling features of node (p, q) in the ζ-th pooling map of the Φ-th scale channel of the visual, tactile acceleration, and tactile sound modalities respectively;
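A minimal sketch of the square-root pooling in step (4), under the convention above that out-of-range convolution nodes contribute zero:

```python
import numpy as np

def sqrt_pooling(conv, e):
    """conv: (H, W, K) convolution features -> (H, W, K) pooled features."""
    H, W, K = conv.shape
    sq = conv ** 2
    pooled = np.zeros_like(conv)
    for p in range(H):
        for q in range(W):
            i0, i1 = max(0, p - e), min(H, p + e + 1)   # clip the window;
            j0, j1 = max(0, q - e), min(W, q + e + 1)   # nodes outside it are zero
            pooled[p, q, :] = np.sqrt(sq[i0:i1, j0:j1, :].sum(axis=(0, 1)))
    return pooled
```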
(5) From the pooling features above, obtain the fully connected feature vectors of the three modalities, as follows:
(5-1) Concatenate all pooling features of the pooling maps of the visual image, tactile acceleration, and tactile sound modalities of the ω-th training sample from step (4) into row vectors $h^{I}_{\omega}$, $h^{A}_{\omega}$, and $h^{S}_{\omega}$ respectively, where 1 ≤ ω ≤ N1;
(5-2) Traverse the N1 training samples, repeating step (5-1), to obtain the row-vector combinations of the visual image, tactile acceleration, and tactile sound modalities of the N1 training samples:

$$H^{I} = \begin{bmatrix} h^{I}_{1} \\ \vdots \\ h^{I}_{N_1} \end{bmatrix},\qquad H^{A} = \begin{bmatrix} h^{A}_{1} \\ \vdots \\ h^{A}_{N_1} \end{bmatrix},\qquad H^{S} = \begin{bmatrix} h^{S}_{1} \\ \vdots \\ h^{S}_{N_1} \end{bmatrix},$$

where $H^{I}$ is the combined feature-vector matrix of the visual modality, $H^{A}$ the feature matrix of the tactile acceleration modality, and $H^{S}$ the feature-vector matrix of the tactile sound modality;
(6) Fuse the fully connected feature vectors of the three modalities to obtain the multimodal fused hybrid matrix, as follows:
(6-1) Input the row vectors of the visual image, tactile acceleration, and tactile sound modalities of the N1 training samples from step (5) into the mixing layer and combine them into a hybrid matrix $H = [H^{I}, H^{A}, H^{S}]$;
(6-2) Rearrange the hybrid row vector of each sample in the hybrid matrix H of step (6-1) to generate a two-dimensional hybrid matrix after multimodal fusion, of size d′×d″, where d′ is the length of the two-dimensional matrix;
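A minimal sketch of the fusion in step (6), assuming the per-modality feature matrices from step (5); choosing d′ as the largest divisor of the fused length not exceeding its square root is an illustrative assumption, since the patent's value range for d′ is not reproduced here:

```python
import numpy as np

def fuse_and_reshape(h_vis, h_acc, h_snd):
    """h_*: (N, F_m) per-modality features -> (N, d1, d2) hybrid matrices."""
    H = np.hstack([h_vis, h_acc, h_snd])    # hybrid matrix H = [H_I, H_A, H_S]
    n, L = H.shape
    d1 = int(np.sqrt(L))
    while L % d1 != 0:                      # largest divisor of L with d1 <= sqrt(L)
        d1 -= 1
    return H.reshape(n, d1, L // d1)        # one d' x d'' matrix per sample
```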
(7) Input the multimodal fused hybrid matrix from step (6) into the hybrid network layer of the neural network and obtain the multimodal hybrid convolution features through multi-scale feature mapping, as follows:
(7-1) Input the multimodal fused hybrid matrix from step (6-2), of size d′×d″, into the hybrid network. The hybrid network has Ψ′ scale channels, of sizes r1, r2, …, rΨ′, and each scale channel generates K′ different input weights, so Ψ′×K′ hybrid feature maps are generated at random. Denote the randomly generated hybrid initial weight matrix of the Φ′-th scale channel as $\hat{A}^{\mathrm{hybrid},\mathrm{init}}_{\Phi'}$, composed column by column of the vectors $\hat{a}^{\mathrm{hybrid},\mathrm{init}}_{\Phi',\zeta'}$ that generate the ζ′-th hybrid feature map, where the superscript "hybrid" denotes the three-modality fusion, 1 ≤ Φ′ ≤ Ψ′, and 1 ≤ ζ′ ≤ K′. The local receptive field of the Φ′-th scale channel has size rΦ′×rΦ′, so $\hat{A}^{\mathrm{hybrid},\mathrm{init}}_{\Phi'} \in \mathbb{R}^{r_{\Phi'}^{2} \times K'}$, and the size of the ζ′-th feature map of the Φ′-th scale channel is (d′−rΦ′+1)×(d″−rΦ′+1);
(7-2) Using singular value decomposition, orthogonalize the initial weight matrix $\hat{A}^{\mathrm{hybrid},\mathrm{init}}_{\Phi'}$ of the Φ′-th scale channel to obtain an orthogonal matrix $\hat{A}^{\mathrm{hybrid}}_{\Phi'}$, whose columns are an orthonormal basis of $\hat{A}^{\mathrm{hybrid},\mathrm{init}}_{\Phi'}$. The input weight of the ζ′-th feature map of the Φ′-th scale channel, $a^{\mathrm{hybrid}}_{\Phi',\zeta'}$, is the square matrix formed from the ζ′-th column of $\hat{A}^{\mathrm{hybrid}}_{\Phi'}$.

The hybrid convolution feature of node (i′, j′) in the ζ′-th feature map of the Φ′-th scale channel is computed as

$$c^{\,\mathrm{hybrid}}_{i',j',\zeta',\Phi'}(x') = \sum_{u=1}^{r_{\Phi'}}\sum_{v=1}^{r_{\Phi'}} x'_{i'+u-1,\; j'+v-1}\; a^{\mathrm{hybrid}}_{\Phi',\zeta'}(u,v),$$

Φ′ = 1, 2, …, Ψ′;  i′ = 1, …, (d′−rΦ′+1);  j′ = 1, …, (d″−rΦ′+1);  ζ′ = 1, 2, …, K′,

where $c^{\,\mathrm{hybrid}}_{i',j',\zeta',\Phi'}$ is the hybrid convolution feature of node (i′, j′) in the ζ′-th feature map of the Φ′-th scale channel and x′ is the input patch corresponding to node (i′, j′);
(8) Apply hybrid multi-scale square-root pooling to the hybrid convolution features. There are Ψ′ pooling scales, of sizes e1, e2, …, eΨ′, and at the Φ′-th scale the pooling map has the same size as the feature map, (d′−rΦ′+1)×(d″−rΦ′+1). From the hybrid convolution features of step (7), the hybrid pooling features are computed as

$$h^{\,\mathrm{hybrid}}_{p',q',\zeta',\Phi'} = \sqrt{\sum_{i'=p'-e_{\Phi'}}^{p'+e_{\Phi'}}\;\sum_{j'=q'-e_{\Phi'}}^{q'+e_{\Phi'}}\left(c^{\,\mathrm{hybrid}}_{i',j',\zeta',\Phi'}\right)^{2}},$$

p′ = 1, …, (d′−rΦ′+1);  q′ = 1, …, (d″−rΦ′+1);  Φ′ = 1, 2, …, Ψ′;  ζ′ = 1, 2, …, K′,

where $c^{\,\mathrm{hybrid}}_{i',j',\zeta',\Phi'}$ is taken to be zero whenever node (i′, j′) falls outside the feature map, and $h^{\,\mathrm{hybrid}}_{p',q',\zeta',\Phi'}$ is the hybrid pooling feature of node (p′, q′) in the ζ′-th pooling map of the Φ′-th scale channel;
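A minimal sketch of the hybrid stage in steps (7)-(8), reusing the helpers sketched after steps (3) and (4) (an assumption: those definitions are in scope, and the sizes below are stand-ins; the convolution is written inline because the fused matrix need not be square):

```python
import numpy as np

rng = np.random.default_rng(1)
hybrid = rng.standard_normal((30, 32))             # stand-in d' x d'' fused matrix
filters = random_orthogonal_weights(r=5, K=8, rng=rng)
d1, d2 = hybrid.shape
r = filters.shape[0]
conv = np.empty((d1 - r + 1, d2 - r + 1, filters.shape[2]))
for i in range(conv.shape[0]):
    for j in range(conv.shape[1]):                 # valid convolution on d' x d''
        conv[i, j, :] = np.tensordot(hybrid[i:i + r, j:j + r], filters,
                                     axes=([0, 1], [0, 1]))
pooled = sqrt_pooling(conv, e=3)                   # hybrid square-root pooling
h_row = pooled.reshape(1, -1)                      # flattened per-sample feature row
```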
(9) From the hybrid pooling features above, repeat step (5) to fully connect the hybrid pooling feature vectors of the different scales, obtaining the combined feature matrix $H^{\mathrm{hybrid}}$ of the hybrid network, where K′ is the number of different feature maps generated by each scale channel;
(10) From the combined feature matrix $H^{\mathrm{hybrid}}$ of the hybrid network obtained in step (9), compute the output weights β of the neural network according to the number of training samples N1:

if N1 is not greater than the dimension of the combined features, then
$$\beta = \left(H^{\mathrm{hybrid}}\right)^{\mathsf{T}}\left(\frac{I}{C} + H^{\mathrm{hybrid}}\left(H^{\mathrm{hybrid}}\right)^{\mathsf{T}}\right)^{-1} T;$$

otherwise
$$\beta = \left(\frac{I}{C} + \left(H^{\mathrm{hybrid}}\right)^{\mathsf{T}} H^{\mathrm{hybrid}}\right)^{-1}\left(H^{\mathrm{hybrid}}\right)^{\mathsf{T}}\, T,$$

where T is the matrix of expected outputs of the training samples, C is the regularization coefficient, which may take any value (C = 5 in one embodiment of the invention), and the superscript T denotes matrix transposition;
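A minimal sketch of step (10), assuming the two standard regularized least-squares forms above (one-hot target encoding is an assumption):

```python
import numpy as np

def elm_output_weights(H, T, C=5.0):
    """H: (N, L) combined hybrid features; T: (N, M) one-hot target matrix."""
    N, L = H.shape
    if N <= L:   # fewer samples than feature dimensions
        return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
```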
(11) Using the orthogonal matrices $\hat{A}^{I}_{\Phi}$, $\hat{A}^{A}_{\Phi}$, and $\hat{A}^{S}_{\Phi}$ obtained by orthogonalizing the initial weights of the three modalities in step (3), apply the procedure of steps (3) through (9) to the preprocessed dataset D2 to be classified, obtaining the three-modality hybrid feature matrix Htest of the samples to be classified;
(12) From the training output weights β of step (10) and the three-modality hybrid feature matrix Htest of step (11), compute the predicted labels με of the N2 samples to be classified, thereby achieving object material classification based on multimodal fusion deep learning:

$$\mu_{\varepsilon} = H^{\mathrm{test}}\,\beta, \qquad 1 \le \varepsilon \le M_2.$$
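A short usage sketch of the prediction in step (12) (stand-in shapes; decoding the one-hot outputs with argmax is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
H_test = rng.standard_normal((4, 64))     # stand-in test features (N2 x L)
beta = rng.standard_normal((64, 5))       # stand-in output weights (L x M2)
scores = H_test @ beta                    # mu = H_test * beta
pred_labels = np.argmax(scores, axis=1)   # predicted class per sample
```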
The object material classification method based on multimodal fusion deep learning proposed by the present invention has the following features and advantages:
1. The extreme learning machine method based on multi-scale local receptive fields proposed by the present invention can perceive a material through local receptive fields of multiple scales, extract diverse features, and classify the materials of complex objects.
2. The deep learning method of the present invention, an extreme learning machine based on multi-scale local receptive fields, integrates feature learning and image classification instead of extracting features with a hand-designed feature extractor, so the algorithm is suitable for classifying objects of most material types.
3. The method of the present invention is a multimodal fusion deep learning method built on an extreme learning machine with multi-scale local receptive fields; it effectively fuses the information of the three modalities of an object's material so that they complement one another, improving the robustness and accuracy of material classification.
Description of the Drawings
Figure 1 is a flow chart of the method of the present invention.
Figure 2 is a flow chart of the extreme learning machine based on multi-scale local receptive fields used in the method of the present invention.
Figure 3 is a flow chart of the fusion of the different modalities in the extreme-learning-machine method based on multi-scale local receptive fields of the present invention.
Detailed Description
The flow of the object material classification method based on multimodal fusion deep learning proposed by the present invention is shown in Figure 1. The method is organized into four main parts: the visual image modality, the tactile acceleration modality, the tactile sound modality, and the hybrid network.
The embodiment carries out steps (1) through (12) exactly as described above, with the following additional details.

In step (1), the tactile accelerations A1 and A2 are one-dimensional signals collected with a sensor while a rigid object slides over the material surface, and the tactile sounds S1 and S2 are one-dimensional signals recorded with a microphone during the same sliding motion.

In steps (2-2) and (2-3), the spectrum images formed from the first 500 low-frequency channels retain most of the energy of the tactile signals.
In steps (3-2) and (7-2), the orthogonalized input weights extract more complete features than the initial random weights. In step (4), the pooling size eΦ, the distance between the pooling center and its edge, is illustrated in Figure 2.
In step (6-2), the layout of the two-dimensional hybrid matrix after multimodal fusion is illustrated in Figure 3.
In step (11), applying steps (3) through (9) to the preprocessed dataset D2 yields, in turn, the convolution features of the three modalities of the objects to be classified (step (3)), their pooling features (step (4)), their fully connected feature vectors (step (5)), the multimodal fused hybrid matrix (step (6)), the multimodal hybrid convolution features (step (7)), the multimodal hybrid pooling features (step (8)), and finally the three-modality hybrid feature matrix Htest (step (9)).
Claims (1)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710599106.1A (CN107463952B) | 2017-07-21 | 2017-07-21 | An object material classification method based on multimodal fusion deep learning |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710599106.1A (CN107463952B) | 2017-07-21 | 2017-07-21 | An object material classification method based on multimodal fusion deep learning |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN107463952A | 2017-12-12 |
| CN107463952B | 2020-04-03 |
Family
ID=60546004
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710599106.1A (CN107463952B, active) | An object material classification method based on multimodal fusion deep learning | 2017-07-21 | 2017-07-21 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN107463952B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108734210B (en) * | 2018-05-17 | 2021-10-15 | 浙江工业大学 | An object detection method based on cross-modal multi-scale feature fusion |
CN108846375B (en) * | 2018-06-29 | 2019-06-18 | 山东大学 | A multimodal collaborative learning method and device based on neural network |
CN109190638A (en) * | 2018-08-09 | 2019-01-11 | 太原理工大学 | Classification method based on the online order limit learning machine of multiple dimensioned local receptor field |
EP3620978A1 (en) * | 2018-09-07 | 2020-03-11 | Ibeo Automotive Systems GmbH | Method and device for classifying objects |
CN109447124B (en) * | 2018-09-28 | 2019-11-19 | 北京达佳互联信息技术有限公司 | Image classification method, device, electronic equipment and storage medium |
CN109508740B (en) * | 2018-11-09 | 2019-08-13 | 郑州轻工业学院 | Object hardness identification method based on Gaussian mixed noise production confrontation network |
CN109902585B (en) * | 2019-01-29 | 2023-04-07 | 中国民航大学 | Finger three-mode fusion recognition method based on graph model |
CN110020596B (en) * | 2019-02-21 | 2021-04-30 | 北京大学 | Video content positioning method based on feature fusion and cascade learning |
CN110659427A (en) * | 2019-09-06 | 2020-01-07 | 北京百度网讯科技有限公司 | City function division method and device based on multi-source data and electronic equipment |
CN110942060B (en) * | 2019-10-22 | 2023-05-23 | 清华大学 | Material recognition method and device based on laser speckle and mode fusion |
CN110909637A (en) * | 2019-11-08 | 2020-03-24 | 清华大学 | Outdoor mobile robot terrain recognition method based on visual-touch fusion |
CN111028204B (en) * | 2019-11-19 | 2021-10-08 | 清华大学 | A cloth defect detection method based on multi-modal fusion deep learning |
CN110861853B (en) * | 2019-11-29 | 2021-10-19 | 三峡大学 | A smart garbage sorting method combining vision and touch |
CN111590611B (en) * | 2020-05-25 | 2022-12-02 | 北京具身智能科技有限公司 | Article classification and recovery method based on multi-mode active perception |
CN113111902B (en) * | 2021-01-02 | 2024-10-15 | 大连理工大学 | Pavement material identification method based on voice and image multi-mode collaborative learning |
CN112893180A (en) * | 2021-01-20 | 2021-06-04 | 同济大学 | Object touch classification method and system considering friction coefficient abnormal value elimination |
CN113780460A (en) * | 2021-09-18 | 2021-12-10 | 广东人工智能与先进计算研究院 | Material identification method and device, robot, electronic equipment and storage medium |
CN114358084B (en) * | 2022-01-07 | 2025-02-14 | 吉林大学 | A tactile material classification method based on DDQN and generative adversarial network |
CN114723963B (en) * | 2022-04-26 | 2024-06-04 | 东南大学 | Task action and object physical attribute identification method based on visual touch signal |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104715260A (en) * | 2015-03-05 | 2015-06-17 | 中南大学 | Multi-modal fusion image sorting method based on RLS-ELM |
CN105512609A (en) * | 2015-11-25 | 2016-04-20 | 北京工业大学 | Multi-mode fusion video emotion identification method based on kernel-based over-limit learning machine |
CN105956351A (en) * | 2016-07-05 | 2016-09-21 | 上海航天控制技术研究所 | Touch information classified computing and modelling method based on machine learning |
CN106874961A (en) * | 2017-03-03 | 2017-06-20 | 北京奥开信息科技有限公司 | A kind of indoor scene recognition methods using the very fast learning machine based on local receptor field |
WO2017100903A1 (en) * | 2015-12-14 | 2017-06-22 | Motion Metrics International Corp. | Method and apparatus for identifying fragmented material portions within an image |
Non-Patent Citations (3)

| Title |
|---|
| Haitian Zheng et al., "Deep Learning for Surface Material Classification Using Haptic and Visual Information," IEEE Transactions on Multimedia, 2016-11-30, pp. 2407-2416. * |
| Fengxue Li et al., "Multi-Modal Local Receptive Field Extreme Learning Machine for Object Recognition," 2016 International Joint Conference on Neural Networks (IJCNN), 2016-11-03, pp. 1696-1701. * |
| Wei Wei, "Visual feature analysis of 3D models based on neural networks," Computer Engineering and Applications, 2008-07-21, pp. 174-178. * |
Also Published As
Publication number | Publication date |
---|---|
CN107463952A (en) | 2017-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107463952B (en) | An object material classification method based on multimodal fusion deep learning | |
CN108764313B (en) | Supermarket commodity recognition method based on deep learning | |
Dering et al. | A convolutional neural network model for predicting a product's function, given its form | |
CN111639679B (en) | A Few-Sample Learning Method Based on Multi-scale Metric Learning | |
CN107798349B (en) | A transfer learning method based on deep sparse autoencoder | |
CN100487720C (en) | Face comparison device | |
CN104408405B (en) | Face representation and similarity calculating method | |
CN108536780B (en) | Cross-modal object material retrieval method based on tactile texture features | |
CN109559758A (en) | A method of texture image is converted by haptic signal based on deep learning | |
Kim et al. | Label-preserving data augmentation for mobile sensor data | |
Ryumin et al. | Automatic detection and recognition of 3D manual gestures for human-machine interaction | |
CN104504406B (en) | A kind of approximate multiimage matching process rapidly and efficiently | |
CN109447996A (en) | Hand Segmentation in 3-D image | |
CN104867106A (en) | Depth map super-resolution method | |
CN103714331A (en) | Facial expression feature extraction method based on point distribution model | |
Kaur et al. | Scene perception system for visually impaired based on object detection and classification using multimodal deep convolutional neural network | |
CN106462773B (en) | Pattern recognition system and method using GABOR function | |
Barbhuiya et al. | Alexnet-CNN based feature extraction and classification of multiclass ASL hand gestures | |
CN104699781A (en) | Specific absorption rate image retrieval method based on double-layer anchor chart hash | |
Lin et al. | Using CNN to classify hyperspectral data based on spatial-spectral information | |
Zhang et al. | A framework for the fusion of visual and tactile modalities for improving robot perception. | |
CN110705572A (en) | Image recognition method | |
CN107451578A (en) | Deaf-mute's sign language machine translation method based on somatosensory device | |
CN113688864A (en) | Human-object interaction relation classification method based on split attention | |
CN105894048A (en) | Food safety detection method based on mobile phone |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |