
CN107463952B - An object material classification method based on multimodal fusion deep learning - Google Patents

An object material classification method based on multimodal fusion deep learning

Info

Publication number
CN107463952B
CN107463952B (application CN201710599106.1A; published as CN107463952A)
Authority
CN
China
Prior art keywords
tactile
matrix
modality
scale
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710599106.1A
Other languages
Chinese (zh)
Other versions
CN107463952A (en)
Inventor
Liu Huaping (刘华平)
Fang Jing (方静)
Liu Xiaonan (刘晓楠)
Sun Fuchun (孙富春)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201710599106.1A priority Critical patent/CN107463952B/en
Publication of CN107463952A publication Critical patent/CN107463952A/en
Application granted granted Critical
Publication of CN107463952B publication Critical patent/CN107463952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention relates to an object material classification method based on multimodal fusion deep learning, and belongs to the technical fields of computer vision, artificial intelligence and material classification. The method is a multimodal fusion method built on an extreme learning machine with multi-scale local receptive fields. It fuses the perceptual information of the different modalities of an object's material (visual images, tactile acceleration signals and tactile sound signals) and finally classifies the object's material correctly. The method not only uses multi-scale local receptive fields to extract highly representative features of real, complex materials, but also fuses the information of the individual modalities effectively so that the modalities complement one another. The method improves the robustness and accuracy of complex material classification and therefore has greater applicability and generality.

Description

An object material classification method based on multimodal fusion deep learning

Technical Field

The invention relates to an object material classification method based on multimodal fusion deep learning, and belongs to the technical fields of computer vision, artificial intelligence and material classification.

Background Art

The world contains a great variety of materials, which can be divided into plastics, metals, ceramics, glass, wood, textiles, stone, paper, rubber, foam and other classes. Recently, object material classification has attracted great attention from environmental protection, industry and academia. For example, material classification can be used effectively for material recycling. The four pillars of packaging materials are paper, plastic, metal and glass, and different market demands call for packaging of different materials: for long-distance transport with no special requirements on transport quality, paper, paperboard and corrugated box board are generally used; food packaging must meet hygiene standards, so packaging of ready-to-eat foods such as pastries should use carton board, light-proof and moisture-proof goods such as table salt should be canned, and fast-food boxes can be made from natural plant fibres; the rational use of decorative materials is the key to successful interior decoration. Given these needs, it is very necessary to develop a method that can automatically classify the material of objects.

The mainstream approach to object material classification uses visual images, which contain rich information, but two objects with extremely similar appearances cannot be distinguished from visual images alone. Suppose there are two objects, a piece of rough red paper and a red plastic foil: visual images have little power to tell them apart. Faced with such cases, the human brain instinctively fuses the perceptual features of the same object from different modalities in order to classify the object's material. Inspired by this, a computer can likewise be made to classify an object's material automatically by using information from the object's different modalities at the same time.

There are also published techniques for object material classification, such as Chinese patent application CN105005787A, a material classification method based on joint sparse coding of the tactile information of a dexterous hand. That invention uses only tactile sequences for material classification and does not combine the multiple modalities of a material. It has been observed that classifying object materials from visual images alone cannot robustly capture material properties such as hardness or roughness. One can assume that when a rigid tool is dragged or moved over the surfaces of different objects, it produces vibrations and sounds of different frequencies, so tactile information complementary to vision can be used to classify object materials. However, how to combine the visual modality with the tactile modalities effectively remains a challenging problem.

Summary of the Invention

The purpose of the present invention is to propose an object material classification method based on multimodal fusion deep learning, which performs multimodal information fusion for object material classification on the basis of an extreme learning machine with multi-scale local receptive fields, so as to improve the robustness and accuracy of the classification and to fuse the multiple modalities of an object's material effectively for material classification.

The object material classification method based on multimodal fusion deep learning proposed by the present invention comprises the following steps:

(1) Let the number of training samples be N1 and the number of training-sample material classes be M1, with a label recorded for each material class of training samples, where 1 ≤ M1 ≤ N1. Collect the visual images I1, tactile accelerations A1 and tactile sounds S1 of all N1 training samples, and build a dataset D1 containing I1, A1 and S1; the image size of I1 is 320×480.

Let the number of objects to be classified be N2 and the number of material classes of the objects to be classified be M2, with a label recorded for each class of objects to be classified, where 1 ≤ M2 ≤ M1. Collect the visual images I2, tactile accelerations A2 and tactile sounds S2 of all N2 objects to be classified, and build a dataset D2 containing I2, A2 and S2; the image size of I2 is 320×480;

(2) Perform visual-image preprocessing on the visual images, tactile-acceleration preprocessing on the tactile acceleration signals and tactile-sound preprocessing on the tactile sound signals of datasets D1 and D2, obtaining visual images, tactile acceleration spectrograms and tactile sound spectrograms respectively, as follows:

(2-1) Down-sample the 320×480 images I1 and I2 to obtain visual images of size 32×32×3 for I1 and I2;

(2-2) Use the short-time Fourier transform to convert the tactile accelerations A1 and A2 to the frequency domain; the Hamming window length in the short-time Fourier transform is 500, the window offset is 100 and the sampling frequency is 10 kHz, which yields the spectrograms of A1 and A2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image and down-sample it, obtaining tactile acceleration spectrum images of size 32×32×3 for A1 and A2;

(2-3) Use the short-time Fourier transform in the same way to convert the tactile sounds S1 and S2 to the frequency domain (Hamming window length 500, window offset 100, sampling frequency 10 kHz), which yields the spectrograms of S1 and S2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image and down-sample it, obtaining sound spectrum images of size 32×32×3 for S1 and S2;
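For illustration, the preprocessing of step (2) can be sketched as follows. This is a minimal sketch assuming NumPy/SciPy; the function names, the resizing strategy and the replication to three channels are illustrative assumptions, not part of the invention.

```python
# Minimal sketch of the preprocessing in step (2), assuming NumPy/SciPy.
import numpy as np
from scipy.signal import stft
from scipy.ndimage import zoom

def spectrogram_image(signal_1d, fs=10_000, win_len=500, hop=100, out_hw=(32, 32)):
    """STFT with a Hamming window (length 500, offset 100), keep at most the
    first 500 low-frequency channels, then down-sample the magnitude spectrogram."""
    _, _, Z = stft(signal_1d, fs=fs, window="hamming",
                   nperseg=win_len, noverlap=win_len - hop)
    mag = np.abs(Z)[:500, :]                         # low-frequency channels only
    small = zoom(mag, (out_hw[0] / mag.shape[0], out_hw[1] / mag.shape[1]))
    return np.repeat(small[:, :, None], 3, axis=2)   # replicate to 32x32x3 (assumption)

def downsample_visual(img, out_hw=(32, 32)):
    """Down-sample a 320x480x3 visual image to 32x32x3."""
    return zoom(img, (out_hw[0] / img.shape[0], out_hw[1] / img.shape[1], 1))
```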

(3) Obtain the convolutional features of the visual modality, the tactile acceleration modality and the tactile sound modality through multi-scale feature mapping, as follows:

(3-1) Input the 32×32×3 visual images of I1 and I2, the 32×32×3 tactile acceleration spectrum images of A1 and A2 and the 32×32×3 sound spectrum images of S1 and S2 obtained in step (2) into the first layer of the neural network, i.e. the input layer; the size of the input image is d×d. The local receptive fields in this network have Ψ scale channels, of sizes r1, r2, …, rΨ; each scale channel generates K different input weights, so that Ψ×K feature maps are generated at random. Denote the randomly generated initial weights of the Φth scale channel for the visual image, the tactile acceleration spectrogram and the sound spectrogram as Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ, formed column by column from â^{I,init}_{Φ,ζ}, â^{A,init}_{Φ,ζ} and â^{S,init}_{Φ,ζ}, where the superscript I denotes the visual modality of the training samples and of the objects to be classified, A denotes their tactile acceleration modality and S their tactile sound modality, Â^{init} denotes an initial weight matrix and â^{init}_ζ the initial weight generating the ζth feature map, with 1 ≤ Φ ≤ Ψ and 1 ≤ ζ ≤ K. The local receptive field of the Φth scale has size rΦ×rΦ, so that

Â^{init}_Φ ∈ R^{rΦ²×K},  â^{init}_{Φ,ζ} ∈ R^{rΦ²},

and all K feature maps of the Φth scale channel have size (d−rΦ+1)×(d−rΦ+1);

(3-2) Use singular value decomposition to orthogonalize the initial weight matrices Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ of the Φth scale channel, obtaining the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ. Each column â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} of Â^I_Φ, Â^A_Φ and Â^S_Φ is an orthogonal basis vector of Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ respectively. The input weights a^I_{Φ,ζ}, a^A_{Φ,ζ} and a^S_{Φ,ζ} of the ζth feature map of the Φth scale channel are the rΦ×rΦ square matrices formed from â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} respectively.

The convolutional features of node (i, j) in the ζth feature map of the Φth scale channel are computed for the visual, tactile acceleration and tactile sound modalities as

c^I_{i,j,ζ,Φ}(x) = Σ_{m=1}^{rΦ} Σ_{n=1}^{rΦ} x_{i+m−1,j+n−1} · a^I_{Φ,ζ}(m,n),  and analogously for the A and S modalities,

Φ = 1, 2, 3, …, Ψ,
i, j = 1, …, (d−rΦ+1),
ζ = 1, 2, 3, …, K,

where c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} denote the convolutional features of node (i, j) of the ζth feature map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities, and x is the matrix corresponding to node (i, j);
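A minimal sketch of steps (3-1)/(3-2) for one modality and one scale channel is given below, assuming NumPy. It generates random initial weights, orthogonalizes them (one common way of realising the SVD step), and performs the valid convolution on a single-channel d×d input; the function names and the single-channel simplification are assumptions.

```python
# Minimal sketch of steps (3-1)/(3-2): random weights, SVD orthogonalization,
# valid convolution of a single-channel d x d input. NumPy only; illustrative.
import numpy as np

def orthogonal_input_weights(r, K, rng=None):
    """Random r*r x K initial weights, replaced by orthonormal columns spanning
    the same space (assumes r*r >= K)."""
    rng = rng or np.random.default_rng(0)
    A_init = rng.standard_normal((r * r, K))
    U, _, Vt = np.linalg.svd(A_init, full_matrices=False)
    return U @ Vt

def convolutional_feature_maps(X, A_orth, r):
    """One (d-r+1) x (d-r+1) feature map per column of A_orth (valid convolution)."""
    d = X.shape[0]
    K = A_orth.shape[1]
    C = np.empty((d - r + 1, d - r + 1, K))
    for zeta in range(K):
        a = A_orth[:, zeta].reshape(r, r)          # square input weight of map zeta
        for i in range(d - r + 1):
            for j in range(d - r + 1):
                C[i, j, zeta] = np.sum(X[i:i + r, j:j + r] * a)
    return C
```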

(4) Perform multi-scale square-root pooling on the convolutional features of the visual, tactile acceleration and tactile sound modalities. There are Ψ pooling scales, of sizes e1, e2, …, eΨ; at the Φth scale the pooling size eΦ is the distance between the pooling centre and the edge. The pooling map has the same size as the feature map, (d−rΦ+1)×(d−rΦ+1). From the convolutional features obtained in step (3), the pooling features are computed as

h^I_{p,q,ζ,Φ} = sqrt( Σ_{i=p−eΦ}^{p+eΦ} Σ_{j=q−eΦ}^{q+eΦ} (c^I_{i,j,ζ,Φ})² ),  and analogously for the A and S modalities,

p, q = 1, …, (d−rΦ+1);

if node (i, j) lies outside the range 1, …, (d−rΦ+1), then c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} are all taken to be zero,

Φ = 1, 2, 3, …, Ψ,
ζ = 1, 2, 3, …, K,

where h^I_{p,q,ζ,Φ}, h^A_{p,q,ζ,Φ} and h^S_{p,q,ζ,Φ} denote the pooling features of node (p, q) of the ζth pooling map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities respectively;
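The square-root pooling of step (4) can be sketched as follows, assuming NumPy; out-of-range convolutional values are treated as zero, as in the text above.

```python
# Minimal sketch of the square-root pooling in step (4), assuming NumPy.
import numpy as np

def sqrt_pooling(C, e):
    """C: (d-r+1, d-r+1, K) convolutional features; e: pooling size (centre-to-edge
    distance). Returns pooling maps of the same size as the feature maps."""
    size = C.shape[0]
    H = np.zeros_like(C)
    for zeta in range(C.shape[2]):
        for p in range(size):
            for q in range(size):
                i0, i1 = max(0, p - e), min(size, p + e + 1)
                j0, j1 = max(0, q - e), min(size, q + e + 1)
                H[p, q, zeta] = np.sqrt(np.sum(C[i0:i1, j0:j1, zeta] ** 2))
    return H
```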

(5) From the pooling features above, obtain the fully connected feature vectors of the three modalities, as follows:

(5-1) Concatenate all pooling features of the pooling maps of the visual image modality, the tactile acceleration modality and the tactile sound modality of the ωth training sample from step (4) into the row vectors h^I_ω, h^A_ω and h^S_ω respectively, where 1 ≤ ω ≤ N1;

(5-2) Traverse the N1 training samples, repeating step (5-1), to obtain the row-vector combinations of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples, denoted

H^I = [h^I_1; …; h^I_{N1}],  H^A = [h^A_1; …; h^A_{N1}],  H^S = [h^S_1; …; h^S_{N1}],

where H^I is the combined feature-vector matrix of the visual modality, H^A the feature matrix of the tactile acceleration modality and H^S the feature-vector matrix of the tactile sound modality;

(6) Perform multimodal fusion of the fully connected feature vectors of the three modalities to obtain the fused mixing matrix, as follows:

(6-1) Input the row vectors of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples from step (5) into the mixing layer and combine them, obtaining a mixing matrix H = [H^I, H^A, H^S];

(6-2) Rearrange the mixed row vector of each sample in the mixing matrix H of step (6-1) to generate a two-dimensional fused mixing matrix of size d′×d″, where d′ is the length of the two-dimensional matrix and is chosen so that d′×d″ equals the length of each sample's mixed row vector;
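Steps (5) and (6) can be sketched together as follows, assuming NumPy: each modality's pooling maps are flattened into one row per sample, the three modalities are concatenated, and each fused row is reshaped into a d′×d″ hybrid matrix. The shapes and names are assumptions for illustration.

```python
# Minimal sketch of steps (5)-(6): full connection and multimodal fusion. NumPy only.
import numpy as np

def fully_connect(pool_maps_per_sample):
    """pool_maps_per_sample: list (over samples) of lists of (h, w, K) pooling maps
    for one modality. Returns an N1 x L matrix, one row vector per sample."""
    return np.stack([np.concatenate([P.ravel() for P in maps])
                     for maps in pool_maps_per_sample])

def fuse_and_reshape(H_I, H_A, H_S, d_prime):
    """H = [H^I, H^A, H^S]; reshape each fused row into a d' x d'' matrix,
    with d' * d'' equal to the fused row length."""
    H = np.concatenate([H_I, H_A, H_S], axis=1)
    if H.shape[1] % d_prime:
        raise ValueError("d' must divide the length of the fused row vector")
    return H.reshape(H.shape[0], d_prime, H.shape[1] // d_prime)
```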

(7) Input the fused mixing matrix obtained in step (6) into the hybrid network layer of the neural network and obtain the multimodal hybrid convolutional features through multi-scale feature mapping, as follows:

(7-1) Input the fused mixing matrix of step (6-2), of size d′×d″, into the hybrid network. The hybrid network has Ψ′ scale channels, of sizes r1, r2, …, rΨ′; each scale channel generates K′ different input weights, so that Ψ′×K′ hybrid feature maps are generated at random. Denote the randomly generated hybrid initial weight of the Φ′th scale channel as Â^{hybrid,init}_{Φ′}, formed column by column from â^{hybrid,init}_{Φ′,ζ′}, where the superscript hybrid denotes the fusion of the three modalities, Â^{hybrid,init} denotes the initial weight of the hybrid network and â^{hybrid,init}_{ζ′} the initial weight generating the ζ′th hybrid feature map, with 1 ≤ Φ′ ≤ Ψ′ and 1 ≤ ζ′ ≤ K′. The local receptive field of the Φ′th scale channel has size rΦ′×rΦ′, so that Â^{hybrid,init}_{Φ′} ∈ R^{rΦ′²×K′} and â^{hybrid,init}_{Φ′,ζ′} ∈ R^{rΦ′²};

the ζ′th feature map of the Φ′th scale channel then has size (d′−rΦ′+1)×(d″−rΦ′+1);

(7-2) Use singular value decomposition to orthogonalize the initial weight matrix Â^{hybrid,init}_{Φ′} of the Φ′th scale channel, obtaining the orthogonal matrix Â^{hybrid}_{Φ′}; each column â^{hybrid}_{Φ′,ζ′} of Â^{hybrid}_{Φ′} is an orthogonal basis vector of Â^{hybrid,init}_{Φ′}. The input weight a^{hybrid}_{Φ′,ζ′} of the ζ′th feature map of the Φ′th scale channel is the rΦ′×rΦ′ square matrix formed from â^{hybrid}_{Φ′,ζ′}.

The hybrid convolutional feature of convolution node (i′, j′) in the ζ′th feature map of the Φ′th scale channel is computed as

c^{hybrid}_{i′,j′,ζ′,Φ′}(x′) = Σ_{m=1}^{rΦ′} Σ_{n=1}^{rΦ′} x′_{i′+m−1,j′+n−1} · a^{hybrid}_{Φ′,ζ′}(m,n),

Φ′ = 1, 2, 3, …, Ψ′,
i′ = 1, …, (d′−rΦ′+1), j′ = 1, …, (d″−rΦ′+1),
ζ′ = 1, 2, 3, …, K′,

where c^{hybrid}_{i′,j′,ζ′,Φ′} is the hybrid convolutional feature of node (i′, j′) in the ζ′th feature map of the Φ′th scale channel and x′ is the matrix corresponding to node (i′, j′);

(8) Perform hybrid multi-scale square-root pooling on the hybrid convolutional features above. There are Ψ′ pooling scales, of sizes e1, e2, …, eΨ′; at the Φ′th scale the pooling map has the same size as the feature map, (d′−rΦ′+1)×(d″−rΦ′+1). From the hybrid convolutional features obtained in step (7), the hybrid pooling features are computed as

h^{hybrid}_{p′,q′,ζ′,Φ′} = sqrt( Σ_{i′=p′−eΦ′}^{p′+eΦ′} Σ_{j′=q′−eΦ′}^{q′+eΦ′} (c^{hybrid}_{i′,j′,ζ′,Φ′})² ),

p′ = 1, …, (d′−rΦ′+1), q′ = 1, …, (d″−rΦ′+1);

if node (i′, j′) lies outside the feature map, then c^{hybrid}_{i′,j′,ζ′,Φ′} is taken to be zero,

Φ′ = 1, 2, 3, …, Ψ′,
ζ′ = 1, 2, 3, …, K′,

where h^{hybrid}_{p′,q′,ζ′,Φ′} denotes the hybrid pooling feature of the combined node (p′, q′) of the ζ′th pooling map of the Φ′th scale channel;

(9) From the hybrid pooling features above, repeat step (5) to fully connect the hybrid pooling feature vectors of the different scales, obtaining the combined feature matrix H^{hybrid} of the hybrid network, where K′ is the number of different feature maps generated by each scale channel;

(10) From the combined feature matrix H^{hybrid} of the hybrid network obtained in step (9), compute the output weights β of the neural network for the training samples according to the number of training samples N1:

if N1 is not larger than the dimension of the combined feature vector,  β = (H^{hybrid})ᵀ (I/C + H^{hybrid}(H^{hybrid})ᵀ)⁻¹ T;

otherwise,  β = (I/C + (H^{hybrid})ᵀ H^{hybrid})⁻¹ (H^{hybrid})ᵀ T,

where T is the matrix of expected (target) values of the training-sample labels, C is the regularization coefficient, which may take any value (in one embodiment of the invention C = 5), I is the identity matrix, and the superscript T denotes matrix transposition;
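A minimal sketch of step (10) is given below, assuming NumPy; it uses the two standard regularized closed forms of the extreme learning machine, selected by comparing N1 with the combined feature dimension, and assumes one-hot targets in T (an assumption consistent with ELM classifiers, not stated explicitly in the text).

```python
# Minimal sketch of step (10): regularized ELM output weights. NumPy only.
import numpy as np

def elm_output_weights(H_hybrid, T, C=5.0):
    """H_hybrid: N1 x L combined feature matrix; T: N1 x M1 target matrix."""
    N1, L = H_hybrid.shape
    if N1 <= L:
        # beta = H^T (I/C + H H^T)^(-1) T
        return H_hybrid.T @ np.linalg.solve(np.eye(N1) / C + H_hybrid @ H_hybrid.T, T)
    # beta = (I/C + H^T H)^(-1) H^T T
    return np.linalg.solve(np.eye(L) / C + H_hybrid.T @ H_hybrid, H_hybrid.T @ T)
```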

(11) Using the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ obtained by orthogonalizing the initial weights of the three modalities in step (3), apply the procedure of steps (3)–(9) to the preprocessed dataset D2 to be classified, obtaining the three-modality hybrid feature vector H_test of the samples to be classified;

(12) From the training-sample output weights β of step (10) and the three-modality hybrid feature vector H_test of step (11), compute the predicted labels με of the N2 samples to be classified as

με = H_test β,  1 ≤ ε ≤ M,

thereby realizing object material classification based on multimodal fusion deep learning.
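Steps (11)–(12) can be sketched as follows, assuming NumPy; taking the argmax over the output columns to obtain the class index is an assumption consistent with one-hot ELM targets, not an explicit statement of the text.

```python
# Minimal sketch of steps (11)-(12): score and classify the test samples. NumPy only.
import numpy as np

def predict_material(H_test, beta):
    """H_test: N2 x L hybrid feature matrix of the objects to be classified."""
    mu = H_test @ beta              # mu = H_test * beta, one row of scores per object
    return np.argmax(mu, axis=1)    # index of the predicted material class
```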

The object material classification method based on multimodal fusion deep learning proposed by the present invention has the following characteristics and advantages:

1. The extreme learning machine method based on multi-scale local receptive fields proposed by the present invention can perceive a material with local receptive fields of several scales, extract diverse features, and thereby classify complex object materials.

2. The deep learning method of the present invention, an extreme learning machine with multi-scale local receptive fields, integrates feature learning and image classification instead of relying on hand-designed feature extractors, so the algorithm is applicable to the classification of objects of most different materials.

3. The method of the present invention is a multimodal fusion deep learning method built on an extreme learning machine with multi-scale local receptive fields; it can effectively fuse the information of the three modalities of an object's material so that the modalities complement one another, improving the robustness and accuracy of material classification.

Brief Description of the Drawings

Figure 1 is a flow chart of the method of the present invention.

Figure 2 is a flow chart of the extreme learning machine based on multi-scale local receptive fields used in the method of the present invention.

Figure 3 is a flow chart of the fusion of the different modalities in the multi-scale local receptive field extreme learning machine of the present invention.

Detailed Description of the Embodiments

The flow chart of the object material classification method based on multimodal fusion deep learning proposed by the present invention is shown in Figure 1. The method consists of four main parts: the visual image modality, the tactile acceleration modality, the tactile sound modality and the hybrid network. It comprises the following steps:

(1) Let the number of training samples be N1 and the number of training-sample material classes be M1, with a label recorded for each material class of training samples, where 1 ≤ M1 ≤ N1. Collect the visual images I1, tactile accelerations A1 and tactile sounds S1 of all N1 training samples, and build a dataset D1 containing I1, A1 and S1; the image size of I1 is 320×480.

Let the number of objects to be classified be N2 and the number of material classes of the objects to be classified be M2, with a label recorded for each class of objects to be classified, where 1 ≤ M2 ≤ M1. Collect the visual images I2, tactile accelerations A2 and tactile sounds S2 of all N2 objects to be classified, and build a dataset D2 containing I2, A2 and S2; the image size of I2 is 320×480. The tactile accelerations A1 and A2 are one-dimensional signals collected with a sensor while a rigid object slides over the material surface; the tactile sounds S1 and S2 are likewise one-dimensional signals recorded with a microphone while a rigid object slides over the material surface;

(2) Perform visual-image preprocessing on the visual images, tactile-acceleration preprocessing on the tactile acceleration signals and tactile-sound preprocessing on the tactile sound signals of datasets D1 and D2, obtaining visual images, tactile acceleration spectrograms and tactile sound spectrograms respectively, as follows:

(2-1) Down-sample the 320×480 images I1 and I2 to obtain visual images of size 32×32×3 for I1 and I2;

(2-2) Use the short-time Fourier transform to convert the tactile accelerations A1 and A2 to the frequency domain; the Hamming window length in the short-time Fourier transform is 500, the window offset is 100 and the sampling frequency is 10 kHz, which yields the spectrograms of A1 and A2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image; this spectrum image retains most of the energy of the tactile signal. Down-sample it to obtain tactile acceleration spectrum images of size 32×32×3 for A1 and A2;

(2-3) Use the short-time Fourier transform in the same way to convert the tactile sounds S1 and S2 to the frequency domain (Hamming window length 500, window offset 100, sampling frequency 10 kHz), which yields the spectrograms of S1 and S2. Select the first 500 low-frequency channels of each spectrogram as a spectrum image, which retains most of the energy of the tactile signal, and down-sample it to obtain sound spectrum images of size 32×32×3 for S1 and S2;

(3) Obtain the convolutional features of the visual modality, the tactile acceleration modality and the tactile sound modality through multi-scale feature mapping, as follows:

(3-1) Input the 32×32×3 visual images of I1 and I2, the 32×32×3 tactile acceleration spectrum images of A1 and A2 and the 32×32×3 sound spectrum images of S1 and S2 obtained in step (2) into the first layer of the neural network, i.e. the input layer; the size of the input image is d×d. The local receptive fields in this network have Ψ scale channels, of sizes r1, r2, …, rΨ; each scale channel generates K different input weights, so that Ψ×K feature maps are generated at random. Denote the randomly generated initial weights of the Φth scale channel for the visual image, the tactile acceleration spectrogram and the sound spectrogram as Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ, formed column by column from â^{I,init}_{Φ,ζ}, â^{A,init}_{Φ,ζ} and â^{S,init}_{Φ,ζ}, where the superscript I denotes the visual modality of the training samples and of the objects to be classified, A denotes their tactile acceleration modality and S their tactile sound modality, Â^{init} denotes an initial weight matrix and â^{init}_ζ the initial weight generating the ζth feature map, with 1 ≤ Φ ≤ Ψ and 1 ≤ ζ ≤ K. The local receptive field of the Φth scale has size rΦ×rΦ, so that

Â^{init}_Φ ∈ R^{rΦ²×K},  â^{init}_{Φ,ζ} ∈ R^{rΦ²},

and all K feature maps of the Φth scale channel have size (d−rΦ+1)×(d−rΦ+1);

(3-2) Use singular value decomposition to orthogonalize the initial weight matrices Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ of the Φth scale channel, obtaining the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ; orthogonalized input weights can extract more complete features. Each column â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} of Â^I_Φ, Â^A_Φ and Â^S_Φ is an orthogonal basis vector of Â^{I,init}_Φ, Â^{A,init}_Φ and Â^{S,init}_Φ respectively. The input weights a^I_{Φ,ζ}, a^A_{Φ,ζ} and a^S_{Φ,ζ} of the ζth feature map of the Φth scale channel are the rΦ×rΦ square matrices formed from â^I_{Φ,ζ}, â^A_{Φ,ζ} and â^S_{Φ,ζ} respectively.

The convolutional features of node (i, j) in the ζth feature map of the Φth scale channel are computed for the visual, tactile acceleration and tactile sound modalities as

c^I_{i,j,ζ,Φ}(x) = Σ_{m=1}^{rΦ} Σ_{n=1}^{rΦ} x_{i+m−1,j+n−1} · a^I_{Φ,ζ}(m,n),  and analogously for the A and S modalities,

Φ = 1, 2, 3, …, Ψ,
i, j = 1, …, (d−rΦ+1),
ζ = 1, 2, 3, …, K,

where c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} denote the convolutional features of node (i, j) of the ζth feature map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities, and x is the matrix corresponding to node (i, j);

(4) Perform multi-scale square-root pooling on the convolutional features of the visual, tactile acceleration and tactile sound modalities. There are Ψ pooling scales, of sizes e1, e2, …, eΨ; at the Φth scale the pooling size eΦ is the distance between the pooling centre and the edge, as shown in Figure 2. The pooling map has the same size as the feature map, (d−rΦ+1)×(d−rΦ+1). From the convolutional features obtained in step (3), the pooling features are computed as

h^I_{p,q,ζ,Φ} = sqrt( Σ_{i=p−eΦ}^{p+eΦ} Σ_{j=q−eΦ}^{q+eΦ} (c^I_{i,j,ζ,Φ})² ),  and analogously for the A and S modalities,

p, q = 1, …, (d−rΦ+1);

if node (i, j) lies outside the range 1, …, (d−rΦ+1), then c^I_{i,j,ζ,Φ}, c^A_{i,j,ζ,Φ} and c^S_{i,j,ζ,Φ} are all taken to be zero,

Φ = 1, 2, 3, …, Ψ,
ζ = 1, 2, 3, …, K,

where h^I_{p,q,ζ,Φ}, h^A_{p,q,ζ,Φ} and h^S_{p,q,ζ,Φ} denote the pooling features of node (p, q) of the ζth pooling map in the Φth scale channel for the visual, tactile acceleration and tactile sound modalities respectively;

(5) From the pooling features above, obtain the fully connected feature vectors of the three modalities, as follows:

(5-1) Concatenate all pooling features of the pooling maps of the visual image modality, the tactile acceleration modality and the tactile sound modality of the ωth training sample from step (4) into the row vectors h^I_ω, h^A_ω and h^S_ω respectively, where 1 ≤ ω ≤ N1;

(5-2) Traverse the N1 training samples, repeating step (5-1), to obtain the row-vector combinations of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples, denoted

H^I = [h^I_1; …; h^I_{N1}],  H^A = [h^A_1; …; h^A_{N1}],  H^S = [h^S_1; …; h^S_{N1}],

where H^I is the combined feature-vector matrix of the visual modality, H^A the feature matrix of the tactile acceleration modality and H^S the feature-vector matrix of the tactile sound modality;

(6) Perform multimodal fusion of the fully connected feature vectors of the three modalities to obtain the fused mixing matrix, as follows:

(6-1) Input the row vectors of the visual image, tactile acceleration and tactile sound modalities of the N1 training samples from step (5) into the mixing layer and combine them, obtaining a mixing matrix H = [H^I, H^A, H^S];

(6-2) Rearrange the mixed row vector of each sample in the mixing matrix H of step (6-1) to generate a two-dimensional fused mixing matrix of size d′×d″, as shown in Figure 3, where d′ is the length of the two-dimensional matrix and is chosen so that d′×d″ equals the length of each sample's mixed row vector;

(7) Input the fused mixing matrix obtained in step (6) into the hybrid network layer of the neural network and obtain the multimodal hybrid convolutional features through multi-scale feature mapping, as follows:

(7-1) Input the fused mixing matrix of step (6-2), of size d′×d″, into the hybrid network. The hybrid network has Ψ′ scale channels, of sizes r1, r2, …, rΨ′; each scale channel generates K′ different input weights, so that Ψ′×K′ hybrid feature maps are generated at random. Denote the randomly generated hybrid initial weight of the Φ′th scale channel as Â^{hybrid,init}_{Φ′}, formed column by column from â^{hybrid,init}_{Φ′,ζ′}, where the superscript hybrid denotes the fusion of the three modalities, Â^{hybrid,init} denotes the initial weight of the hybrid network and â^{hybrid,init}_{ζ′} the initial weight generating the ζ′th hybrid feature map, with 1 ≤ Φ′ ≤ Ψ′ and 1 ≤ ζ′ ≤ K′. The local receptive field of the Φ′th scale channel has size rΦ′×rΦ′, so that Â^{hybrid,init}_{Φ′} ∈ R^{rΦ′²×K′} and â^{hybrid,init}_{Φ′,ζ′} ∈ R^{rΦ′²};

the ζ′th feature map of the Φ′th scale channel then has size (d′−rΦ′+1)×(d″−rΦ′+1);

(7-2) Use singular value decomposition to orthogonalize the initial weight matrix Â^{hybrid,init}_{Φ′} of the Φ′th scale channel, obtaining the orthogonal matrix Â^{hybrid}_{Φ′}; orthogonalized input weights can extract more complete features. Each column â^{hybrid}_{Φ′,ζ′} of Â^{hybrid}_{Φ′} is an orthogonal basis vector of Â^{hybrid,init}_{Φ′}. The input weight a^{hybrid}_{Φ′,ζ′} of the ζ′th feature map of the Φ′th scale channel is the rΦ′×rΦ′ square matrix formed from â^{hybrid}_{Φ′,ζ′}.

The hybrid convolutional feature of convolution node (i′, j′) in the ζ′th feature map of the Φ′th scale channel is computed as

c^{hybrid}_{i′,j′,ζ′,Φ′}(x′) = Σ_{m=1}^{rΦ′} Σ_{n=1}^{rΦ′} x′_{i′+m−1,j′+n−1} · a^{hybrid}_{Φ′,ζ′}(m,n),

Φ′ = 1, 2, 3, …, Ψ′,
i′ = 1, …, (d′−rΦ′+1), j′ = 1, …, (d″−rΦ′+1),
ζ′ = 1, 2, 3, …, K′,

where c^{hybrid}_{i′,j′,ζ′,Φ′} is the hybrid convolutional feature of node (i′, j′) in the ζ′th feature map of the Φ′th scale channel and x′ is the matrix corresponding to node (i′, j′);

(8) Perform hybrid multi-scale square-root pooling on the hybrid convolutional features above. There are Ψ′ pooling scales, of sizes e1, e2, …, eΨ′; at the Φ′th scale the pooling map has the same size as the feature map, (d′−rΦ′+1)×(d″−rΦ′+1). From the hybrid convolutional features obtained in step (7), the hybrid pooling features are computed as

h^{hybrid}_{p′,q′,ζ′,Φ′} = sqrt( Σ_{i′=p′−eΦ′}^{p′+eΦ′} Σ_{j′=q′−eΦ′}^{q′+eΦ′} (c^{hybrid}_{i′,j′,ζ′,Φ′})² ),

p′ = 1, …, (d′−rΦ′+1), q′ = 1, …, (d″−rΦ′+1);

if node (i′, j′) lies outside the feature map, then c^{hybrid}_{i′,j′,ζ′,Φ′} is taken to be zero,

Φ′ = 1, 2, 3, …, Ψ′,
ζ′ = 1, 2, 3, …, K′,

where h^{hybrid}_{p′,q′,ζ′,Φ′} denotes the hybrid pooling feature of the combined node (p′, q′) of the ζ′th pooling map of the Φ′th scale channel;

(9) From the hybrid pooling features above, repeat step (5) to fully connect the hybrid pooling feature vectors of the different scales, obtaining the combined feature matrix H^{hybrid} of the hybrid network, where K′ is the number of different feature maps generated by each scale channel;

(10) From the combined feature matrix H^{hybrid} of the hybrid network obtained in step (9), compute the output weights β of the neural network for the training samples according to the number of training samples N1:

if N1 is not larger than the dimension of the combined feature vector,  β = (H^{hybrid})ᵀ (I/C + H^{hybrid}(H^{hybrid})ᵀ)⁻¹ T;

otherwise,  β = (I/C + (H^{hybrid})ᵀ H^{hybrid})⁻¹ (H^{hybrid})ᵀ T,

where T is the matrix of expected (target) values of the training-sample labels, C is the regularization coefficient, which may take any value (in one embodiment of the invention C = 5), I is the identity matrix, and the superscript T denotes matrix transposition;

(11) Using the orthogonal matrices Â^I_Φ, Â^A_Φ and Â^S_Φ obtained by orthogonalizing the initial weights of the three modalities in step (3), obtain the three-modality hybrid feature vector H_test of the samples to be classified from the preprocessed dataset D2: with step (3), the convolutional feature vectors of the three modalities of the objects to be classified are obtained; with step (4), their pooling feature vectors of the three modalities; with step (5), their fully connected feature vectors of the three modalities; with step (6), their fused mixing matrices after multimodal fusion; with step (7), their multimodal hybrid convolutional features; with step (8), their multimodal hybrid pooling features; and with step (9), the three-modality hybrid feature vector H_test of the objects to be classified.

(12) From the training-sample output weights β of step (10) and the three-modality hybrid feature vector H_test of step (11), compute the predicted labels με of the N2 samples to be classified as

με = H_test β,  1 ≤ ε ≤ M,

thereby realizing object material classification based on multimodal fusion deep learning.

Claims (1)

1. an object material classification method based on multi-mode fusion deep learning is characterized by comprising the following steps:
(1) let the number of training samples be N1The training sample material type is M1Each class of material training sample is marked with a label of
Figure FDA0002241220340000011
Wherein 1 is less than or equal to M1≤N1Separately collecting all N1Visual image I of a training sample1Tactile acceleration A1And a tactile sound S1Establishing an inclusion I1、A1And S1Data set D of1,I1The image size of (2) is 320 × 480;
setting the number of objects to be classified as N2The kind of the material of the object to be classified is M2Each class of object to be classified is labeled as
Figure FDA0002241220340000012
Wherein 1 is less than or equal to M2≤M1Separately collecting all N2Visual image I of an object to be classified2Tactile acceleration A2And a tactile sound S2Establishing an inclusion I2、A2And S2Data set D of2,I2The image size of (2) is 320 × 480;
(2) for the above data set D1And a data set D2The method comprises the following steps of carrying out visual image preprocessing on a visual image, carrying out tactile acceleration preprocessing on a tactile acceleration signal and carrying out tactile sound preprocessing on a tactile sound signal to respectively obtain a visual image, a tactile acceleration spectrogram and a tactile sound spectrogram, and comprises the following steps:
(2-1) image I with image size of 320X 480 by using down-sampling method1And image I2Down-sampling to obtain I1And I2A visual image of size 32 × 32 × 3;
(2-2) separately converting the tactile acceleration A into a plurality of tactile accelerations A by short-time Fourier transform1And tactile acceleration A2Converting to frequency domain, the window length of Hamming window in short-time Fourier transform is 500, the window offset is 100, the sampling frequency is 10kHz, and the tactile acceleration A is obtained respectively1And tactile acceleration A2The first 500 low-frequency channels are selected from the spectrogram to be used as spectrum images, and the spectrum images are subjected to down-sampling to obtain A1And A2A haptic acceleration spectrum image of size 32 × 32 × 3;
(2-3) separately converting the tactile sounds S by short-time Fourier transform1And a tactile sound S2Conversion to frequency domain, short timeThe window length of Hamming window in Fourier transform is 500, window offset is 100, sampling frequency is 10kHz, and tactile sound S is obtained respectively1And a tactile sound S2The first 500 low-frequency channels are selected from the spectrogram to be used as spectrum images, and the spectrum images are subjected to down-sampling to obtain S1And S2A sound spectrum image of size 32 × 32 × 3;
(3) obtaining convolution characteristics of a visual modality, a tactile acceleration modality and a tactile sound modality through multi-scale feature mapping, and comprising the following steps:
(3-1) subjecting the I obtained in the step (2) to1And I2A 32X 3 visual image, A1And A2Magnitude of 32 × 32 × 3 and S1And S2The size of the input image is d × d × 3, the local receptive field in the neural network has Ψ scale channels, and the Ψ scale channels have r sizes respectively1,r2,…,rΨGenerating K different input weights for each scale channel so as to randomly generate psi multiplied by K feature maps, and recording initial weights of a visual image, a tactile acceleration frequency spectrogram and a sound frequency spectrogram of a phi scale channel randomly generated by a neural network as
Figure FDA0002241220340000021
And
Figure FDA0002241220340000022
Figure FDA0002241220340000023
and
Figure FDA0002241220340000024
are respectively composed of
Figure FDA0002241220340000025
And
Figure FDA0002241220340000026
the method comprises the steps of composing column by column, wherein an upper corner mark I represents visual modals of a training sample and an object to be classified, an upper corner mark A represents a tactile acceleration modals of the training sample and the object to be classified, S represents a tactile sound modals of the training sample and the object to be classified,
Figure FDA0002241220340000027
it is shown that the initial weight is,
Figure FDA0002241220340000028
representing the initial weight for generating the zeta-th feature map, phi is more than or equal to 1 and less than or equal to psi, zeta is more than or equal to 1 and less than or equal to K, and the size of the phi-th scale local receptive field is rΦ×rΦ
Figure FDA0002241220340000029
Figure FDA00022412203400000210
And obtaining the size (d-r) of all K characteristic maps of the phi-th scale channelΦ+1)×(d-rΦ+1);
(3-2) using the singular value decomposition method to orthogonalize the initial weight matrices $\hat{A}_{init}^{I,\Phi}$, $\hat{A}_{init}^{A,\Phi}$ and $\hat{A}_{init}^{S,\Phi}$ of the Φ-th scale channel, obtaining the orthogonal matrices $\hat{A}^{I,\Phi}$, $\hat{A}^{A,\Phi}$ and $\hat{A}^{S,\Phi}$; each column $\hat{a}_{\zeta}^{I,\Phi}$, $\hat{a}_{\zeta}^{A,\Phi}$ and $\hat{a}_{\zeta}^{S,\Phi}$ of these matrices is an orthogonal basis of $\hat{A}_{init}^{I,\Phi}$, $\hat{A}_{init}^{A,\Phi}$ and $\hat{A}_{init}^{S,\Phi}$ respectively; the input weights $a_{\zeta}^{I,\Phi}$, $a_{\zeta}^{A,\Phi}$ and $a_{\zeta}^{S,\Phi}$ of the ζ-th feature map of the Φ-th scale channel are the $r_\Phi\times r_\Phi$ square matrices formed column by column from $\hat{a}_{\zeta}^{I,\Phi}$, $\hat{a}_{\zeta}^{A,\Phi}$ and $\hat{a}_{\zeta}^{S,\Phi}$ respectively;
calculating the convolution feature of node (i, j) in the ζ-th feature map of the Φ-th scale channel of the visual, tactile acceleration and tactile sound modalities respectively by using the following formula:

$$c_{i,j,\zeta}^{\,m,\Phi}(x)=\sum_{u=1}^{r_\Phi}\sum_{v=1}^{r_\Phi} x_{i+u-1,\;j+v-1}\cdot a_{u,v,\zeta}^{\,m,\Phi},\qquad m\in\{I,A,S\},\quad i,j=1,\dots,(d-r_\Phi+1),$$

wherein $c_{i,j,\zeta}^{I,\Phi}(x)$, $c_{i,j,\zeta}^{A,\Phi}(x)$ and $c_{i,j,\zeta}^{S,\Phi}(x)$ respectively represent the convolution feature of node (i, j) of the ζ-th feature map in the Φ-th scale channel of the visual modality, the tactile acceleration modality and the tactile sound modality, and x is the matrix corresponding to node (i, j);
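A minimal sketch of step (3) follows, assuming NumPy and SciPy; `make_orthogonal_weights` and `convolution_features` are hypothetical helper names, and the shapes follow the usual local-receptive-field extreme learning machine construction (K is assumed not to exceed $r_\Phi^2$).

```python
# Hedged sketch of steps (3-1)/(3-2): random multi-scale input weights, their SVD
# orthogonalization, and the "valid" convolution producing (d - r_phi + 1)^2 maps.
import numpy as np
from scipy.signal import correlate2d

def make_orthogonal_weights(r_phi, K, rng=None):
    """Return K orthogonalized receptive fields of size r_phi x r_phi."""
    rng = np.random.default_rng() if rng is None else rng
    A_init = rng.standard_normal((r_phi * r_phi, K))     # one column per feature map
    U, _, _ = np.linalg.svd(A_init, full_matrices=False)  # orthogonal basis of the columns
    return [U[:, z].reshape(r_phi, r_phi) for z in range(U.shape[1])]

def convolution_features(img2d, kernels):
    """img2d: one d x d input channel; kernels: list of r_phi x r_phi input weights."""
    return [correlate2d(img2d, k, mode='valid') for k in kernels]
```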
(4) performing multi-scale square-root pooling on the convolution features of the visual modality, the tactile acceleration modality and the tactile sound modality; the pooling sizes have Ψ scales, whose magnitudes are e_1, e_2, …, e_Ψ respectively; the pooling size e_Φ at the Φ-th scale indicates the distance between the pooling centre and the edge; the pooling map has the same size as the feature map, namely $(d-r_\Phi+1)\times(d-r_\Phi+1)$; the pooling feature is calculated from the convolution feature obtained in step (3) using the following formula:

$$h_{p,q,\zeta}^{\,m,\Phi}=\sqrt{\sum_{i=p-e_\Phi}^{p+e_\Phi}\ \sum_{j=q-e_\Phi}^{q+e_\Phi}\left(c_{i,j,\zeta}^{\,m,\Phi}\right)^{2}},\qquad m\in\{I,A,S\},\quad p,q=1,\dots,(d-r_\Phi+1),$$

if node i or node j lies outside the range (0, (d − r_Φ + 1)), then $c_{i,j,\zeta}^{I,\Phi}$, $c_{i,j,\zeta}^{A,\Phi}$ and $c_{i,j,\zeta}^{S,\Phi}$ are all taken as zero, Φ = 1, 2, 3, …, Ψ, ζ = 1, 2, 3, …, K, wherein $h_{p,q,\zeta}^{I,\Phi}$, $h_{p,q,\zeta}^{A,\Phi}$ and $h_{p,q,\zeta}^{S,\Phi}$ respectively represent the pooling feature of node (p, q) of the ζ-th pooling map in the Φ-th scale channel of the visual modality, the tactile acceleration modality and the tactile sound modality;
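A hedged sketch of the square-root pooling of step (4), treating out-of-range convolution nodes as zero as stated above; `sqrt_pool` is a hypothetical helper name.

```python
# Hedged sketch of step (4): square-root pooling of one convolution feature map.
import numpy as np

def sqrt_pool(c_map, e):
    """c_map: convolution map of one feature map; e: pooling radius of this scale."""
    h, w = c_map.shape
    padded = np.zeros((h + 2 * e, w + 2 * e))     # zero border for out-of-range nodes
    padded[e:e + h, e:e + w] = c_map ** 2
    pooled = np.empty_like(c_map, dtype=float)
    for p in range(h):
        for q in range(w):
            pooled[p, q] = np.sqrt(padded[p:p + 2 * e + 1, q:q + 2 * e + 1].sum())
    return pooled
```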
(5) obtaining the fully connected feature vectors of the three modalities according to the pooling features, which comprises the following steps:
(5-1) connecting all the pooling features of the pooling maps of the visual image modality, the tactile acceleration modality and the tactile sound modality of the ω-th training sample in step (4) into row vectors $h_{\omega}^{I}$, $h_{\omega}^{A}$ and $h_{\omega}^{S}$ respectively, wherein 1 ≤ ω ≤ N1;
(5-2) traversing the N1 training samples and repeating step (5-1) to obtain the row-vector combinations of the visual image modality, the tactile acceleration modality and the tactile sound modality of the N1 training samples, recorded as

$$H^{I}=\begin{bmatrix}h_{1}^{I}\\ \vdots\\ h_{N_{1}}^{I}\end{bmatrix},\qquad H^{A}=\begin{bmatrix}h_{1}^{A}\\ \vdots\\ h_{N_{1}}^{A}\end{bmatrix},\qquad H^{S}=\begin{bmatrix}h_{1}^{S}\\ \vdots\\ h_{N_{1}}^{S}\end{bmatrix},$$

wherein $H^{I}$ is the combined feature-vector matrix of the visual modality, $H^{A}$ is the combined feature-vector matrix of the tactile acceleration modality, and $H^{S}$ is the combined feature-vector matrix of the tactile sound modality;
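A hedged sketch of step (5), flattening the pooling maps of one modality into a row vector per sample and stacking the rows; the helper names are assumptions.

```python
# Hedged sketch of step (5): full connection of the pooling features of one modality.
import numpy as np

def sample_row_vector(pool_maps):
    """pool_maps: list of pooling maps (all scales, all feature maps) of one sample."""
    return np.concatenate([m.ravel() for m in pool_maps])

def modality_matrix(per_sample_pool_maps):
    """per_sample_pool_maps: list over the N1 samples -> N1 x L modality matrix."""
    return np.vstack([sample_row_vector(maps) for maps in per_sample_pool_maps])
```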
(6) performing multi-modal fusion on the fully connected feature vectors of the three modalities to obtain a multi-modal fused mixing matrix, which comprises the following steps:
(6-1) inputting the row vectors of the visual image modality, the tactile acceleration modality and the tactile sound modality of the N1 training samples obtained in step (5) into the mixing layer, and combining them to obtain the mixing matrix $H=[H^{I},H^{A},H^{S}]$;
(6-2) rearranging the mixed row vector of each sample in the mixing matrix H of step (6-1) to generate a multi-modal fused two-dimensional mixing matrix of size d′ × d″, wherein d′ is the length (number of rows) of the two-dimensional matrix, its value range being such that d′ × d″ covers the length of the mixed row vector of one sample;
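A hedged sketch of step (6); the near-square choice of d′ and the zero-padding used when d′ × d″ exceeds the row-vector length are assumptions, since the text above only fixes that each mixed row vector is rearranged into a d′ × d″ matrix.

```python
# Hedged sketch of step (6): concatenating the three modality matrices and
# reshaping each sample's mixed row vector into a d' x d'' two-dimensional matrix.
import numpy as np

def mixing_matrices(H_I, H_A, H_S):
    H = np.hstack([H_I, H_A, H_S])        # N1 x (sum of the three modality lengths)
    length = H.shape[1]
    d1 = int(np.floor(np.sqrt(length)))   # assumed near-square layout
    d2 = int(np.ceil(length / d1))
    out = []
    for row in H:
        padded = np.zeros(d1 * d2)
        padded[:length] = row             # zero-pad the tail if d1 * d2 > length
        out.append(padded.reshape(d1, d2))
    return out                            # one d' x d'' matrix per sample
```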
(7) inputting the multi-modal fused mixing matrix obtained in step (6) into the mixing network layer of the neural network, and obtaining the multi-modal mixed convolution features through multi-scale feature mapping, which comprises the following steps:
(7-1) inputting the multi-modal fused mixing matrix obtained in step (6-2), of size d′ × d″, into the mixing network; the mixing network has Ψ′ scale channels whose sizes are r_1, r_2, …, r_{Ψ′} respectively, and each scale channel generates K′ different input weights, so that Ψ′ × K′ mixed feature maps are randomly generated; the mixed initial weight of the Φ′-th scale channel randomly generated by the mixing network is recorded as $\hat{A}_{init}^{hybrid,\Phi'}$, which is composed column by column of $\hat{a}_{init,\zeta'}^{hybrid,\Phi'}$, wherein the superscript hybrid denotes the tri-modal fusion, $\hat{A}_{init}^{hybrid,\Phi'}$ denotes an initial weight of the mixing network, and $\hat{a}_{init,\zeta'}^{hybrid,\Phi'}$ denotes the initial weight generating the ζ′-th mixed feature map, with 1 ≤ Φ′ ≤ Ψ′ and 1 ≤ ζ′ ≤ K′; the size of the Φ′-th scale channel local receptive field is $r_{\Phi'}\times r_{\Phi'}$, so that

$$\hat{A}_{init}^{hybrid,\Phi'}\in\mathbb{R}^{r_{\Phi'}^{2}\times K'},\qquad \hat{a}_{init,\zeta'}^{hybrid,\Phi'}\in\mathbb{R}^{r_{\Phi'}^{2}},$$

and the size of the ζ′-th feature map of the Φ′-th scale channel is $(d'-r_{\Phi'}+1)\times(d''-r_{\Phi'}+1)$;
(7-2) using the singular value decomposition method to orthogonalize the initial weight matrix $\hat{A}_{init}^{hybrid,\Phi'}$ of the Φ′-th scale channel, obtaining the orthogonal matrix $\hat{A}^{hybrid,\Phi'}$; each column $\hat{a}_{\zeta'}^{hybrid,\Phi'}$ of $\hat{A}^{hybrid,\Phi'}$ is an orthogonal basis of $\hat{A}_{init}^{hybrid,\Phi'}$; the input weight $a_{\zeta'}^{hybrid,\Phi'}$ of the ζ′-th feature map of the Φ′-th scale channel is the $r_{\Phi'}\times r_{\Phi'}$ square matrix formed from $\hat{a}_{\zeta'}^{hybrid,\Phi'}$;
calculating the mixed convolution feature of convolution node (i′, j′) in the ζ′-th feature map of the Φ′-th scale channel by using the following formula:

$$c_{i',j',\zeta'}^{\,hybrid,\Phi'}(x')=\sum_{u=1}^{r_{\Phi'}}\sum_{v=1}^{r_{\Phi'}} x'_{i'+u-1,\;j'+v-1}\cdot a_{u,v,\zeta'}^{\,hybrid,\Phi'},$$

wherein $c_{i',j',\zeta'}^{hybrid,\Phi'}(x')$ is the mixed convolution feature of convolution node (i′, j′) in the ζ′-th feature map of the Φ′-th scale channel, and x′ is the matrix corresponding to node (i′, j′);
(8) performing mixed multi-scale square-root pooling on the mixed convolution features; the pooling sizes have Ψ′ scales, whose magnitudes are e_1, e_2, …, e_{Ψ′} respectively; the pooling map at the Φ′-th scale has the same size as the feature map, namely $(d'-r_{\Phi'}+1)\times(d''-r_{\Phi'}+1)$; the mixed pooling feature is calculated from the mixed convolution feature obtained in step (7) using the following formula:

$$h_{p',q',\zeta'}^{\,hybrid,\Phi'}=\sqrt{\sum_{i'=p'-e_{\Phi'}}^{p'+e_{\Phi'}}\ \sum_{j'=q'-e_{\Phi'}}^{q'+e_{\Phi'}}\left(c_{i',j',\zeta'}^{\,hybrid,\Phi'}\right)^{2}},$$

if node i′ lies outside the range (0, (d′ − r_{Φ′} + 1)) or node j′ lies outside the range (0, (d″ − r_{Φ′} + 1)), then $c_{i',j',\zeta'}^{hybrid,\Phi'}$ is taken as zero, Φ′ = 1, 2, 3, …, Ψ′, ζ′ = 1, 2, 3, …, K′; wherein $h_{p',q',\zeta'}^{hybrid,\Phi'}$ represents the mixed pooling feature of the combined node (p′, q′) of the ζ′-th pooling map of the Φ′-th scale channel;
(9) according to the mixed pooling features, fully connecting the mixed pooling feature vectors of different scales by the method of step (5) to obtain the combined feature matrix of the mixing network

$$H^{hybrid}\in\mathbb{R}^{\,N_{1}\times K'\sum_{\Phi'=1}^{\Psi'}(d'-r_{\Phi'}+1)(d''-r_{\Phi'}+1)},$$

wherein K′ represents the number of different feature maps generated by each scale channel;
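Since the mixing network of steps (7) to (9) repeats the convolution, pooling and flattening of steps (3) to (5) on the d′ × d″ mixed matrix of each sample, a hedged per-sample sketch can reuse the helpers from the earlier sketches; all function names are assumptions.

```python
# Hedged sketch of steps (7)-(9): hybrid-network feature extraction for one sample.
# Reuses make_orthogonal_weights, convolution_features, sqrt_pool and
# sample_row_vector from the earlier sketches.
def hybrid_features(mixed_matrix, scales, K_prime, pool_radii):
    maps = []
    for r_phi, e_phi in zip(scales, pool_radii):
        kernels = make_orthogonal_weights(r_phi, K_prime)
        conv = convolution_features(mixed_matrix, kernels)
        maps.extend(sqrt_pool(c, e_phi) for c in conv)
    return sample_row_vector(maps)   # one row of the H^hybrid matrix
```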
(10) according to the combined feature matrix $H^{hybrid}$ of the mixing network obtained in step (9) and the number of training samples N1, calculating the training-sample output weight β of the neural network using the following formulas, where L denotes the number of columns of $H^{hybrid}$:

if $N_{1}\le L$, then

$$\beta=\left(H^{hybrid}\right)^{T}\left(\frac{I}{C}+H^{hybrid}\left(H^{hybrid}\right)^{T}\right)^{-1}T,$$

if $N_{1}>L$, then

$$\beta=\left(\frac{I}{C}+\left(H^{hybrid}\right)^{T}H^{hybrid}\right)^{-1}\left(H^{hybrid}\right)^{T}T,$$

where T is the label matrix of the training samples, C is a regularization coefficient which may take an arbitrary positive value, and the superscript T denotes matrix transposition;
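A hedged sketch of the regularized output-weight computation of step (10), assuming T is a one-hot label matrix of size N1 × number of classes; the helper name is an assumption.

```python
# Hedged sketch of step (10): regularized least-squares output weights of the
# extreme-learning-machine classifier.
import numpy as np

def output_weights(H_hybrid, T, C=1.0):
    N1, L = H_hybrid.shape
    if N1 <= L:
        G = np.eye(N1) / C + H_hybrid @ H_hybrid.T
        return H_hybrid.T @ np.linalg.solve(G, T)
    G = np.eye(L) / C + H_hybrid.T @ H_hybrid
    return np.linalg.solve(G, H_hybrid.T @ T)
```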
(11) using the orthogonal matrices $\hat{A}^{I,\Phi}$, $\hat{A}^{A,\Phi}$ and $\hat{A}^{S,\Phi}$ obtained after the initial weight orthogonalization of the three modalities in step (3), applying the methods of steps (3) to (9) to the preprocessed data set D2 to be classified, to obtain the combined feature matrix $H^{test}$ of the mixing network of the samples to be classified;
(12) according to the training-sample output weight β of step (10) and the combined feature matrix $H^{test}$ of the mixing network of the samples to be classified of step (11), calculating the prediction labels $\mu_{\varepsilon}$ of the N2 samples to be classified by the following formula, thereby realizing object material classification based on multi-modal fusion deep learning:

$$\mu_{\varepsilon}=H^{test}\beta,\qquad 1\le\varepsilon\le M_{2}.$$
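A hedged sketch of step (12); taking the class as the index of the largest entry of each prediction vector is an assumption about how the label μ is read off.

```python
# Hedged sketch of step (12): predicting material classes for the samples to be classified.
import numpy as np

def predict(H_test, beta):
    mu = H_test @ beta            # one prediction vector per sample
    return np.argmax(mu, axis=1)  # predicted material class indices
```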