CN110706253A - Target tracking method, system and device based on apparent feature and depth feature - Google Patents

Target tracking method, system and device based on apparent feature and depth feature

Info

Publication number
CN110706253A
CN110706253A
Authority
CN
China
Prior art keywords
target
depth
feature
features
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910884524.4A
Other languages
Chinese (zh)
Other versions
CN110706253B (en)
Inventor
胡卫明
李晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910884524.4A priority Critical patent/CN110706253B/en
Publication of CN110706253A publication Critical patent/CN110706253A/en
Application granted granted Critical
Publication of CN110706253B publication Critical patent/CN110706253B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 Image analysis
                    • G06T 7/20 Analysis of motion
                        • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 Image acquisition modality
                        • G06T 2207/10016 Video; Image sequence
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20081 Training; Learning
                        • G06T 2207/20084 Artificial neural networks [ANN]
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 18/00 Pattern recognition
                    • G06F 18/20 Analysing
                        • G06F 18/25 Fusion techniques
                            • G06F 18/253 Fusion techniques of extracted features
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods
                            • G06N 3/084 Backpropagation, e.g. using gradient descent
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
                    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision tracking, and in particular relates to a target tracking method, system and device based on apparent features and depth features, aiming to solve the problem of low tracking accuracy caused by existing target tracking methods ignoring the depth information of the target scene. The method includes: obtaining the target region and the search region of the target to be tracked in the t-th frame image according to the target position of frame t-1 and a preset target size; extracting the apparent features and depth features of the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively; performing a weighted average of the apparent features and depth features of the target region and of the search region, based on preset weights, to obtain their respective fused features; obtaining the response map of the target through a correlation filter according to the fused features of the target region and the search region; and taking the position corresponding to the peak of the response map as the target position in frame t. The invention extracts the depth information of the target scene and improves the accuracy of target tracking.

Description

Target tracking method, system and device based on apparent features and depth features

Technical Field

The invention belongs to the technical field of computer vision tracking, and in particular relates to a target tracking method, system and device based on apparent features and depth features.

Background Art

Object tracking is one of the most fundamental problems in the field of computer vision; its task is to estimate the motion trajectory of an object or image region in a video sequence. Object tracking has a very wide range of applications in real-world scenarios, often acting as a component of a larger computer vision system. For example, both autonomous driving and vision-based active safety systems rely on tracking the locations of vehicles, cyclists and pedestrians. In robotic systems, tracking objects of interest is a very important aspect of visual perception, from which high-level information is extracted from camera sensors for decision-making and navigation. In addition to robotics-related applications, object tracking is often used for automatic video analysis; in automatic sports analysis, information is first extracted by detecting and tracking the players and objects involved in the game. Other applications include augmented reality and dynamic structure techniques, whose task is usually to track different local image regions. As can be seen from the diversity of applications, the object tracking problem itself is very diverse.

In recent decades, breakthroughs have been made in the field of target tracking and many classic research results have been produced. However, many theoretical and technical problems remain to be solved, especially the complex problems encountered in open environments during tracking, such as background interference, illumination changes, scale changes and occlusion. Therefore, how to track targets adaptively, in real time and robustly in complex scenes has always been a problem that researchers need to solve, and the research value and research space remain large.

For single-target tracking, the quality of the features directly determines the tracking performance. Early discriminative models based on hand-crafted features could only extract shallow features of the target object and could not describe its essence well. Recently developed convolutional neural networks can learn feature representations of the target object at different levels through a hierarchical structure, but they ignore the global depth information of the target scene. Depth information can be used as an auxiliary feature to provide global information for the target and alleviate problems such as occlusion, thereby improving the robustness of the model in complex scenes. Therefore, the present invention proposes a target tracking method based on apparent features and depth features.

Summary of the Invention

In order to solve the above problem in the prior art, that is, the problem that existing target tracking methods ignore the depth information of the target scene and thus suffer from low tracking accuracy, a first aspect of the present invention proposes a target tracking method based on apparent features and depth features, the method comprising:

Step S10, obtaining the region of the target to be tracked in the t-th frame image according to the target position of frame t-1 and a preset target size, and taking this region as the target region; and obtaining a region N times the size of the target region centered on the target position of frame t-1, taking it as the search region;

Step S20, extracting the apparent features and depth features corresponding to the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively;

Step S30, performing a weighted average of the apparent features and depth features corresponding to the target region and to the search region, respectively, based on preset weights, to obtain the fused feature of the target region and the fused feature of the search region;

Step S40, obtaining the response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region; and taking the position corresponding to the peak of the response map as the target position of frame t;

wherein,

the apparent feature extraction network is constructed based on a convolutional neural network and is used to obtain the corresponding apparent features of an input image;

the depth feature extraction network is constructed based on a ResNet network and is used to obtain the corresponding depth features of an input image.

In some preferred embodiments, in step S10, "obtaining the region of the target to be tracked in the t-th frame image according to the target position of frame t-1 and the preset target size, and taking this region as the target region" is carried out as follows: if t is equal to 1, the target region of the target to be tracked is obtained according to a preset target position and the preset target size; if t is greater than 1, the target region of the target to be tracked is obtained according to the target position of frame t-1 and the preset target size.

In some preferred embodiments, the structure of the apparent feature extraction network is: two convolutional layers and one correlation filter layer, with a max pooling layer and a ReLU activation function connected after each convolutional layer; the network is trained with the back-propagation algorithm during the training process.

In some preferred embodiments, the structure of the depth feature extraction network is: 5 convolutional layers and 5 deconvolutional layers; the depth feature extraction network is trained through mutual reconstruction of binocular (stereo) images during the training process.

In some preferred embodiments, in the process of extracting depth features by the depth feature extraction network, if t is equal to 1, the extraction method is:

obtaining the depth features of the first frame image based on the depth feature extraction network;

obtaining the depth features of the target region and of the search region based on the depth features of the first frame image and the preset target position.

In some preferred embodiments, when the correlation filter filters the target region, different scales are obtained by a scale transformation method, the target region is enlarged or reduced according to the different scales, and filtering is then performed. The scale transformation method is:

a^{s}, \quad s \in \left\{ -\tfrac{S-1}{2}, \dots, -1, 0, 1, \dots, \tfrac{S-1}{2} \right\}

where a is the scale coefficient, s is the scale pool, S is the preset number of scales, and a^s is the scale.

In some preferred embodiments, after step S40, the method further includes updating the state value of the correlation filter, as follows:

obtaining the state value of the correlation filter at frame t-1;

updating the state value of the correlation filter at frame t based on this state value, the target position of frame t, and a preset learning rate.

In a second aspect of the present invention, a target tracking system based on apparent features and depth features is proposed. The system includes a region acquisition module, a feature extraction module, a feature fusion module, and an output position module;

the region acquisition module is configured to obtain the region of the target to be tracked in the t-th frame image according to the target position of frame t-1 and the preset target size, and to take this region as the target region; and to obtain a region N times the size of the target region centered on the target position of frame t-1, taking it as the search region;

the feature extraction module is configured to extract the apparent features and depth features corresponding to the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively;

the feature fusion module is configured to perform a weighted average of the apparent features and depth features corresponding to the target region and to the search region, respectively, based on preset weights, to obtain the fused feature of the target region and the fused feature of the search region;

the output position module is configured to obtain the response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region, and to take the position corresponding to the peak of the response map as the target position of frame t;

wherein,

the apparent feature extraction network is constructed based on a convolutional neural network and is used to obtain the corresponding apparent features of an input image;

the depth feature extraction network is constructed based on a ResNet network and is used to obtain the corresponding depth features of an input image.

In a third aspect of the present invention, a storage device is proposed, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned target tracking method based on apparent features and depth features.

In a fourth aspect of the present invention, a processing device is proposed, including a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above-mentioned target tracking method based on apparent features and depth features.

Beneficial effects of the present invention:

The present invention extracts the depth information of the target scene and improves the accuracy of target tracking. By integrating correlation filtering into the convolutional neural network, the learned convolutional features can be tightly coupled with the correlation filter and are therefore better suited to the target tracking task. Since the correlation filtering is derived in the frequency domain and maintains high efficiency, the algorithm can greatly improve the tracking performance while remaining real-time.

Moreover, by fusing the depth features with the apparent features, the present invention provides a complementary feature where a single feature cannot express the attributes of the target well. Since the depth feature is extracted from the entire frame of the target scene, it carries global information, including depth information that the apparent features do not have, which alleviates problems such as partial occlusion and deformation of the target and makes the tracking algorithm more robust.

Brief Description of the Drawings

Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings.

Fig. 1 is a schematic flowchart of a target tracking method based on apparent features and depth features according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the framework of a target tracking method based on apparent features and depth features according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the framework of the training process of a target tracking method based on apparent features and depth features according to an embodiment of the present invention;

Fig. 4 is an example diagram of a practical application of a target tracking method based on apparent features and depth features according to an embodiment of the present invention.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit the invention. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.

It should be noted that, in the absence of conflict, the embodiments in the present application and the features of the embodiments may be combined with each other.

The target tracking method based on apparent features and depth features of the present invention includes the following steps:

Step S10, obtaining the region of the target to be tracked in the t-th frame image according to the target position of frame t-1 and the preset target size, and taking this region as the target region; and obtaining a region N times the size of the target region centered on the target position of frame t-1, taking it as the search region;

Step S20, extracting the apparent features and depth features corresponding to the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively;

Step S30, performing a weighted average of the apparent features and depth features corresponding to the target region and to the search region, respectively, based on preset weights, to obtain the fused feature of the target region and the fused feature of the search region;

Step S40, obtaining the response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region; and taking the position corresponding to the peak of the response map as the target position of frame t;

wherein,

the apparent feature extraction network is constructed based on a convolutional neural network and is used to obtain the corresponding apparent features of an input image;

the depth feature extraction network is constructed based on a ResNet network and is used to obtain the corresponding depth features of an input image.

In order to describe the target tracking method based on apparent features and depth features of the present invention more clearly, each step of an embodiment of the method of the present invention is described in detail below with reference to Fig. 1.

In the present invention, a computer with a 2.8 GHz central processing unit and 1 GB of memory is used; the training process of the networks is implemented under the PyTorch framework; both the training and testing of the whole network are processed in parallel on multiple NVIDIA TITAN XP GPUs; and the working program of the whole target tracking technique is written in the Python language, realizing the method of the present invention.

In the preferred embodiments below, the apparent feature extraction network, the depth feature extraction network and the correlation filter are first described in detail, and then the target tracking method based on apparent features and depth features, which uses the apparent feature extraction network, the depth feature extraction network and the correlation filter to obtain the position of the target to be tracked, is described in detail.

1. Training of the apparent feature extraction network, the depth feature extraction network and the correlation filter

Step A1, constructing the training dataset

In the present invention, the data of the training set comes from the OTB100 dataset, which contains 100 videos annotated frame by frame, 11 attributes of target appearance variation, and 2 evaluation metrics. The 11 attributes are: illumination variation, scale variation, occlusion, non-rigid deformation, motion blur, fast motion, horizontal rotation, vertical rotation, moving out of view, background clutter, and low resolution.

The two evaluation metrics are the center location error (CLE) and the bounding-box overlap rate (Overlap Score, OS). The first evaluation metric, based on the center location error, i.e. the precision plot, is defined from the average Euclidean distance between the center position of the tracked target and the center of the manually annotated rectangular box, expressed mathematically as formula (1):

\delta_{gp} = \sqrt{(x_{p} - x_{g})^{2} + (y_{p} - y_{g})^{2}}    (1)

where (x_g, y_g) denotes the position of the manually annotated rectangular box (ground truth) and (x_p, y_p) is the predicted target position in the current frame.

If δ_gp is smaller than a given threshold, the result of this frame is regarded as successful. In the precision plot, the threshold of δ_gp is set to 20 pixels. The precision plot cannot give a comparison of the estimated target size and shape, because the center location error only quantifies the pixel difference. Therefore, the more robust success rate plot is often used to evaluate algorithms. For the second evaluation criterion, based on the overlap rate, i.e. the success rate plot: assuming the predicted rectangular box is r_t and the manually annotated rectangular box is r_a, the overlap score (Overlap Score, OS) is calculated as formula (2):

S = |r_t ∩ r_a| / |r_t ∪ r_a|    (2)

where ∪ and ∩ denote the union and intersection of the two regions respectively, and |·| denotes the number of pixels in a region. OS is used to determine whether the tracking algorithm has successfully tracked the target in the current video frame. Frames with an OS score greater than a threshold are regarded as frames in which the target is tracked successfully. In the success rate plot, the threshold varies between 0 and 1, so the resulting plot is a curve. We use the area under the curve of the precision plot and the success rate plot to represent the performance of the algorithm.
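If it helps to make these measures concrete, the following is a minimal NumPy sketch of the per-frame center location error of formula (1) and the overlap score of formula (2). The (x, y, w, h) box layout and the example values are illustrative assumptions, not anything specified in the patent.

```python
import numpy as np

def center_location_error(pred_center, gt_center):
    """delta_gp = Euclidean distance between predicted and ground-truth centers, formula (1)."""
    (xp, yp), (xg, yg) = pred_center, gt_center
    return float(np.hypot(xp - xg, yp - yg))

def overlap_score(pred_box, gt_box):
    """OS = |r_t ∩ r_a| / |r_t ∪ r_a| for axis-aligned boxes (x, y, w, h), formula (2)."""
    xa, ya, wa, ha = pred_box
    xb, yb, wb, hb = gt_box
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))   # width of the intersection
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))   # height of the intersection
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

print(center_location_error((110, 95), (100, 100)))            # 11.18...
print(overlap_score((100, 100, 50, 40), (110, 105, 50, 40)))   # ~0.54
```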

Step A2, offline training of the apparent feature extraction network

First, prepare the training data: the VID dataset in ImageNet, which contains more than 3,000 video sequences. Second, design the network structure. Since the real-time performance of target tracking is also a very important indicator for evaluating an algorithm, the present invention designs a lightweight network. We use a Siamese network as the basic structure, containing two convolutional layers in total; the size of the input data is 125*125, and a max pooling layer and a ReLU activation function are connected after each convolutional layer. On this basis, a correlation filter layer is added, and the back-propagation of the network is derived. The whole process can be described as: given the features \varphi(z) of the search region, obtain the expected response g(z), which takes its highest value at the true target position. The objective function is solved as shown in formulas (3), (4) and (5):

L(\theta) = \sum_{i=1}^{D} \left\| g(z_{i}) - y_{i} \right\|^{2} + \gamma \left\| \theta \right\|^{2}    (3)

g(z) = \mathcal{F}^{-1}\left( \sum_{l} \hat{w}^{l*} \odot \hat{\varphi}^{l}(z) \right)    (4)

\hat{w}^{l} = \frac{ \hat{y}^{*} \odot \hat{\varphi}^{l}(x) }{ \sum_{k} \hat{\varphi}^{k}(x)^{*} \odot \hat{\varphi}^{k}(x) + \lambda }    (5)

where θ is the network parameter, y is the standard Gaussian response, γ is the regularization coefficient, L(θ) is the objective loss function, D is the total number of video frames, l is the filter channel, \hat{w}^{l} is the learned correlation filter, ⊙ is the element-wise (Hadamard) product, \hat{\varphi}(z) is the extracted search region feature, z is the search region, w is the correlation filter, \mathcal{F}^{-1} is the inverse Fourier transform, \hat{y}^{*} is the complex conjugate of the discrete Fourier transform of the standard Gaussian response, k is the current filter channel, \hat{\varphi}^{k}(x) is the feature of the target region, \hat{\varphi}^{k}(x)^{*} is the complex conjugate of the target region feature, λ is the regularization coefficient, and ·* denotes taking the complex conjugate.

The objective function should contain explicit regularization; otherwise, the optimization will run into a non-convergence condition. In conventional parameter optimization, a weight decay method is used to impose this regularization implicitly. In addition, in order to limit the magnitude of the feature map values and increase the stability of the training process, a local response normalization (LRN) layer is added at the end of the convolutional layers. The detection branch and the learning branch are back-propagated based on the deep learning framework PyTorch; once the error has been back-propagated to the real-valued feature maps, the rest of the back-propagation can be carried out as conventional CNN optimization. Since all back-propagation operations in the correlation filter layer are still Hadamard operations in the Fourier frequency domain, the efficiency of the correlation filter (DCF) can be maintained and offline training can be applied to large-scale datasets. After the offline training is completed, a specific feature extractor is obtained for the online discriminative correlation filter tracking algorithm.
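As an illustration of the correlation filter layer described by formulas (3)-(5), the following is a minimal NumPy sketch: the filter is solved in closed form in the Fourier domain from target-region features and applied to search-region features to produce a response map. Feature extraction is omitted; `feat_x` and `feat_z` are random stand-ins for the network outputs, and all names are assumptions rather than the patent's implementation.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Centered 2-D Gaussian used as the desired response y."""
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (ys - h // 2) ** 2 + (xs - w // 2) ** 2
    return np.exp(-0.5 * dist2 / sigma ** 2)

def solve_filter(feat_x, y, lam=1e-4):
    """Closed-form multi-channel filter of formula (5), computed in the Fourier domain."""
    X = np.fft.fft2(feat_x, axes=(-2, -1))           # per-channel DFT of target-region features
    Y = np.fft.fft2(np.fft.ifftshift(y))             # DFT of the Gaussian label
    num = np.conj(Y)[None] * X                       # \hat{y}^* ⊙ \hat{φ}^l(x)
    den = (np.conj(X) * X).real.sum(axis=0) + lam    # Σ_k \hat{φ}^k(x)^* ⊙ \hat{φ}^k(x) + λ
    return num / den[None]                           # \hat{w}^l

def response(filt_hat, feat_z):
    """Response map over the search region, formula (4)."""
    Z = np.fft.fft2(feat_z, axes=(-2, -1))
    g_hat = (np.conj(filt_hat) * Z).sum(axis=0)      # Σ_l \hat{w}^{l*} ⊙ \hat{φ}^l(z)
    return np.real(np.fft.ifft2(g_hat))              # inverse Fourier transform

# Usage with random stand-in features (32 channels, 125x125 as in the text)
feat_x = np.random.randn(32, 125, 125)
feat_z = np.random.randn(32, 125, 125)
w_hat = solve_filter(feat_x, gaussian_label(125, 125))
g = response(w_hat, feat_z)
print(g.shape, np.unravel_index(g.argmax(), g.shape))  # peak gives the predicted offset
```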

Step A3, training the depth feature network

Given a single image I at test time, the goal is to learn a function f that can predict the scene depth at each pixel, as shown in formula (6):

\hat{d} = f(I)    (6)

where \hat{d} is the depth information.

Most existing learning-based methods treat this as a supervised learning problem, in which color input images and their corresponding target depth values are available at training time. As an alternative, depth estimation can be treated as an image reconstruction problem during training. Specifically, two images can be input, the left and right color images I^l and I^r acquired at the same time with a standard binocular (stereo) camera, where l and r denote left and right. Instead of trying to predict depth information directly, we try to find a correspondence d^r which, when applied to the left image, can reconstruct the right image. The reconstructed image I^l(d^r) is denoted \tilde{I}^r. Similarly, the left image can also be estimated from the given right image, \tilde{I}^l = I^r(d^l). Assuming the images are rectified, d corresponds to the image disparity, i.e. the per-pixel scalar value that the model will learn to predict. In this way, only a single left image is required as the input of the convolutional layers, while the right image is used only during training. Using this novel left-right consistency loss to enforce consistency between the two disparity maps yields more accurate results. The structure of the depth feature network consists of an encoder-decoder ResNet network, containing 5 convolutional layers and 5 deconvolutional layers in total. The decoder uses skip connections from the encoder activation blocks, enabling it to resolve higher-resolution details. Regarding the training loss, a loss function C_f is defined at each scale f, and the total loss is the arithmetic sum over the scales, C = Σ_f C_f. Each per-scale loss is in turn a combination of three main loss terms, as shown in formula (7):

C_f = \alpha_{ap}\,(C_{ap}^{l} + C_{ap}^{r}) + \alpha_{ds}\,(C_{ds}^{l} + C_{ds}^{r}) + \alpha_{lr}\,(C_{lr}^{l} + C_{lr}^{r})    (7)

where C_ap is the appearance matching loss, which measures how similar the reconstructed image is to the corresponding training input, C_ds is the disparity smoothness loss, C_lr is the left-right disparity consistency loss, C_ap^l and C_ap^r denote the appearance matching losses of the left and right images, C_ds^l and C_ds^r denote the disparity smoothness losses of the left and right images, C_lr^l and C_lr^r denote the left-right consistency losses of the left and right images, and α_ap, α_ds and α_lr are the weight coefficients of the three loss terms. Each of the main loss terms contains a left-image variant and a right-image variant, but only the left image is fed into the convolutional layers.
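To make the reconstruction idea concrete, here is a small, assumption-laden NumPy sketch of warping the left image with a predicted disparity map to obtain an estimate of the right view, i.e. the quantity that the appearance matching loss compares against I^r. Nearest-neighbour sampling and the sign convention x_left = x_right + d are simplifying assumptions; a full implementation would typically use differentiable bilinear sampling and the complete loss of formula (7).

```python
import numpy as np

def warp_with_disparity(img_left: np.ndarray, disp: np.ndarray) -> np.ndarray:
    """Reconstruct the right view by sampling the left image at x + d(x, y)."""
    h, w = disp.shape
    xs = np.tile(np.arange(w, dtype=np.float64), (h, 1))
    src_x = np.clip(np.round(xs + disp).astype(int), 0, w - 1)   # nearest-neighbour sampling
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    return img_left[rows, src_x]

# Toy usage: a 4x6 grayscale image and a constant disparity of 1 pixel
left = np.arange(24, dtype=np.float64).reshape(4, 6)
print(warp_with_disparity(left, np.ones((4, 6))))
```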

Step A4, training the correlation filter

As shown in Fig. 3, based on the training set constructed in step A1, the apparent features and depth features corresponding to the dataset are obtained through the apparent feature extraction network trained in step A2 and the depth feature extraction network trained in step A3.

In order to increase the robustness of the tracking performance and prevent the target from being interfered with by the background, in Fig. 3 the target template, i.e. the target region, is obtained according to the target position of frame t-1 and the preset target size, and a region 2 times the size of the target region around it is selected as the search region and input into the apparent feature extraction network. The purpose of this is to include more background information, prevent target drift, and increase the discriminative power of the model. Similarly, the depth features of the target template and the search region are extracted separately in the depth feature extraction network. In order to obtain more depth features, in the first frame of the video the whole image is first input into the depth feature extraction network to extract the depth information of the whole image, and then the depth information is cropped at the target region, which is the depth feature we need.

Since the dimensions of the extracted apparent features and depth features do not match (the apparent features are 32-dimensional while the depth features are 1-dimensional), in order to prevent the influence of the depth features on the algorithm from being weakened by the large difference in dimensionality during feature fusion, we introduce a weight coefficient α to fuse the features. The fusion process is shown in formula (8):

\psi(x)_{k} = \alpha\, \varphi_{A}(x)_{k} + (1 - \alpha)\, \varphi_{D}(x)_{k}    (8)

where \varphi_{A}(x)_{k} is the apparent feature extracted for the k-th frame, \varphi_{D}(x)_{k} is the depth feature extracted for the k-th frame, and ψ(x)_k denotes the fused feature of the k-th frame.
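A hedged sketch of the fusion step follows: the 32-channel apparent feature and the 1-channel depth feature are combined with a preset weight α, matching the "weighted average with preset weights" wording of the claims. The exact weighting formula in the original is rendered as an image, so the (α, 1-α) form and the value of α below are assumptions.

```python
import numpy as np

def fuse_features(feat_apparent: np.ndarray, feat_depth: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """feat_apparent: (32, H, W); feat_depth: (1, H, W). Returns the fused (32, H, W) feature."""
    assert feat_apparent.shape[1:] == feat_depth.shape[1:]
    return alpha * feat_apparent + (1.0 - alpha) * feat_depth   # depth map broadcasts over channels

fused = fuse_features(np.random.randn(32, 125, 125), np.random.randn(1, 125, 125))
print(fused.shape)  # (32, 125, 125)
```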

The fused features are cyclically shifted to generate positive and negative samples, and the correlation filter template is trained through the formula

\min_{W} \sum_{i} \left( f(x_{i}) - y_{i} \right)^{2} + \lambda \left\| W \right\|^{2}

where f(x_i) is the estimated target response, y_i is the standard Gaussian response, and W is the correlation filter.

In order to train the correlation filter, the optimization formula for the difference between the predicted response map and the ideal response map is shown in formula (9):

\varepsilon = \left\| \mathcal{F}^{-1}\!\left( \frac{ \hat{y}^{*} \odot \hat{\psi}(x) }{ \hat{\psi}(x)^{*} \odot \hat{\psi}(x) + \lambda } \odot \hat{\psi}(z) \right) - y \right\|^{2}    (9)

where \hat{\psi}(x) is the fused feature of the target region, \hat{\psi}(z) is the fused feature of the search region, and \hat{\psi}(x)^{*} is the complex conjugate of the fused feature of the target region.

When a new frame arrives, in order to detect the target, the apparent features and depth features are first extracted from the target region estimated in the previous frame, and then combined to obtain a unified feature. With the help of the correlation filter layer, the response map of the template over the search region can be calculated by formula (10):

g(z) = \mathcal{F}^{-1}\left( \sum_{l} \hat{w}^{l*} \odot \hat{\psi}^{l}(z) \right)    (10)

where \hat{w}^{l*} is the complex conjugate of the correlation filter.

The position of the target in the current video frame is then obtained from the maximum value of the response map. To ensure the robustness of the proposed model, the filter h is updated with a predefined learning rate η, as shown in formula (11):

\hat{w}_{k+1} = (1 - \eta)\, \hat{w}_{k} + \eta\, \hat{w}    (11)

where \hat{w}_{k+1} is the correlation filter of the (k+1)-th frame and \hat{w}_{k} is the correlation filter of the k-th frame.
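As a small illustration of the update in formula (11), assuming the filter is stored as a complex NumPy array in the Fourier domain and eta is the predefined learning rate mentioned above:

```python
import numpy as np

def update_filter(w_prev: np.ndarray, w_new: np.ndarray, eta: float = 0.01) -> np.ndarray:
    """w_{k+1} = (1 - eta) * w_k + eta * w_new, formula (11)."""
    return (1.0 - eta) * w_prev + eta * w_new
```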

As for the scale change of the target, we use pyramid image patches with scale factors for scale filtering:

\left\{ a^{s} \mid s = -\tfrac{S-1}{2}, \dots, -1, 0, 1, \dots, \tfrac{S-1}{2} \right\}

where a is the scale coefficient, s is the scale pool, and S is the number of scales to be taken. For example, if 3 scales are used in our algorithm, i.e. S = 3, the scale pool is s = (-1, 0, 1), and a raised to the power s gives the scales to be taken, i.e. (0.97, 1, 1.03). Each scale is then applied to the target region: 0.97 means the region is shrunk to 0.97 times its size so the target appears smaller, and 1.03 means it is enlarged. Filtering is then performed, and the scale with the largest response value is the one we want.
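The scale search can be sketched as follows, under assumed helpers: build the scale pool a^s, resize the target region by each scale, score each resized region with the correlation response, and keep the best. `evaluate_response` stands in for the filter evaluation and is not taken from the patent.

```python
import numpy as np

def scale_pool(a: float = 1.03, S: int = 3) -> np.ndarray:
    s = np.arange(S) - (S - 1) / 2.0        # e.g. S=3 -> [-1, 0, 1]
    return a ** s                            # e.g. [0.97, 1.0, 1.03]

def best_scale(region: np.ndarray, evaluate_response, a: float = 1.03, S: int = 3) -> float:
    scores = []
    for scale in scale_pool(a, S):
        h = max(1, int(round(region.shape[0] * scale)))
        w = max(1, int(round(region.shape[1] * scale)))
        # nearest-neighbour resize of the region to the scaled size
        rows = np.clip((np.arange(h) / scale).astype(int), 0, region.shape[0] - 1)
        cols = np.clip((np.arange(w) / scale).astype(int), 0, region.shape[1] - 1)
        scores.append(evaluate_response(region[np.ix_(rows, cols)]))
    return scale_pool(a, S)[int(np.argmax(scores))]

print(scale_pool())  # [0.97087379 1.         1.03      ]
```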

The obtained target response map corresponds, at each point, to the probability that the point is the target. Therefore, the point with the largest value in the response map is selected as the estimated target position, and the motion model is updated once the position of the target in the new frame has been obtained. During online tracking, only the filter is updated over time. The optimization problem of the target can be expressed in incremental form, as shown in formula (12):

\varepsilon = \sum_{p=1}^{t} \beta_{p} \left( \Big\| \sum_{l} \hat{w}^{l*} \odot \hat{\psi}^{l}(x_{p}) - \hat{y} \Big\|^{2} + \lambda \sum_{l} \big\| \hat{w}^{l} \big\|^{2} \right)    (12)

where ε is the output loss, p indexes the p-th sample, t is the current sample, β_t is the influence factor of the current sample, \hat{w}^{l} is the correlation filter, and \hat{\psi}^{l}(x_{p}) is the fused feature of the p-th sample.

The parameter β_t > 0 is the influence factor of the sample x_t; at the same time, the closed-form solution of the equation can also be extended to the time series, as shown in formula (13):

\hat{w}_{t}^{l} = \frac{ \sum_{p=1}^{t} \beta_{p}\, \hat{y}^{*} \odot \hat{\psi}^{l}(x_{p}) }{ \sum_{p=1}^{t} \beta_{p} \left( \sum_{k} \hat{\psi}^{k}(x_{p})^{*} \odot \hat{\psi}^{k}(x_{p}) + \lambda \right) }    (13)

where \hat{w}_{t}^{l} is the correlation filter at the current sample t, \hat{\psi}^{l}(x_{p}) is the feature of the sample x_p, \hat{\psi}^{k}(x_{p}) is the feature of the sample x_p for the k-th filter channel, and \hat{\psi}^{k}(x_{p})^{*} is its complex conjugate.

The advantage of this incremental update is that we do not need to keep a large set of samples, so only very little storage space is required.
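A hedged sketch of what this incremental update can look like in practice: instead of storing past samples, running numerator and denominator accumulators of formula (13) are kept in the Fourier domain and the filter is re-derived from them each frame. The class layout and variable names are assumptions.

```python
import numpy as np

class IncrementalDCF:
    def __init__(self, lam: float = 1e-4):
        self.lam = lam
        self.num = None   # running sum of beta_p * conj(y_hat) * Psi(x_p), shape (C, H, W)
        self.den = None   # running sum of beta_p * (sum_k |Psi^k(x_p)|^2 + lam), shape (H, W)

    def add_sample(self, psi_x: np.ndarray, y_hat: np.ndarray, beta: float) -> None:
        """psi_x: fused features (C, H, W); y_hat: DFT of the Gaussian label (H, W)."""
        X = np.fft.fft2(psi_x, axes=(-2, -1))
        num_p = beta * np.conj(y_hat)[None] * X
        den_p = beta * ((np.conj(X) * X).real.sum(axis=0) + self.lam)
        self.num = num_p if self.num is None else self.num + num_p
        self.den = den_p if self.den is None else self.den + den_p

    def filter(self) -> np.ndarray:
        """Current filter, formula (13)."""
        return self.num / self.den[None]
```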

2. Target tracking method based on apparent features and depth features

A target tracking method based on apparent features and depth features according to an embodiment of the present invention includes the following steps:

Step S10, obtaining the region of the target to be tracked in the t-th frame image according to the target position of frame t-1 and the preset target size, and taking this region as the target region; and obtaining a region N times the size of the target region centered on the target position of frame t-1, taking it as the search region.

In this embodiment, in the first frame of the video, the target position and target size are given by a rectangular box. For the other frames, the position region of the target in the current frame image is obtained according to the target position of the previous frame and the preset target size and taken as the target region, and a region 2 times the size of the target region, centered on the target region of the previous frame, is obtained as the search region. This increases the robustness of the tracking performance and prevents the target from being disturbed by the background.

The target size in this embodiment is set according to the actual application.
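A small sketch of the region selection in step S10, under assumed conventions: the target is described by its center (cx, cy) and a preset size (w, h), and the search region is N times the target region (N = 2 in this embodiment), centered on the previous frame's target position.

```python
import numpy as np

def crop(frame: np.ndarray, cx: float, cy: float, w: float, h: float) -> np.ndarray:
    """Crop a w-by-h window centered at (cx, cy), clipped to the frame borders."""
    x0, y0 = int(round(cx - w / 2)), int(round(cy - h / 2))
    x1, y1 = int(round(cx + w / 2)), int(round(cy + h / 2))
    return frame[max(0, y0):min(frame.shape[0], y1), max(0, x0):min(frame.shape[1], x1)]

def get_regions(frame, prev_center, target_size, n=2.0):
    (cx, cy), (w, h) = prev_center, target_size
    target_region = crop(frame, cx, cy, w, h)
    search_region = crop(frame, cx, cy, n * w, n * h)
    return target_region, search_region

frame = np.zeros((480, 640), dtype=np.uint8)
target, search = get_regions(frame, prev_center=(320, 240), target_size=(60, 40))
print(target.shape, search.shape)  # (40, 60) (80, 120)
```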

Step S20, extracting the apparent features and depth features corresponding to the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively.

In this embodiment, the target region and the search region obtained in step S10 are respectively input into the apparent feature extraction network and the depth feature extraction network to obtain their corresponding apparent features and depth features.

When extracting the depth features of the first frame of the video, the whole image is first input into the depth feature extraction network to extract the depth information of the whole image, and then the depth information is cropped at the target position, which is the depth feature we need.

The apparent features contain 32 dimensions in total, with size 125*125*32, and the size of the depth feature is limited to 125*125*1.

Step S30, performing a weighted average of the apparent features and depth features corresponding to the target region and to the search region, respectively, based on preset weights, to obtain the fused feature of the target region and the fused feature of the search region.

In this embodiment, since the dimensions of the extracted apparent features and depth features do not match (the apparent features are 32-dimensional while the depth features are 1-dimensional), in order to prevent the influence of the depth features on the algorithm from being weakened by the large difference in dimensionality during feature fusion, we introduce a weight coefficient to fuse the features, obtaining the fused features of the target region and of the search region respectively.

Step S40, obtaining the response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region; taking the position corresponding to the peak of the response map as the target position of frame t.

In this embodiment, based on the fused features of the target region and the search region, a response map is obtained through the trained correlation filter, and the position of the target to be tracked in the current video frame is obtained from the maximum value of the response map.

After the target position of the current frame has been obtained, the correlation filter is updated based on the state value of the filter in the previous frame and the preset learning rate.

Steps S10-S40 can be understood with reference to Fig. 4, in which the target region image (Current image) and the search area image (Search area image) are passed through the apparent feature extraction network (ANET) and the depth feature extraction network (DNET) respectively to obtain apparent features and depth features; a weighted average based on the preset weight α gives the fused features of the Current image and the Search area image; a response map (Response map) is obtained based on the correlation filter (DCF); and the position of the target is obtained from the maximum value of the response map.
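Tying the previous sketches together, one tracking iteration (steps S10-S40) might look like the following. `anet` and `dnet` stand in for the apparent and depth feature extraction networks (assumed to resize their inputs to a common 125*125 feature map), and the helper functions (get_regions, fuse_features, solve_filter, response, update_filter) follow the earlier sketches rather than the patent's actual code.

```python
import numpy as np

def track_frame(frame, prev_center, target_size, anet, dnet, w_hat, y_label, alpha=0.8, eta=0.01):
    # Step S10: target region and 2x search region around the previous position
    target_region, search_region = get_regions(frame, prev_center, target_size, n=2.0)
    # Steps S20-S30: extract and fuse apparent/depth features (shared 125x125 spatial size assumed)
    psi_x = fuse_features(anet(target_region), dnet(target_region), alpha)
    psi_z = fuse_features(anet(search_region), dnet(search_region), alpha)
    # Step S40: correlation response over the search region; the peak gives the new position
    g = response(w_hat, psi_z)
    peak_r, peak_c = np.unravel_index(np.argmax(g), g.shape)
    # offset in feature-map coordinates; mapping back to image pixels is omitted here
    new_center = (prev_center[0] + (peak_c - g.shape[1] // 2),
                  prev_center[1] + (peak_r - g.shape[0] // 2))
    # filter update with learning rate eta, formula (11)
    w_hat = update_filter(w_hat, solve_filter(psi_x, y_label), eta)
    return new_center, w_hat
```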

A target tracking system based on apparent features and depth features according to the second embodiment of the present invention, as shown in Fig. 2, includes: a region acquisition module 100, a feature extraction module 200, a feature fusion module 300, and an output position module 400;

the region acquisition module 100 is configured to obtain the region of the target to be tracked in the t-th frame image according to the target position of frame t-1 and the preset target size, and to take this region as the target region; and to obtain a region N times the size of the target region centered on the target position of frame t-1, taking it as the search region;

the feature extraction module 200 is configured to extract the apparent features and depth features corresponding to the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively;

the feature fusion module 300 is configured to perform a weighted average of the apparent features and depth features corresponding to the target region and to the search region, respectively, based on preset weights, to obtain the fused feature of the target region and the fused feature of the search region;

the output position module 400 is configured to obtain the response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region, and to take the position corresponding to the peak of the response map as the target position of frame t;

wherein,

the apparent feature extraction network is constructed based on a convolutional neural network and is used to obtain the corresponding apparent features of an input image;

the depth feature extraction network is constructed based on a ResNet network and is used to obtain the corresponding depth features of an input image.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and will not be repeated here.

It should be noted that the target tracking system based on apparent features and depth features provided in the above embodiment is only illustrated by the division of the above functional modules. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules of the above embodiment may be merged into one module, or further split into multiple sub-modules, so as to implement all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only used to distinguish the modules or steps, and are not to be regarded as an improper limitation of the present invention.

A storage device according to the third embodiment of the present invention stores a plurality of programs, and the programs are adapted to be loaded by a processor to implement the above-mentioned target tracking method based on apparent features and depth features.

A processing device according to the fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above-mentioned target tracking method based on apparent features and depth features.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and will not be repeated here.

本领域技术人员应该能够意识到,结合本文中所公开的实施例描述的各示例的模块、方法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,软件模块、方法步骤对应的程序可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。为了清楚地说明电子硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以电子硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。Those skilled in the art should be aware that the modules and method steps of each example described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software or a combination of the two, and the programs corresponding to the software modules and method steps Can be placed in random access memory (RAM), internal memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or as known in the art in any other form of storage medium. In order to clearly illustrate the interchangeability of electronic hardware and software, the components and steps of each example have been described generally in terms of functionality in the foregoing description. Whether these functions are performed in electronic hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods of implementing the described functionality for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The terms "first", "second", and the like are used to distinguish similar objects and are not used to describe or indicate a particular order or sequence.

The term "comprising" or any other similar term is intended to cover a non-exclusive inclusion, so that a process, method, article, or device/apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device/apparatus.

The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the accompanying drawings. However, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions resulting from such changes or substitutions will fall within the protection scope of the present invention.

Claims (10)

1. A method for object tracking based on appearance features and depth features, the method comprising:
step S10, acquiring the region of the target to be tracked in the t-th frame image according to the target position of the (t-1)-th frame and the preset target size, and taking this region as the target region; acquiring a region with a size N times that of the target region based on the central point of the target position of the (t-1)-th frame, and taking this region as the search region;
step S20, respectively extracting the apparent features and the depth features corresponding to the target region and the search region through an apparent feature extraction network and a depth feature extraction network;
step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain fusion features of the target region and the search region;
step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fusion features of the target region and the fusion features of the search region; and taking the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein,
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
2. The method for tracking the target according to claim 1, wherein in step S10, "acquiring the region of the target to be tracked in the image of the t-th frame according to the target position of the t-1 frame and the preset target size, and using the region as the target region", the method comprises: if t is equal to 1, acquiring a target area of a target to be tracked according to a preset target position and a preset target size; and if t is larger than 1, acquiring a target area of the target to be tracked according to the target position of the t-1 frame and the preset target size.
3. The method for tracking an object based on the appearance features and the depth features according to claim 1, wherein the appearance feature extraction network has the following structure: the network comprises two convolutional layers and a correlation filter layer, wherein a maximum pooling layer and a ReLU activation function are connected behind each convolutional layer; the network is trained by a back-propagation algorithm during training.
4. The method for tracking the target based on the appearance feature and the depth feature of claim 1, wherein the depth feature extraction network has a structure that: 5 convolutional layers, 5 deconvolution layers; the depth feature extraction network is trained through mutual reconstruction of binocular images in the training process.
5. The target tracking method based on the appearance features and the depth features according to claim 2, wherein in the process of extracting the depth features, if t is equal to 1, the extraction method of the depth feature extraction network is as follows:
acquiring the depth feature of the first frame image based on a depth feature extraction network;
and acquiring the depth features of the target area and the search area based on the depth feature of the first frame image and the preset target position.
6. The object tracking method based on the appearance feature and the depth feature of claim 1, wherein, when filtering the target region, the correlation filter obtains different scales by a scale transformation method, enlarges or reduces the target region according to the different scales, and then filters the scaled target region; the scale transformation method is as follows:
S = { a^s | s = ⌊-(S-1)/2⌋, ⌊-(S-3)/2⌋, ..., ⌊(S-1)/2⌋ }
wherein a is the scale coefficient, S on the left-hand side denotes the scale pool, S inside the floor brackets is the preset scale degree (the number of scales), and a^s is a scale.
7. The method for tracking an object based on an apparent feature and a depth feature according to any one of claims 1 to 6, further comprising updating the state values of the correlation filters after step S40 by:
acquiring the state value of a correlation filter in a t-1 frame;
and updating the state value of the t frame of the relevant filter based on the state value, the target position of the t frame and a preset learning rate.
8. A target tracking system based on appearance features and depth features is characterized by comprising an acquisition region module, a feature extraction module, a feature fusion module and an output position module;
the acquisition region module is configured to acquire the region of the target to be tracked in the t-th frame image according to the target position of the (t-1)-th frame and the preset target size, and take this region as the target region; and to acquire a region with a size N times that of the target region based on the central point of the target position of the (t-1)-th frame, and take this region as the search region;
the feature extraction module is configured to extract the apparent features and the depth features corresponding to the target area and the search area respectively through an apparent feature extraction network and a depth feature extraction network;
the feature fusion module is configured to perform weighted averaging, based on preset weights, on the apparent features and the depth features corresponding to the target region and the search region respectively, to obtain a fusion feature of the target region and a fusion feature of the search region;
the output position module is configured to obtain a response map of the target to be tracked through a correlation filter according to the fusion feature of the target region and the fusion feature of the search region, and to take the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein,
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
9. A storage device having stored therein a plurality of programs, wherein said programs are adapted to be loaded and executed by a processor to implement the apparent feature and depth feature based object tracking method of any one of claims 1-7.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the apparent and depth feature based object tracking method of any of claims 1-7.
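The sketches below are editorial illustrations only and form no part of the claims or the specification; they outline one possible Python realisation of the claimed steps. All helper names (crop, extract_appearance, extract_depth, corr_filter) and all numeric values are assumptions. This first sketch follows the per-frame flow of steps S10-S40 recited in claim 1: crop the target and search regions, fuse the apparent and depth features by weighted average, and take the peak of the correlation response as the new target position.

    import numpy as np

    def crop(frame, center, size):
        # Crop a (h, w) patch centred at `center`, clipped to the frame border.
        h, w = size
        y0 = max(int(center[0]) - h // 2, 0)
        x0 = max(int(center[1]) - w // 2, 0)
        return frame[y0:y0 + h, x0:x0 + w]

    def track_frame(frame, prev_pos, target_size, n_times, weights,
                    extract_appearance, extract_depth, corr_filter):
        # Step S10: target region at the (t-1)-th position, plus a search
        # region N times that size around the same centre point.
        target_region = crop(frame, prev_pos, target_size)
        search_size = (target_size[0] * n_times, target_size[1] * n_times)
        search_region = crop(frame, prev_pos, search_size)

        # Steps S20 and S30: extract both feature types for each region and
        # fuse them by weighted average with the preset weights.
        fused = {}
        for key, region in (("target", target_region), ("search", search_region)):
            fused[key] = (weights[0] * extract_appearance(region)
                          + weights[1] * extract_depth(region))

        # Step S40: correlation-filter response map; its peak gives the
        # target position in the t-th frame.
        response = corr_filter(fused["target"], fused["search"])
        dy, dx = np.unravel_index(np.argmax(response), response.shape)
        return (prev_pos[0] + dy - response.shape[0] // 2,
                prev_pos[1] + dx - response.shape[1] // 2)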
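For claim 3, a minimal PyTorch sketch of the appearance feature extraction network: two convolutional layers, each followed by max pooling and a ReLU activation. Channel counts and kernel sizes are not given in the claim and are assumed here; the correlation filter layer is applied outside this module.

    import torch.nn as nn

    class AppearanceFeatureNet(nn.Module):
        # Two convolutional layers, each followed by max pooling and ReLU,
        # as recited in claim 3; widths and kernel sizes are assumptions.
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1),
                nn.MaxPool2d(2), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.MaxPool2d(2), nn.ReLU(inplace=True),
            )

        def forward(self, x):
            # The correlation filter layer of the claim would consume this
            # feature map; training uses ordinary back-propagation.
            return self.features(x)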
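For claim 4, a sketch of an encoder-decoder with 5 convolutional and 5 deconvolutional layers. Channel widths, strides, and kernel sizes are assumptions, and the binocular mutual-reconstruction training loss is not shown.

    import torch.nn as nn

    class DepthFeatureNet(nn.Module):
        # Five strided convolutions followed by five transposed convolutions,
        # mirroring the 5 conv + 5 deconv layout of claim 4 (sizes assumed).
        def __init__(self):
            super().__init__()
            chs = [3, 32, 64, 128, 256, 256]
            enc, dec = [], []
            for cin, cout in zip(chs[:-1], chs[1:]):
                enc += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                        nn.ReLU(inplace=True)]
            rev = chs[::-1]
            for cin, cout in zip(rev[:-1], rev[1:]):
                dec += [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                        nn.ReLU(inplace=True)]
            self.encoder = nn.Sequential(*enc)
            self.decoder = nn.Sequential(*dec)

        def forward(self, x):
            # Training would reconstruct one binocular view from the other;
            # at tracking time the encoder output serves as the depth feature.
            code = self.encoder(x)
            return code, self.decoder(code)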
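For claim 5 (the t = 1 case), a sketch of running the depth network once on the whole first frame and then cropping the target-region and search-region features at the preset target position; depth_net and crop_feature are hypothetical callables.

    def first_frame_depth_features(frame, preset_pos, target_size, n_times,
                                   depth_net, crop_feature):
        # Run the depth feature extraction network on the whole first frame
        # once, then crop target- and search-region features at the preset
        # position (crop_feature works in feature-map coordinates).
        depth_map = depth_net(frame)
        target_feat = crop_feature(depth_map, preset_pos, target_size)
        search_size = (target_size[0] * n_times, target_size[1] * n_times)
        search_feat = crop_feature(depth_map, preset_pos, search_size)
        return target_feat, search_feat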
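For claim 6, a sketch of generating the scale pool {a^s}; the symmetric exponent range is an assumed reading of the formula given above.

    import numpy as np

    def scale_pool(a, S):
        # Scales a**s for s = floor(-(S-1)/2), ..., floor((S-1)/2): S candidate
        # scales roughly symmetric around 1 (interpretation assumed).
        exponents = np.floor(np.arange(S) - (S - 1) / 2)
        return a ** exponents

    # Example: scale coefficient a = 1.02 and preset scale degree S = 7.
    print(scale_pool(1.02, 7))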
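For claim 7, a sketch of the running update of the correlation filter state with a preset learning rate; the linear-interpolation form is an assumption, since the claim only recites that the t-th frame state is obtained from the previous state, the t-th frame target position, and the learning rate.

    def update_filter_state(prev_state, new_state, learning_rate):
        # Blend the (t-1)-th frame filter state with the state estimated from
        # the t-th frame; learning_rate controls the update speed.
        return (1.0 - learning_rate) * prev_state + learning_rate * new_state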
CN201910884524.4A 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature Active CN110706253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910884524.4A CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Publications (2)

Publication Number Publication Date
CN110706253A true CN110706253A (en) 2020-01-17
CN110706253B CN110706253B (en) 2022-03-08

Family

ID=69194485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910884524.4A Active CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Country Status (1)

Country Link
CN (1) CN110706253B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110019920A1 (en) * 2008-04-04 2011-01-27 Yi Hu Method, apparatus, and program for detecting object
CN101794385A (en) * 2010-03-23 2010-08-04 上海交通大学 Multi-angle multi-target fast human face tracking method used in video sequence
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN106780542A (en) * 2016-12-29 2017-05-31 北京理工大学 A kind of machine fish tracking of the Camshift based on embedded Kalman filter
CN107862680A (en) * 2017-10-31 2018-03-30 西安电子科技大学 A kind of target following optimization method based on correlation filter
CN108596951A (en) * 2018-03-30 2018-09-28 西安电子科技大学 A kind of method for tracking target of fusion feature
CN109308713A (en) * 2018-08-02 2019-02-05 哈尔滨工程大学 An Improved Kernel Correlation Filtering Method for Underwater Target Tracking Based on Forward-Looking Sonar
CN109344725A (en) * 2018-09-04 2019-02-15 上海交通大学 A Multi-Pedestrian Online Tracking Method Based on Spatio-temporal Attention Mechanism
CN109858326A (en) * 2018-12-11 2019-06-07 中国科学院自动化研究所 Based on classification semantic Weakly supervised online visual tracking method and system
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A moving target tracking method based on adaptive fusion of multi-layer convolutional features

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767450A (en) * 2021-01-25 2021-05-07 开放智能机器(上海)有限公司 Multi-loss learning-based related filtering target tracking method and system
CN113592899A (en) * 2021-05-28 2021-11-02 北京理工大学重庆创新中心 Method for extracting correlated filtering target tracking depth features
CN113327273A (en) * 2021-06-15 2021-08-31 中国人民解放军火箭军工程大学 Infrared target tracking method based on variable window function correlation filtering
CN113327273B (en) * 2021-06-15 2023-12-19 中国人民解放军火箭军工程大学 Infrared target tracking method based on variable window function correlation filtering
CN114972437A (en) * 2022-06-16 2022-08-30 西安电子科技大学 A multi-feature fusion target tracking and localization method based on response peak
CN114972437B (en) * 2022-06-16 2024-11-26 西安电子科技大学 A multi-feature fusion target tracking and positioning method based on response peak

Also Published As

Publication number Publication date
CN110706253B (en) 2022-03-08

Similar Documents

Publication Publication Date Title
CN108053419B (en) Multi-scale target tracking method based on background suppression and foreground anti-interference
CN109285179B (en) A moving target tracking method based on multi-feature fusion
CN110706253B (en) Target tracking method, system and device based on apparent feature and depth feature
CN108734723B (en) Relevant filtering target tracking method based on adaptive weight joint learning
CN103971386B (en) A kind of foreground detection method under dynamic background scene
CN107424177B (en) Positioning correction long-range tracking method based on continuous correlation filter
CN112750140A (en) Disguised target image segmentation method based on information mining
CN111008991B (en) A Background-aware Correlation Filtering Target Tracking Method
CN108986140A (en) Target scale adaptive tracking method based on correlation filtering and color detection
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN107369166A (en) A kind of method for tracking target and system based on multiresolution neutral net
CN111612817A (en) Target Tracking Method Based on Adaptive Fusion of Deep and Shallow Features and Context Information
CN109166139B (en) A Scale-Adaptive Object Tracking Method Combined with Fast Background Suppression
CN110866872B (en) Pavement crack image preprocessing intelligent selection method and device and electronic equipment
CN110310305B (en) A target tracking method and device based on BSSD detection and Kalman filtering
CN109685830B (en) Target tracking method, device and equipment and computer storage medium
CN111429485B (en) Cross-modal filter tracking method based on adaptive regularization and high confidence update
CN109859241A (en) Adaptive features select and time consistency robust correlation filtering visual tracking method
Yang et al. Visual tracking with long-short term based correlation filter
CN114519853B (en) Three-dimensional target detection method and system based on multi-mode fusion
CN110070562A (en) A kind of context-sensitive depth targets tracking
CN111723814A (en) Weakly supervised image semantic segmentation method, system and device based on cross-image association
CN104077609A (en) Saliency detection method based on conditional random field
CN113033356B (en) A scale-adaptive long-term correlation target tracking method
Teutscher et al. PDC: piecewise depth completion utilizing superpixels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant