CN117576380A

CN117576380A - Target autonomous detection tracking method and system

Info

Publication number: CN117576380A
Application number: CN202410057608.1A
Authority: CN
Inventors: 钟宇; 陈功; 傲厚军; 王宇; 李航; 张俊傲
Original assignee: Chengdu Fluid Power Innovation Center
Current assignee: Chengdu Fluid Power Innovation Center
Priority date: 2024-01-16
Filing date: 2024-01-16
Publication date: 2024-02-20

Abstract

The invention relates to the technical field of target tracking, and discloses a target autonomous detection tracking method and a target autonomous detection tracking system, wherein the method comprises the following steps: s1, target detection: circularly detecting the image until a target is detected; s2, initializing a tracker template: initializing a tracker template; s3, target tracking: tracking the target. The invention solves the problems of failure in tracking the target or tracking drift in the prior art.

Description

A method and system for autonomous target detection and tracking

技术领域Technical field

本发明涉及目标跟踪技术领域，具体是一种目标自主检测跟踪方法及系统。The invention relates to the technical field of target tracking, specifically a method and system for autonomous target detection and tracking.

背景技术Background technique

目标跟踪技术一直是计算机视觉研究领域中非常重要且具有挑战性的任务，其融合了计算机科学、统计学习、模式识别、机器学习、图像处理、等诸多学科。随着计算机视觉和深度学习人工智能的发展，目标跟踪技术主要应用于跟踪目标的状态、精确定位目标的位置，在安防监控、智能交通、人机交互、航空航天、军事侦察等诸多方面都有着广泛的应用。Target tracking technology has always been a very important and challenging task in the field of computer vision research, which integrates computer science, statistical learning, pattern recognition, machine learning, image processing, and many other disciplines. With the development of computer vision and deep learning artificial intelligence, target tracking technology is mainly used to track the status of targets and accurately locate targets. It has applications in security monitoring, intelligent transportation, human-computer interaction, aerospace, military reconnaissance and many other aspects. Wide range of applications.

目标跟踪是计算机视觉中的一个基础问题,跟踪的目的是用来确定我们感兴趣的目标在视频序列中连续的位置,即获取运动目标的参数,如位置、尺寸、速度、加速度以及运动轨迹等,从而进行进一步的处理和分析,实现对运动目标的行为分析与理解,以完成更高级的任务。Target tracking is a basic problem in computer vision. The purpose of tracking is to determine the continuous position of the target we are interested in in the video sequence, that is, to obtain the parameters of the moving target, such as position, size, speed, acceleration, and motion trajectory. , so as to carry out further processing and analysis to achieve behavioral analysis and understanding of moving targets to complete more advanced tasks.

目标跟踪根据跟踪目标的个数可以分为单目标跟踪和多目标跟踪等。单目标跟踪，指代先确定跟踪的目标（例如第一个帧中找出想要跟踪的目标），然后再后续帧中确定目标的位置，这是一个时间+空间联系的问题，所以会有目标消失、目标外观变化、背景干扰、目标移动等的挑战。目标跟踪算法根据建模方式可以分为生成式算法和判别式算法，根据跟踪序列长短可以分为短时跟踪算法和长时跟踪算法。Target tracking can be divided into single target tracking and multiple target tracking according to the number of tracking targets. Single target tracking refers to determining the target to be tracked first (for example, finding the target you want to track in the first frame), and then determining the target's position in subsequent frames. This is a time + space connection problem, so there will be Challenges include target disappearance, target appearance changes, background interference, target movement, etc. Target tracking algorithms can be divided into generative algorithms and discriminative algorithms according to the modeling method. According to the length of the tracking sequence, they can be divided into short-term tracking algorithms and long-term tracking algorithms.

生成式经典的跟踪算法主要通过提取目标特征，建立目标模型，在后续帧中进行模板匹配，逐步迭代实现目标定位。判别式跟踪算法为现阶段的主流跟踪算法，将跟踪问题作为分类问题进行处理，判别式算法框架中大多采用的是相关滤波框架，将目标特征和背景特征分别进行提取建模，分别作为正负样本来训练相应的分类器，然后在后续帧中将候选样本输入到分类器中，得到概率值最大样本即跟踪目标。The generative classic tracking algorithm mainly extracts target features, establishes a target model, performs template matching in subsequent frames, and gradually iterates to achieve target positioning. The discriminative tracking algorithm is the mainstream tracking algorithm at this stage. It handles the tracking problem as a classification problem. Most of the discriminative algorithm frameworks use a correlation filtering framework, which extracts and models target features and background features respectively, and treats them as positive and negative respectively. Samples are used to train the corresponding classifier, and then the candidate samples are input into the classifier in subsequent frames to obtain the sample with the maximum probability value, which is the tracking target.

但由于实际场景复杂多变,视频中目标的遮挡与消失、形态变化、出镜重现、快速移动、运动模糊、尺度变化及光照等因素都会影响跟踪的准确性,使其在实际应用中仍然面临许多挑战，设计个快速的、鲁棒的跟踪算法仍然非常困难,仍然是计算机视觉中最活跃的研究领域之一。However, due to the complexity and change of actual scenes, factors such as occlusion and disappearance of targets in videos, morphological changes, reappearance, rapid movement, motion blur, scale changes and lighting will affect the accuracy of tracking, making it still difficult to solve in practical applications. With many challenges, designing fast and robust tracking algorithms remains very difficult and remains one of the most active research areas in computer vision.

现有的目标跟踪方案中，当目标出现遮挡、形变模糊或光照影响时很容易出现跟踪失败。单目标跟踪算法的生成模型算法如光流法、Meanshift算法、粒子滤波，Camshift跟踪算法等由于没有考虑背景信息，当目标出现遮挡、形变模糊或光照影响时很容易出现跟踪失败，而且生成式跟踪算法效率低下。相比于生成式跟踪算法，判别式跟踪如CSK算法、KCF算法、DSST算法等基于相关滤波的判别式跟踪算法可以更好地得到目标在当前图像中的具体位置，且识别精度高、跟踪速度快，因此被广泛研究。随着深度学习的发展，采用CNN卷积神经网络提取图像特征，直接使用神经网络构建端到端的孪生网络目标跟踪模型。In existing target tracking solutions, tracking failure is prone to occur when the target is blocked, deformed, blurred, or affected by lighting. Generative model algorithms for single target tracking algorithms such as optical flow method, Meanshift algorithm, particle filter, Camshift tracking algorithm, etc. do not consider background information. When the target is blocked, deformed, blurred, or affected by lighting, tracking failure is likely to occur, and generative tracking The algorithm is inefficient. Compared with generative tracking algorithms, discriminative tracking algorithms based on correlation filtering such as CSK algorithm, KCF algorithm, DSST algorithm, etc. can better obtain the specific location of the target in the current image, and have high recognition accuracy and tracking speed. It is fast and therefore widely studied. With the development of deep learning, CNN convolutional neural network is used to extract image features, and the neural network is directly used to build an end-to-end twin network target tracking model.

虽然目标跟踪逐渐发展成熟，但是面对实际场景的时候仍然有很多挑战，尤其是当目标被部分遮挡或者完全遮挡时，以及跟踪运动目标，目标出镜导致跟踪目标的丢失，很容易使得滤波模板学习到遮挡障碍物的特征，导致滤波模板被污染，跟踪目标失败或跟踪漂移。在现有的技术方案中为保证长时间或者稳定的跟踪目标采用重检测方案来约束跟新跟踪器目标从而实现目标的跟踪。Although target tracking has gradually matured, there are still many challenges when facing actual scenes, especially when the target is partially or completely occluded, and when tracking moving targets, the target appears out of the camera, resulting in the loss of the tracking target, which easily makes filter template learning Characteristics of obstructing obstacles may lead to contamination of the filter template, failure to track the target, or tracking drift. In existing technical solutions, in order to ensure long-term or stable tracking of targets, a re-detection scheme is used to constrain the tracking of new tracker targets to achieve target tracking.

现有技术的缺点：Disadvantages of existing technology:

1.现有的相关滤波跟踪器，当目标出现遮挡、姿态变化、运行模糊或目标出境等情况造成模型漂移和目标丢失，无法完成后续跟踪和对目标的重找回；1. Existing correlation filter trackers, when the target is blocked, changes in posture, runs blurred, or the target exits the country, the model drifts and the target is lost, making it impossible to complete subsequent tracking and retrieval of the target;

2.单独的目标跟踪器在初始帧对目标进行初始化时需要手动进行设置，无法自动检测目标完成初始进行跟踪。2. A separate target tracker needs to be set manually when initializing the target in the initial frame, and cannot automatically detect the target to complete the initial tracking.

发明内容Contents of the invention

为克服现有技术的不足，本发明提供了一种目标自主检测跟踪方法及系统，解决现有技术存在的跟踪目标失败或跟踪漂移的问题。In order to overcome the shortcomings of the existing technology, the present invention provides a target autonomous detection and tracking method and system to solve the problems of target tracking failure or tracking drift in the existing technology.

本发明解决上述问题所采用的技术方案是：The technical solution adopted by the present invention to solve the above problems is:

一种目标自主检测跟踪方法，包括以下步骤：A target autonomous detection and tracking method includes the following steps:

S1，目标检测：对图像进行循环检测，直到检测到目标；S1, target detection: cyclically detect the image until the target is detected;

S2，跟踪器模板初始化：进行跟踪器模板的初始化；S2, tracker template initialization: initialize the tracker template;

S3，目标跟踪：对目标进行跟踪。S3, target tracking: track the target.

作为一种优选的技术方案，步骤S1中，获取目标所在区域的定位框信息，输出目标在图像中定位框位置的中心点（x，y）和宽高(w，h)；其中，x表示中心点的X轴坐标，y表示中心点的Y轴坐标，w表示定位框的宽度，h表示定位框的高度。As a preferred technical solution, in step S1, the positioning frame information of the area where the target is located is obtained, and the center point (x, y) and width and height (w, h) of the positioning frame position of the target in the image are output; where x represents The X-axis coordinate of the center point, y represents the Y-axis coordinate of the center point, w represents the width of the positioning box, and h represents the height of the positioning box.

作为一种优选的技术方案，步骤S2包括以下步骤：As a preferred technical solution, step S2 includes the following steps:

S21，利用检测的定位框的（x，y，w，h）参数和设定填充参数来确定目标搜索区域的大小，并根据目标搜索区域的大小创建初始化余弦窗和高斯理想响应标签；S21, use the (x, y, w, h) parameters of the detected positioning frame and set filling parameters to determine the size of the target search area, and create an initialization cosine window and Gaussian ideal response label according to the size of the target search area;

S22，提取目标所在区域的灰度特征，将余弦窗与提取的特征图相乘解决边缘效应；S22, extract the grayscale features of the target area, and multiply the cosine window and the extracted feature map to solve the edge effect;

S23，将得到的灰度特征与余弦窗相乘的结果经过傅里叶变换转换到频域内，与经过傅里叶变换后的高斯理想响应标签相乘，从而完成跟踪器模板的初始化。S23, the result of multiplying the obtained grayscale feature and the cosine window is converted into the frequency domain through Fourier transform, and multiplied by the Gaussian ideal response label after Fourier transform, thereby completing the initialization of the tracker template.

作为一种优选的技术方案，步骤S3包括以下步骤：As a preferred technical solution, step S3 includes the following steps:

S31，得到目标搜索区域的频域特征；S31, obtain the frequency domain characteristics of the target search area;

S32，得到目标在当前帧图像中的定位点坐标和其响应值；S32, obtain the positioning point coordinates of the target in the current frame image and its response value;

S33，在获取当前帧目标位置后，以目标位置为中心提取目标搜索区域的灰度特征，对跟踪器模板进行加权平均后更新模板，作为下一帧的跟踪滤波模型。S33. After obtaining the target position of the current frame, extract the grayscale features of the target search area with the target position as the center, perform a weighted average on the tracker template and then update the template as the tracking filter model of the next frame.

作为一种优选的技术方案，步骤S31中，实时读取当前帧的图像数据，以上一帧图像跟踪预测的目标中心点为中心，截取初始化阶段时目标搜索区域大小的图像，并提取目标区域的灰度特征，经过与余弦窗相乘后进行傅里叶变换得到目标搜索区域的频域特征。As a preferred technical solution, in step S31, the image data of the current frame is read in real time, and the target center point predicted by tracking and prediction of the previous frame image is taken as the center, an image of the size of the target search area in the initialization stage is intercepted, and the target area is extracted. The grayscale features are multiplied by the cosine window and then Fourier transformed to obtain the frequency domain features of the target search area.

作为一种优选的技术方案，步骤S32中，与跟踪滤波器模型进行匹配计算，将计算结果通过逆傅里叶变换后获得搜索区域的响应图，响应图中最大峰值的坐标为目标的预测位置的中心坐标，将该坐标换算至原图，得到目标在当前帧图像中的定位点坐标和其响应值。As a preferred technical solution, in step S32, a matching calculation is performed with the tracking filter model, and the calculation result is subjected to inverse Fourier transformation to obtain a response map of the search area. The coordinates of the maximum peak in the response map are the predicted positions of the target. The center coordinates of the target are converted to the original image to obtain the positioning point coordinates of the target in the current frame image and its response value.

S34，利用跟踪质量评估模块对每一次跟踪预测的结果进行跟踪质量评估，根据评估的结果判断是否启动目标检测模块进行全局的重检测：若判定为需要重检测则启动目标检测模块进行全幅图像范围的目标检测，选取检测目标输出置信度高于设定阈值的目标作为找回的跟踪目标，并输出目标的定位框信息，对跟踪器模型进行重新初始化；否则继续跟踪器进行跟踪。S34, use the tracking quality evaluation module to evaluate the tracking quality of each tracking prediction result, and determine whether to start the target detection module for global re-detection based on the evaluation results: if it is determined that re-detection is needed, start the target detection module to perform the full image range. For target detection, select the target whose detection target output confidence is higher than the set threshold as the retrieved tracking target, and output the target's positioning box information to reinitialize the tracker model; otherwise, continue tracking with the tracker.

作为一种优选的技术方案，步骤S34中，跟踪质量评估模块评估策略为：利用跟踪预测输出的响应值和平均峰值相关能能量来进行综合评估判断。As a preferred technical solution, in step S34, the tracking quality assessment module assessment strategy is to use the response value and the average peak correlation energy of the tracking prediction output to make a comprehensive assessment and judgment.

作为一种优选的技术方案，步骤S34中，平均峰值相关能能量的计算方式为：As a preferred technical solution, in step S34, the average peak correlation energy is calculated as:

其中，表示平均峰值相关能能量，/>表示跟踪预测结果响应图中最大值，/>表示跟踪预测结果响应图中最小值，/>表示响应图中（w,h）位置上的响应值，/>表示求平均函数。in, Represents the average peak correlation energy energy,/> Represents the maximum value in the response graph of the tracking prediction result,/> Represents the minimum value in the response graph of the tracking prediction result,/> Represents the response value at the (w, h) position in the response graph,/> Represents the averaging function.

一种目标自主检测跟踪系统，用于实现所述的一种目标自主检测跟踪方法，包括依次连接的以下模块：A target autonomous detection and tracking system, used to implement the above-mentioned target autonomous detection and tracking method, including the following modules connected in sequence:

目标检测模块：用以，对图像进行循环检测，直到检测到目标；Target detection module: used to perform cyclic detection on images until the target is detected;

跟踪器模板初始化模块：用以，进行跟踪器模板的初始化；Tracker template initialization module: used to initialize the tracker template;

目标跟踪模块：用以，对目标进行跟踪。Target tracking module: used to track targets.

本发明相比于现有技术，具有以下有益效果：Compared with the prior art, the present invention has the following beneficial effects:

（1）本发明通过检测模块可自动的对目标进行检测，并完成对跟踪器模版的初始化，自动开始对目标进行跟踪；(1) The present invention can automatically detect the target through the detection module, complete the initialization of the tracker template, and automatically start tracking the target;

（2）本发明的重检测机制在目标出现跟踪漂移以及丢失情况时，可进行全局检测找回目标，保证了跟踪的持续性和鲁棒性；(2) The re-detection mechanism of the present invention can perform global detection to retrieve the target when the target drifts or is lost, ensuring the continuity and robustness of tracking;

（3）本发明结合了相互滤波的跟踪快速性和基于深度学习检测算法的准确性，可以时跟踪算法进行持续跟踪兼容了较好的速度和精度。(3) The present invention combines the tracking speed of mutual filtering and the accuracy of the deep learning detection algorithm, and can perform continuous tracking with a real-time tracking algorithm and is compatible with better speed and accuracy.

附图说明Description of the drawings

图1为本发明所述的一种目标自主检测跟踪方法的流程图；Figure 1 is a flow chart of a target autonomous detection and tracking method according to the present invention;

图2为Yolov3的网络结构示意图；Figure 2 is a schematic diagram of the network structure of Yolov3;

图3为KCF单目标跟踪算法的流程图。Figure 3 is the flow chart of the KCF single target tracking algorithm.

具体实施方式Detailed ways

下面结合实施例及附图，对本发明作进一步的详细说明，但本发明的实施方式不限于此。The present invention will be further described in detail below with reference to the examples and drawings, but the implementation of the present invention is not limited thereto.

实施例1Example 1

如图1至图3所示，一种目标自主检测跟踪方法，针对基于KCF的相关滤波跟踪算法，无法自动确定目标进行跟踪，以及在跟踪过程中目标快速移动、被遮挡、超出视野范围等原因导致目标丢失，无法有效对目标进行连续性跟踪问题。开始获取视频序列，通过目标检测模块在全幅图像范围进行循环检测，直到检测算法检测到目标，As shown in Figures 1 to 3, a target autonomous detection and tracking method is aimed at the KCF-based correlation filter tracking algorithm, which cannot automatically determine the target for tracking, and the target moves quickly, is occluded, and exceeds the field of view during the tracking process. As a result, the target is lost and the target cannot be effectively tracked continuously. Start to acquire the video sequence, and use the target detection module to perform cyclic detection in the full image range until the detection algorithm detects the target.

获取目标区域矩形框信息，输出目标在图像中定位框位置的中心点（x，y）和宽高(w，h)。对跟踪器模版的初始化首先利用检测的定位框的（x，y，w，h）参数来和设定padding（填充）参数来确定目标搜索区域的大小，并根据目标搜索区域的大小创建初始化余弦窗和高斯理想响应标签；然后提取目标区域的Gray灰度特征，将余弦窗与提取的特征图相乘解决边缘效应；最后将得到的二维数据（灰度特征与余弦窗相乘的结果）经过傅里叶变换转换到频域内，与经过傅里叶变换后的高斯理想响应标签相乘，从而完成跟踪器模板的初始化。程序运行时，通过判断视频是否结束，若结束则直接结束整个程序，若未结束则进入目标跟踪。Obtain the rectangular frame information of the target area and output the center point (x, y) and width and height (w, h) of the target positioning frame in the image. To initialize the tracker template, first use the (x, y, w, h) parameters of the detected positioning frame and set the padding parameters to determine the size of the target search area, and create an initialization cosine based on the size of the target search area. window and Gaussian ideal response label; then extract the Gray grayscale features of the target area, and multiply the cosine window and the extracted feature map to solve the edge effect; finally, the obtained two-dimensional data (the result of multiplying the grayscale features and the cosine window) After Fourier transform, it is converted into the frequency domain and multiplied with the Gaussian ideal response label after Fourier transform to complete the initialization of the tracker template. When the program is running, it determines whether the video ends. If it ends, it will directly end the entire program. If it does not end, it will enter the target tracking.

跟踪阶段，实时读取当前帧的图像数据，以上一帧图像跟踪预测的目标中心点为中心，截取初始化阶段时目标搜索区域大小的图像，并提取目标区域的gray灰度特征，经过与余弦窗相乘后进行傅里叶变换得到目标搜索区域的频域特征，然后与跟踪滤波器模型进行匹配计算，将计算结果通过逆傅里叶变换后获得搜索区域的响应图，响应图中最大峰值的坐标为目标的预测位置的中心坐标，将该坐标换算至原图，即得到目标在当前帧图像中的定位点坐标和其响应值。在获取当前帧目标位置后，以目标位置为中心提取目标搜索区域的灰度特征，对跟踪器模板进行加权平均后更新模板，作为下一帧的跟踪滤波模型。In the tracking phase, the image data of the current frame is read in real time, and the target center point predicted by the previous frame image tracking is taken as the center. The image of the size of the target search area in the initialization phase is intercepted, and the gray grayscale features of the target area are extracted, and the cosine window is used After multiplication, Fourier transform is performed to obtain the frequency domain characteristics of the target search area, and then the matching calculation is performed with the tracking filter model. The calculation result is subjected to the inverse Fourier transform to obtain the response map of the search area. The maximum peak value in the response map is The coordinates are the center coordinates of the predicted position of the target. By converting the coordinates to the original image, the coordinates of the target's positioning point in the current frame image and its response value are obtained. After obtaining the target position of the current frame, extract the grayscale features of the target search area with the target position as the center, perform a weighted average on the tracker template and then update the template as the tracking filter model for the next frame.

同时在跟踪阶段，利用跟踪质量评估模块对每一次跟踪预测的结果进行跟踪质量评估，根据评估的结果判断是否启动目标检测模块进行全局的重检测，若判定为需要重检测则启动目标检测模块进行全幅图像范围的目标检测，选取检测目标输出置信度高于阈值的目标作为找回的跟踪目标，并输出目标的定位矩形框信息，对跟踪器模型进行重新初始化；否则继续跟踪器进行跟踪。跟踪质量评估模块评估策略为利用跟踪预测输出的响应值和平均峰值相关能能量（Average Peak-to-Correlation Energy，APCE）来进行综合评估判断。APCE为平均峰值相关能能量，即当目标受到干扰遮挡时，其输出的响应图会呈现多峰震荡剧烈的状态，平均峰值能量明显减少，APCE的计算方式如下：At the same time, in the tracking stage, the tracking quality assessment module is used to evaluate the tracking quality of each tracking prediction result. Based on the evaluation results, it is judged whether to start the target detection module for global re-detection. If it is determined that re-detection is needed, the target detection module is started. For target detection in the full image range, the target whose detection target output confidence is higher than the threshold is selected as the retrieved tracking target, and the positioning rectangular box information of the target is output, and the tracker model is reinitialized; otherwise, the tracker continues to track. The evaluation strategy of the tracking quality evaluation module is to use the response value of the tracking prediction output and the average peak-to-correlation energy (APCE) to conduct a comprehensive evaluation and judgment. APCE is the average peak correlation energy. That is, when the target is blocked by interference, the output response graph will show a state of multi-peak oscillation and the average peak energy will be significantly reduced. The calculation method of APCE is as follows:

其中和/>表示跟踪预测结果响应图中最大值和最小值，/>表示响应图中（w,h）位置上的响应值。in and/> Represents the maximum and minimum values in the response graph of the tracking prediction results,/> Represents the response value at the (w, h) position in the response graph.

具体的评估策略为：在实时的跟踪过程中连续记录当前帧往前历史6帧的跟踪目标的响应值和APCE值。当跟踪预测的目标响应值的绝对值小于某特定阈值门限，则启动检测模块进行重检测；当跟踪预测当前帧的APCE值小于某特定阈值门限，则启动检测模块进行重检测；当记录的APCE值最新三帧的平均值小于前3帧平均值某特定比值门限，则启动检测模块进行重检测；当响应值在记录的历史连续6次下降且当前帧响应值小于历史前6帧某特定比值阈值且记录APCE连续6次处于下降趋势，则启动检测模块进行重检测。The specific evaluation strategy is: during the real-time tracking process, continuously record the response value and APCE value of the tracking target from the current frame to the previous 6 frames. When the absolute value of the target response value predicted by tracking is less than a certain threshold, the detection module is started for re-detection; when the APCE value of the current frame predicted by tracking is less than a certain threshold, the detection module is started for re-detection; when the recorded APCE If the average value of the latest three frames is less than a certain ratio threshold of the average value of the previous three frames, the detection module will be started for re-detection; when the response value drops for 6 consecutive times in the recorded history and the current frame response value is less than a certain ratio of the previous six frames in history. If the threshold is exceeded and the recorded APCE is in a downward trend for 6 consecutive times, the detection module will be started for re-detection.

Yolov3网络主要由三部分组成：主干网络(backbone)、加强特征提取层(Neck)、网络预测头(Head)。主干网络：为带有大残差块堆叠而成的Darknet-53特征提取网络，网络中没有使用池化层而是利用步长为2的卷积代替下采样。Darknet-53网络可以获得32倍、16倍和8倍的下采样的特征图，以获得不同的感受野。加强特征提取层：从主干特征网络中引出不同尺度的特征图，通过自下而上的路径和横向连接来进行多尺度特征融合，这样既可以利用顶层的高语义信息又可以利用底层的高分辨率信息。网络预测头(Head):利用不同尺度的特征层多尺度输出以适应检测小尺度、中尺度、大尺度的目标。Yolov3 network mainly consists of three parts: backbone network (backbone), enhanced feature extraction layer (Neck), and network prediction head (Head). Backbone network: It is a Darknet-53 feature extraction network stacked with large residual blocks. Instead of using a pooling layer, the network uses convolution with a stride of 2 instead of downsampling. The Darknet-53 network can obtain 32x, 16x, and 8x downsampled feature maps to obtain different receptive fields. Strengthen the feature extraction layer: extract feature maps of different scales from the backbone feature network, and perform multi-scale feature fusion through bottom-up paths and lateral connections, so that both the high semantic information of the top layer and the high resolution of the bottom layer can be used rate information. Network prediction head (Head): Utilizes multi-scale output of feature layers of different scales to adapt to the detection of small-scale, medium-scale, and large-scale targets.

Yolov3的网络结构如图2所示。The network structure of Yolov3 is shown in Figure 2.

Yolov3目标检测模型的输入是416×416大小的图像，最后经过主干特征提取网络后分别得到三个不同大小的特征图y1、y2、y3，输出分别为13×13×C维度、26×26×C维度、52×52×C维度，其中C为（3×（4+1+L）），3表示每个网格的锚点框预测3个候选框参数，每个候选框预测的参数（tx,ty,tw,th,tcof,L）分别为目标的定位中心(tx,ty)、宽高(tw,th)、置信度tcof、和类别信息L。在通过Yolov3目标检测算法检测出多个候选边界框，最后使用非极大值抑制算法去掉冗余的预测框，只保留置信度最大的预测定位框作为目标的检测输出框。The input of the Yolov3 target detection model is an image of 416×416 size. Finally, after passing through the backbone feature extraction network, three feature maps of different sizes y1, y2, and y3 are obtained. The outputs are 13×13×C dimension and 26×26× respectively. C dimension, 52×52×C dimension, where C is (3×(4+1+L)), 3 means that the anchor point box of each grid predicts 3 candidate frame parameters, and the parameters predicted by each candidate frame ( tx, ty, tw, th, tcof, L) are respectively the positioning center (tx, ty), width and height (tw, th), confidence level tcof, and category information L of the target. After detecting multiple candidate bounding boxes through the Yolov3 target detection algorithm, the non-maximum suppression algorithm is finally used to remove redundant prediction boxes, and only the predicted positioning box with the highest confidence is retained as the target detection output box.

Yolov3的主干网络Darknet-53是由5个大的ResX组件组成，其中X分别为1、2、8、8、4分别表示残差块组件重复的次数。ResX是由一个DBL模块和X个Resunit模块组成，一个DBL模块由一个二维卷积Conv_2D+批量归一化BN(Batch Normalization)+激活函数LeakyRelu三个模块组成；一个残差Resunit块由两个DBL模块和一条残差边组成。若输入的图像大小为416×416最后经过主干特征提取网络后分别得到三个不同大小的特征图：13×13维度特征图、26×26维度特征图、52×52维度特征图。在网络的加强特征提取部分，首先将(13×13)特征图经过一个DBL*5模块（由5个DBL模块的堆叠组合），然后上采样与(26×26)特征图进行拼接，再经过一个DBL*5模块并上采样后与(52×52)特征图进行拼接，最后得到三层特征输出层。Yolov3's backbone network Darknet-53 is composed of 5 large ResX components, where X are 1, 2, 8, 8, and 4 respectively representing the number of repetitions of the residual block component. ResX is composed of a DBL module and It consists of a module and a residual edge. If the input image size is 416×416, three feature maps of different sizes will be obtained after passing through the backbone feature extraction network: 13×13 dimensional feature map, 26×26 dimensional feature map, and 52×52 dimensional feature map. In the enhanced feature extraction part of the network, the (13×13) feature map is first passed through a DBL*5 module (a stacked combination of 5 DBL modules), and then upsampled and spliced with the (26×26) feature map, and then passed through A DBL*5 module is upsampled and spliced with the (52×52) feature map, and finally a three-layer feature output layer is obtained.

在预测输出部分的预测头Head是由DBL+Conv_1×1的卷积组成用于输出预测参数。最后在32倍下采样得到的预测头维度为13×13×C，其采样倍数高，对原图的感受野大，用于大目标的检测；16倍下采样的预测头融合了32倍下采样的特征其输出维度为26×26×C，负责中尺度目标的预测；最后是8倍下采样融合了32倍、16倍下采样的特征，预测头输出维度为52×52×C,其在原图的感受较小，主要负责小尺度目标的预测。The prediction head Head in the prediction output part is composed of DBL+Conv_1×1 convolution and is used to output prediction parameters. Finally, the dimension of the prediction head obtained by 32 times down sampling is 13×13 The output dimension of the sampled features is 26×26×C, which is responsible for the prediction of medium-scale targets; finally, 8x downsampling combines 32x and 16x downsampling features, and the output dimension of the prediction head is 52x52xC. The feeling in the original image is smaller, and it is mainly responsible for the prediction of small-scale targets.

关键点位将目标检测与跟踪器结合起来，实现了跟踪目标的自动检测跟踪，以及当跟踪质量降低和目标遮挡超出视野范围等导致目标跟丢情况，根据跟踪质量评估后自主启动目标检测算法进行重新检测找回目标继续跟踪，保证了目标跟踪的持续性和鲁棒性。跟踪质量评估模块综合考虑了响应值和APCE值来进行策略判断是否启动目标检测，将基于深度学习的目标检测的检测精度与相关滤波模板跟踪的速度有效的融合，实现了持续稳定的跟踪。Key points combine target detection and tracker to realize automatic detection and tracking of tracking targets. When the tracking quality is reduced and the target is blocked beyond the field of view, resulting in the target being lost, the target detection algorithm is automatically started based on the tracking quality assessment. Re-detect and retrieve the target to continue tracking, ensuring the persistence and robustness of target tracking. The tracking quality evaluation module comprehensively considers the response value and APCE value to determine whether to start target detection, and effectively integrates the detection accuracy of target detection based on deep learning with the speed of related filter template tracking to achieve sustained and stable tracking.

将目标检测与跟踪器结合起来，实现了跟踪目标的自动检测跟踪，当跟踪质量降低和目标遮挡超出视野范围等导致目标跟丢情况，根据跟踪质量评估后进行重新检测，综合考虑跟踪目标的响应值和APCE值的策略来进行跟踪质量评估，兼容了检测精度与相关滤波的跟踪速度。The target detection and tracker are combined to realize the automatic detection and tracking of the tracking target. When the tracking quality is reduced or the target is blocked beyond the field of view, causing the target to be lost, re-detection will be performed based on the tracking quality evaluation, and the response of the tracking target will be comprehensively considered. The tracking quality evaluation is carried out using the strategy of APCE value and APCE value, which is compatible with the detection accuracy and the tracking speed of correlation filtering.

针对目标检测模块中的Yolov3检测算法，可以有SSD、FastRCNN等目标检测算法进行代替，同样可是实现对目标的检测功能。For the Yolov3 detection algorithm in the target detection module, it can be replaced by target detection algorithms such as SSD and FastRCNN, which can also realize the target detection function.

针对跟踪器模版中KCF目标跟踪算法，可以采用MOSSE算法、CSK算法等相关滤波跟踪算法，也能实现对目标的跟踪功能。For the KCF target tracking algorithm in the tracker template, related filter tracking algorithms such as MOSSE algorithm and CSK algorithm can be used to achieve the target tracking function.

基于相关滤波的KCF单目标跟踪算法，流程如图3所示。The process of KCF single target tracking algorithm based on correlation filtering is shown in Figure 3.

KCF首先提取第一帧目标区域的图像块的特征Gray，在利用cos_windows进行滤波，然后通过快速傅里叶变换转换到频域，训练初始滤波器。检测阶段时利用滤波器模型计算相关响应，选取最大响应峰值处作为目标的运动位置；提取相应峰值为中心的图像区域图像特征，学习并更新期望输出的滤波器模型，循环执行上述步骤，更新滤波器来跟踪目标在下一帧图像上的位置。KCF first extracts the feature Gray of the image block in the target area of the first frame, uses cos_windows for filtering, and then converts it to the frequency domain through fast Fourier transform to train the initial filter. In the detection stage, the filter model is used to calculate the relevant response, and the maximum response peak is selected as the target's motion position; the image features of the image area centered on the corresponding peak are extracted, the filter model of the desired output is learned and updated, and the above steps are executed in a loop to update the filter to track the position of the target on the next frame of image.

如上所述，可较好地实现本发明。As described above, the present invention can be better implemented.

本说明书中所有实施例公开的所有特征，或隐含公开的所有方法或过程中的步骤，除了互相排斥的特征和/或步骤以外，均可以以任何方式组合和/或扩展、替换。All features disclosed in all embodiments in this specification, or all steps in methods or processes implicitly disclosed, except for mutually exclusive features and/or steps, may be combined and/or extended or replaced in any manner.

以上所述，仅是本发明的较佳实施例而已，并非对本发明作任何形式上的限制，依据本发明的技术实质，在本发明的精神和原则之内，对以上实施例所作的任何简单的修改、等同替换与改进等，均仍属于本发明技术方案的保护范围之内。The above are only preferred embodiments of the present invention and do not limit the present invention in any form. According to the technical essence of the present invention and within the spirit and principles of the present invention, any simple modifications to the above embodiments may be made. Modifications, equivalent substitutions and improvements, etc., all still fall within the protection scope of the technical solution of the present invention.

Claims

1. The autonomous target detecting and tracking method is characterized by comprising the following steps:

s1, target detection: circularly detecting the image until a target is detected;

s2, initializing a tracker template: initializing a tracker template;

s3, target tracking: tracking the target.

2. The autonomous detection and tracking method of a target according to claim 1, wherein in step S1, positioning frame information of an area where the target is located is obtained, and a center point (x, y) and a width and height (w, h) of the positioning frame position of the target in the image are output; wherein X represents the X-axis coordinate of the center point, Y represents the Y-axis coordinate of the center point, w represents the width of the positioning frame, and h represents the height of the positioning frame.

3. The method of autonomous target detection and tracking according to claim 1, wherein step S2 comprises the steps of:

s21, determining the size of a target search area by using the (x, y, w, h) parameters of the detected positioning frame and the set filling parameters, and creating an initialized cosine window and a Gaussian ideal response label according to the size of the target search area;

s22, extracting gray scale features of the region where the target is located, and multiplying a cosine window by the extracted feature map to solve the edge effect;

s23, the obtained result of multiplying the gray features by the cosine window is converted into a frequency domain through Fourier transformation and multiplied by the Gaussian ideal response label after Fourier transformation, and therefore initialization of the tracker template is completed.

4. The method of autonomous target detection and tracking according to claim 1, wherein step S3 comprises the steps of:

s31, obtaining frequency domain characteristics of a target search area;

s32, obtaining the positioning point coordinates and the response values of the targets in the current frame image;

s33, after the target position of the current frame is obtained, gray features of a target search area are extracted by taking the target position as a center, and a template is updated after weighted average is carried out on the tracker template to be used as a tracking filtering model of the next frame.

5. The method of autonomous target detection and tracking according to claim 4, wherein in step S31, image data of a current frame is read in real time, an image of a target search area size in an initialization stage is intercepted with a target center point of tracking prediction of a previous frame image as a center, gray scale characteristics of the target area are extracted, and fourier transform is performed after multiplication with a cosine window to obtain frequency domain characteristics of the target search area.

6. The autonomous detection and tracking method according to claim 4, wherein in step S32, a matching calculation is performed with the tracking filter model, a response chart of the search area is obtained after the calculation result is subjected to inverse fourier transform, the coordinate of the maximum peak in the response chart is the center coordinate of the predicted position of the target, and the coordinate is converted to the original chart, so as to obtain the positioning point coordinate of the target in the current frame image and the response value thereof.

7. The method according to any one of claims 1 to 6, wherein step S3 includes the steps of:

s34, carrying out tracking quality evaluation on the result of each tracking prediction by using a tracking quality evaluation module, and judging whether to start a target detection module to carry out global re-detection according to the evaluated result: if the detection is judged to be needed to be re-detected, starting a target detection module to detect the target in the full-frame image range, selecting a target with the detection target output confidence higher than a set threshold as a retrieved tracking target, outputting positioning frame information of the target, and re-initializing a tracker model; otherwise, continuing the tracker to track.

8. The method according to claim 7, wherein in step S34, the tracking quality evaluation module evaluates a policy as follows: and carrying out comprehensive evaluation judgment by utilizing the response value of the tracking prediction output and the average peak correlation energy.

9. The method for autonomous target detection and tracking according to claim 8, wherein in step S34, the average peak correlation energy is calculated by:

wherein,represents the average peak correlation energy, +.>Representing the maximum value in the response graph of the tracking prediction result, +.>Representing the minimum value in the trace prediction response map, < >>Representing the response value at the (w, h) position in the response map, < >>Representing the averaging function.

10. An autonomous target detection and tracking system, characterized by being used for realizing the autonomous target detection and tracking method according to any one of claims 1 to 9, comprising the following modules connected in sequence:

the target detection module: the image detection device is used for circularly detecting the image until a target is detected;

tracker template initialization module: the tracker template is used for initializing;

a target tracking module: for tracking the target.