CN109829436B

CN109829436B - A Multi-Face Tracking Method Based on Deep Apparent Features and Adaptive Aggregation Networks

Info

Publication number: CN109829436B
Application number: CN201910106309.1A
Authority: CN
Inventors: 柯逍; 郑毅腾; 朱敏琛
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2019-02-02
Filing date: 2019-02-02
Publication date: 2022-05-13
Anticipated expiration: 2039-02-02
Also published as: WO2020155873A1; CN109829436A

Abstract

The invention relates to a multi-face tracking method based on deep apparent features and an adaptive aggregation network. First, a face recognition data set is used to train an adaptive aggregation network; then a face detection method based on a convolutional neural network is used to obtain the human face. The position of the face target to be tracked is initialized, and the face features are extracted; then the Kalman filter is used to predict the position of each face tracking target in the next frame, and the position of the face is located again in the next frame. Face extraction features; finally, an adaptive aggregation network is used to aggregate the face feature sets in the tracking track of each tracked face target, and dynamically generate a face depth apparent feature fused with multi-frame information. The position and the fused features are calculated and matched with the face position and its features obtained through detection in the current frame, and the tracking status is updated. The present invention can improve the performance of face tracking.

Description

A Multi-Face Tracking Method Based on Deep Apparent Features and Adaptive Aggregation Networks

技术领域technical field

本发明涉及模式识别与计算机视觉领域，特别是一种基于深度表观特征和自适应聚合网络的多人脸跟踪方法。The invention relates to the field of pattern recognition and computer vision, in particular to a multi-face tracking method based on deep apparent features and an adaptive aggregation network.

背景技术Background technique

近年来，随着社会进步及科技的不断发展，视频人脸识别问题已渐渐成为一个热门研究领域，吸引了国内外众多专家学者的研究兴趣，作为视频人脸识别的入口和基础，人脸检测和跟踪技术得到了快速的发展，广泛应用在智能监控、虚拟现实感知接口、视频会议等领域，由于现实的视频背景是复杂多变的，且人脸作为非刚性目标，在视频序列中可能存在大幅的姿态或表情的变化，在真实场景中实现一个鲁棒的人脸跟踪算法仍然具有很大的挑战。In recent years, with social progress and the continuous development of technology, video face recognition has gradually become a hot research field, attracting the research interest of many experts and scholars at home and abroad. As the entrance and basis of video face recognition, face detection And tracking technology has developed rapidly and is widely used in intelligent monitoring, virtual reality perception interface, video conferencing and other fields. Because the real video background is complex and changeable, and the face is a non-rigid target, it may exist in the video sequence. With large pose or expression changes, it is still a great challenge to implement a robust face tracking algorithm in real scenes.

为了对一个人脸进行分析，我们首先必须捕捉人脸，这可以通过人脸检测技术和人脸跟踪技术来实现，只有在视频中精确定位和跟踪人脸目标，我们才可以对人脸进行更细致的分析，如人脸识别，姿态估计等。目标跟踪技术无疑是智能安防中最为重要的技术之一，人脸跟踪技术便是目前跟踪技术的一种具体应用，其运用跟踪算法处理视频序列中运动的人脸，并保持对这个人脸区域的锁定完成跟踪，该技术在智能安防及视频监控等场景下都具有良好的应用前景。In order to analyze a face, we must first capture the face, which can be achieved through face detection technology and face tracking technology. Only by accurately locating and tracking the face target in the video, we can update the face. Detailed analysis, such as face recognition, pose estimation, etc. Target tracking technology is undoubtedly one of the most important technologies in intelligent security. Face tracking technology is a specific application of current tracking technology. It uses tracking algorithms to process moving faces in video sequences, and keeps track of this face area. This technology has good application prospects in scenarios such as intelligent security and video surveillance.

人脸跟踪在视频监控中扮演着重要的角色，但目前在真实场景中，由于人脸姿态的大幅变化以及跟踪目标之间的重叠与遮挡，导致实际应用还具有较大的困难。Face tracking plays an important role in video surveillance, but currently in real scenes, due to the large changes in face poses and the overlap and occlusion between tracking targets, practical applications are still difficult.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本发明的目的是提出一种基于深度表观特征和自适应聚合网络的多人脸跟踪方法，能够提升人脸跟踪的性能。In view of this, the purpose of the present invention is to propose a multi-face tracking method based on deep apparent features and an adaptive aggregation network, which can improve the performance of face tracking.

本发明采用以下方案实现：一种基于深度表观特征和自适应聚合网络的多人脸跟踪方法，具体包括以下步骤：The present invention adopts the following scheme to realize: a multi-face tracking method based on deep apparent feature and self-adaptive aggregation network, which specifically includes the following steps:

步骤S1：采用人脸识别数据集训练自适应聚合网络；Step S1: using the face recognition data set to train the adaptive aggregation network;

步骤S2：根据初始的输入视频帧，采用卷积神经网络获取人脸的位置，初始化待跟踪的人脸目标，提取人脸特征并保存；Step S2: According to the initial input video frame, use the convolutional neural network to obtain the position of the face, initialize the face target to be tracked, extract the face features and save;

步骤S3：采用卡尔曼滤波器预测每个人脸目标在下一帧的位置，并在下一帧中再次定位人脸所在位置，并对检测出的人脸提取特征；Step S3: use Kalman filter to predict the position of each face target in the next frame, and locate the position of the face again in the next frame, and extract features for the detected face;

步骤S4：使用步骤S1训练好的自适应聚合网络，对每个跟踪的人脸目标跟踪轨迹中的人脸特征集合进行聚合，动态地生成一个融合多帧信息的人脸深度表观特征,结合预测的位置及融合后的特征,与当前帧中通过检测得到的人脸位置及其特征,进行相似度计算与匹配，更新跟踪状态。Step S4: Use the adaptive aggregation network trained in step S1 to aggregate the face feature set in the tracking track of each tracked face target, and dynamically generate a face deep apparent feature fused with multi-frame information. The predicted position and the fused feature are compared with the detected face position and its features in the current frame, and the similarity is calculated and matched, and the tracking status is updated.

进一步地，步骤S1具体包括以下步骤：Further, step S1 specifically includes the following steps:

步骤S11：收集公开的人脸识别数据集，获得相关人物的图片及姓名；Step S11: collect public face recognition data sets, and obtain pictures and names of relevant persons;

步骤S12：采用融合策略对多个数据集中共有人物的图片进行整合，使用预训练的MTCNN模型进行人脸检测和人脸关键点定位，并应用相似变换进行人脸对齐，同时将训练集中的所有图像都减去其每个通道在训练集上的均值，完成数据预处理，训练自适应聚合网络。Step S12: Integrate pictures of common characters in multiple datasets by using a fusion strategy, use the pre-trained MTCNN model for face detection and face key point location, and apply similarity transformation for face alignment, and at the same time all the training set The images are subtracted from the mean of each channel on the training set, data preprocessing is done, and the adaptive aggregation network is trained.

进一步地，所述自适应聚合网络由深度特征抽取模块和自适应特征聚合模块串联而成，其接受同一个人的一张或多张人脸图像作为输入，输出聚合后的特征，其中深度特征抽取模块采用34层的ResNet作为骨干网络，自适应特征聚合模块含有一个特征聚合层；令B表示输入的样本数量，{z_t}表示深度特征抽取模块的输出特征集合，其中t＝1,2,...,B表示输入样本编号，特征聚合层的计算方式为：Further, the adaptive aggregation network is formed by a deep feature extraction module and an adaptive feature aggregation module in series, which accepts one or more face images of the same person as input, and outputs the aggregated features, wherein the deep feature extraction The module adopts 34 layers of ResNet as the backbone network, and the adaptive feature aggregation module contains a feature aggregation layer; let B represent the number of input samples, {z _t } represents the output feature set of the deep feature extraction module, where t=1, 2, ...,B represents the input sample number, and the calculation method of the feature aggregation layer is:

a＝∑_to_tz_t；a=∑ _t o _t z _t ;

式中，q表示特征向量z_t各个分量的权重，是可以学习的参数，通过将人脸识别信号作为监督信号，利用反向传播和梯度下降方法进行学习，v_t为sigmoid函数的输出，代表每个特征向量z_t的分数，范围在0和1之间，o_t为L1归一化的输出，使得∑_to_t＝1，a为B个特征向量聚合后的一个特征向量。In the formula, q represents the weight of each component of the feature vector z _t , which is a parameter that can be learned. By using the face recognition signal as the supervision signal, the back-propagation and gradient descent methods are used for learning, and v _t is the output of the sigmoid function, representing The score of each eigenvector z _t , ranging between 0 and 1, o _t is the output of L1 normalization, so that ∑ _t o _t =1, a is an eigenvector after B eigenvectors are aggregated.

进一步地，步骤S2具体包括以下步骤：Further, step S2 specifically includes the following steps:

步骤S21：令i表示输入视频的第i帧的编号，初始时i＝1，使用预训练的MTCNN 模型同时检测所有人脸的位置Dⁱ及其对应的面部关键点的位置Cⁱ，其中

j为第j个检测到人脸的编号，Jⁱ为第帧检测到的人脸数量，

其中

表示第i帧中第j个人脸的位置，x,y,w,h分别表示人脸区域的左上角坐标及其宽度和高度，

其中

表示第i帧中第 j个人脸的关键点，c₁,c₂,c₃,c₄,c₅分别表示人脸的左眼，右眼，鼻子，左嘴角，右嘴角的坐标；Step S21: Let i represent the number of the ith frame of the input video, initially i=1, use the pre-trained MTCNN model to simultaneously detect the position D ⁱ of all faces and the position C ⁱ of the corresponding facial key points, wherein

j is the number of the jth detected face, J ⁱ is the number of detected faces in the jth frame,

in

Represents the position of the jth face in the ith frame, x, y, w, h represent the upper left corner coordinates of the face area and its width and height, respectively,

in

Represents the key points of the jth face in the ith frame, c ₁ , c ₂ , c ₃ , c ₄ , and c ₅ respectively represent the coordinates of the left eye, right eye, nose, left corner of the mouth, and right corner of the mouth;

步骤S22：对于每一个人脸的位置D_j ⁱ及其面部关键点坐标

为其分配一个唯一的身份ID_k,k＝1,2,...,Kⁱ，其中k表示第k个跟踪目标的编号，Kⁱ表示在第i帧时跟踪目标的人数，并初始化其对应的跟踪器T_k＝{ID_k,P_k,L_k,E_k,A_k}，其中ID_k表示第k 个跟踪目标的唯一身份标识，P_k表示分配给第k个目标的人脸位置坐标，L_k表示第 k个目标的面部关键点坐标，E_k表示第k个目标的人脸特征列表，A_k表示第k个目标的生命周期，初始化Kⁱ＝Jⁱ，

A_k＝1；Step S22 ^: For each face position D _ji and its facial key point coordinates

It is assigned a unique identity ID _k , k=1,2,...,K ⁱ , where k represents the number of the k-th tracking target, K ⁱ represents the number of people tracking the target at the i-th frame, and initializes its The corresponding tracker T _k ={ID _k ,P _k ,L _k ,E _k ,A _k }, where ID _k represents the unique identification of the k-th tracked target, and P _k represents the face assigned to the k-th target Position coordinates, L _k represents the facial key point coordinates of the k-th target, E _k represents the face feature list of the k-th target, A _k represents the life cycle of the k-th target, initialization K ⁱ =J ⁱ ,

A _k = 1;

步骤S23：对于T_k中的每一个人脸的位置P_k，对图像进行裁剪，得到对应的人脸图像，使用对应的面部关键点位置L_k，应用相似变换进行人脸对齐，得到对齐后的人脸图像；Step S23: For the position P _k of each face in T _k , crop the image to obtain the corresponding face image, use the corresponding face key point position L _k , apply the similarity transformation to align the face, and obtain the aligned face image. face image;

步骤S24：将对齐后的人脸图像输入自适应聚合网络，得到对应的人脸深度表观特征，添加到跟踪器中T_k的特征列表E_k。Step S24: Input the aligned face image into the adaptive aggregation network to obtain the corresponding deep apparent feature of the face, and add it to the feature list E _k of T _k in the tracker.

进一步地，步骤S3具体包括以下步骤：Further, step S3 specifically includes the following steps:

步骤S31：将每个跟踪的人脸目标状态表示为以下形式：Step S31: Represent the state of each tracked face target in the following form:

式中，m表示跟踪的人脸目标状态，u和v表示跟踪人脸区域的中心坐标，s为人脸框的面积，r为人脸框的宽高比，

分别表示(u,v,s,r)在图像坐标空间中的速度；In the formula, m represents the state of the tracked face target, u and v represent the center coordinates of the tracked face area, s is the area of the face frame, r is the aspect ratio of the face frame,

respectively represent the speed of (u, v, s, r) in the image coordinate space;

步骤S32：将每个跟踪器T_k中的人脸位置P_k＝(x,y,w,h)转化为

的形式，其中

表示第i帧中第k个跟踪目标的人脸位置转化后的形式；Step S32: Convert the face position P _k =(x, y, w, h) in each tracker T _k into

in the form of

Represents the transformed form of the face position of the k-th tracking target in the i-th frame;

步骤S33：将

作为第i帧第k个跟踪目标的直接观测结果，其由人脸检测而来，采用基于线性匀速运动模型的卡尔曼滤波器对第k个跟踪目标在第i+1帧中的状态

进行预测；Step S33: put

As the direct observation result of the k-th tracking target in the i-th frame, it is obtained from face detection, and the Kalman filter based on the linear uniform motion model is used to analyze the state of the k-th tracking target in the i+1-th frame.

make predictions;

步骤S34：在第i+1帧中，采用MTCNN模型再次进行人脸检测与面部关键点定位，得到人脸的位置Dⁱ⁺¹和面部关键点Cⁱ⁺¹；Step S34: in the i+1th frame, the MTCNN model is used to perform face detection and facial key point positioning again to obtain the position D ⁱ⁺¹ of the human face and the facial key point C ⁱ⁺¹ ;

步骤S35：对每一个人脸位置

基于其面部关键点

应用相似变换完成人脸对齐，并输入自适应聚合网络提取特征，得到特征集合Fⁱ⁺¹，其中Fⁱ⁺¹表示第i+1 帧中所有人脸的特征集合。Step S35: for each face position

based on its facial key points

Apply similarity transformation to complete face alignment, and input adaptive aggregation network to extract features to obtain feature set F ⁱ⁺¹ , where F ⁱ⁺¹ represents the feature set of all faces in the i+1th frame.

进一步地，步骤S4具体包括以下步骤：Further, step S4 specifically includes the following steps:

步骤S41：对于每个人脸的跟踪器T_k，将其历史运动轨迹中所有特征的集合E_k输入自适应聚合网络，得到聚合特征f_k,其中f_k表示将第k个跟踪目标历史运动轨迹中所有特征向量进行融合之后输出的一个聚合特征；Step S41: For the tracker T _k of each face, input the set E _k of all the features in its historical motion trajectory into the adaptive aggregation network to obtain the aggregated feature f _k , where f _k represents the historical motion trajectory of the kth tracking target. An aggregated feature output after fusion of all feature vectors in ;

步骤S42：将第i帧中由卡尔曼滤波器预测的第k个目标在下一帧的位置状态

转化为

的形式；Step S42: Calculate the position state of the k-th target predicted by the Kalman filter in the i-th frame in the next frame

transform into

form;

步骤S43：结合

和目标k聚合后的特征f_k，以及第i+1帧中的由人脸检测得到的人脸位置Dⁱ⁺¹及其特征集合Fⁱ⁺¹，计算如下关联矩阵：Step S43: Combine

The feature f _k aggregated with the target k, and the face position D ⁱ⁺¹ and its feature set F ⁱ⁺¹ obtained by face detection in the i+1th frame, calculate the following correlation matrix:

G＝[g_jk],j＝1,2,...,Jⁱ⁺¹,k＝1,2,...,Kⁱ；G=[g _jk ], j=1,2,...,J ⁱ⁺¹ ,k=1,2,...,K ⁱ ;

式中，Jⁱ⁺¹为第i+1帧中检测到的人脸数量，Kⁱ为第i帧中的跟踪目标数量，

为第i+1帧中第j个人脸检测框与第i帧中由卡尔曼滤波器预测的第k个目标在第i+1帧中的位置状态

之间的重合程度，

为第i+1帧中第j个人脸特征

与第i帧中第k个目标聚合特征f_k之间的余弦相似度，λ为超参数，用于平衡两个度量的权重；In the formula, J ⁱ⁺¹ is the number of faces detected in the i+1th frame, K ⁱ is the number of tracked targets in the i-th frame,

is the position state of the jth face detection frame in the i+1th frame and the kth target predicted by the Kalman filter in the ith frame in the i+1th frame

the degree of overlap between

is the jth face feature in the i+1th frame

Cosine similarity with the k-th target aggregated feature f _k in the i-th frame, λ is a hyperparameter used to balance the weights of the two metrics;

步骤S44：将关联矩阵G作为代价矩阵，使用匈牙利算法计算得到匹配的结果，将第i+1帧中的人脸检测框

关联到第k个跟踪目标；Step S44: Using the correlation matrix G as the cost matrix, the Hungarian algorithm is used to calculate the matching result, and the face detection frame in the i+1th frame is

Associated with the kth tracking target;

步骤S45：将匹配结果中的下标对应关联矩阵G中的项，并过滤所有小于T_similarity的项g_jk，将其从匹配结果中删除，其中T_similarity为设定的超参数，表示匹配成功的最低相似度阈值；Step S45: Correspond the subscript in the matching result to the item in the association matrix G, and filter all items g _jk less than T _similarity , and delete it from the matching result, where T _similarity is a set hyperparameter, indicating that the matching is successful The minimum similarity threshold of ;

步骤S46：在匹配结果中，若检测框

与第k个跟踪目标关联成功，则更新对应跟踪器T_k中的位置状态

人脸关键点位置

生命周期A_k＝A_k+1，以及将对应的人脸特征

添加到特征列表E_k，若检测框

关联失败，则创建新的跟踪器；Step S46: In the matching result, if the detection frame is

If the association with the k-th tracking target is successful, the position state in the corresponding tracker T _k is updated

Face key point location

Life cycle A _k =A _k +1, and the corresponding face features

add to the feature list E _k , if the detection frame

If the association fails, create a new tracker;

步骤S47：对每一个跟踪器T_k，若其生命周期A_k＞T_age，则删除该跟踪器，其中T_age为设定的超参数，表示一个跟踪目标可以存活的最长时间。Step S47: For each tracker T _k , if its life cycle A _k >T _age , delete the tracker, where T _age is a set hyperparameter, indicating the longest time a tracking target can survive.

与现有技术相比，本发明有以下有益效果：Compared with the prior art, the present invention has the following beneficial effects:

1、本发明构建的一种基于深度表观特征和自适应聚合网络的多人脸跟踪方法能够有效地对视频中的人脸进行跟踪，提升了人脸跟踪的准确率，并降低了目标切换的次数。1. A multi-face tracking method based on deep apparent features and an adaptive aggregation network constructed by the present invention can effectively track the faces in the video, improve the accuracy of face tracking, and reduce target switching. number of times.

2、本发明能够在保证跟踪效果的同时对视频中的人脸进行在线跟踪。2. The present invention can track the human face in the video online while ensuring the tracking effect.

3、针对人脸跟踪过程中，预测的人脸位置不确定性较大，同时人脸可能发生大幅姿态变化以及遮挡等问题，本发明提出了利用人脸深度表观特征的方法，通过结合空间位置与深度特征之间的信息，提高了人脸跟踪的性能。3. In the face tracking process, the predicted position of the face has a large uncertainty, and the face may undergo large posture changes and occlusions. The information between position and depth features improves the performance of face tracking.

4、针对人脸跟踪过程中，难以有效利用同一目标跟踪轨迹中的所有特征，并将多个特征集合之间进行有效比较的问题，本发明提出了自适应聚合网络，通过特征聚合模块自适应地学习特征集合中每一个特征的重要程度并有效地进行融合，提升了人脸跟踪的效果。4. In the face tracking process, it is difficult to effectively utilize all the features in the same target tracking trajectory and effectively compare multiple feature sets. It learns the importance of each feature in the feature set and fuses it effectively, which improves the effect of face tracking.

附图说明Description of drawings

图1为本发明实施例的流程示意图。FIG. 1 is a schematic flowchart of an embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图及实施例对本发明做进一步说明。The present invention will be further described below with reference to the accompanying drawings and embodiments.

应该指出，以下详细说明都是示例性的，旨在对本申请提供进一步的说明。除非另有指明，本文使用的所有技术和科学术语具有与本申请所属技术领域的普通技术人员通常理解的相同含义。It should be noted that the following detailed description is exemplary and intended to provide further explanation of the application. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

需要注意的是，这里所使用的术语仅是为了描述具体实施方式，而非意图限制根据本申请的示例性实施方式。如在这里所使用的，除非上下文另外明确指出，否则单数形式也意图包括复数形式，此外，还应当理解的是，当在本说明书中使用术语“包含”和/或“包括”时，其指明存在特征、步骤、操作、器件、组件和 /或它们的组合。It should be noted that the terminology used herein is for the purpose of describing specific embodiments only, and is not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly dictates otherwise, the singular is intended to include the plural as well, furthermore, it is to be understood that when the terms "comprising" and/or "including" are used in this specification, it indicates that There are features, steps, operations, devices, components and/or combinations thereof.

如图1所示，本实施例提供了一种基于深度表观特征和自适应聚合网络的多人脸跟踪方法，具体包括以下步骤：As shown in FIG. 1 , this embodiment provides a multi-face tracking method based on deep apparent features and an adaptive aggregation network, which specifically includes the following steps:

步骤S2：根据初始的输入视频帧，使用基于卷积神经网络的人脸检测方法获取人脸的位置，初始化待跟踪的人脸目标，提取人脸特征并保存；Step S2: according to the initial input video frame, use the face detection method based on convolutional neural network to obtain the position of the face, initialize the face target to be tracked, extract the face features and save;

步骤S3：采用卡尔曼滤波器预测每个人脸目标在下一帧的位置，并在下一帧中再次使用人脸检测方法定位人脸所在位置，并对检测出的人脸提取特征；Step S3: use the Kalman filter to predict the position of each face target in the next frame, and use the face detection method again in the next frame to locate the position of the face, and extract features for the detected face;

在本实施例中，步骤S1具体包括以下步骤：In this embodiment, step S1 specifically includes the following steps:

在本实施例中，所述自适应聚合网络由深度特征抽取模块和自适应特征聚合模块串联而成，其接受同一个人的一张或多张人脸图像作为输入，输出聚合后的特征，其中深度特征抽取模块采用34层的ResNet作为骨干网络，自适应特征聚合模块含有一个特征聚合层；令B表示输入的样本数量，{z_t}表示深度特征抽取模块的输出特征集合，其中t＝1,2,...,B表示输入样本编号，特征聚合层的计算方式为：In this embodiment, the adaptive aggregation network is composed of a deep feature extraction module and an adaptive feature aggregation module in series, which accepts one or more face images of the same person as input, and outputs the aggregated features, wherein The deep feature extraction module uses 34 layers of ResNet as the backbone network, and the adaptive feature aggregation module contains a feature aggregation layer; let B represent the number of input samples, {z _t } represents the output feature set of the deep feature extraction module, where t=1 ,2,...,B represents the input sample number, and the calculation method of the feature aggregation layer is:

a＝∑_to_tz_t；a=∑ _t o _t z _t ;

式中，q表示特征向量z_t各个分量的权重，是可以学习的参数，通过将人脸识别信号作为监督信号，利用反向传播和梯度下降方法进行学习，v_t为sigmoid函数的输出，代表每个特征向量z_t的分数，范围在0和1之间，o_t为L1归一化的输出，使得∑_to_t＝1，a为B个特征向量聚合后的一个特征向量。In the formula, q represents the weight of each component of the feature vector z _t , which is a parameter that can be learned. By using the face recognition signal as a supervision signal, the back-propagation and gradient descent methods are used for learning, and v _t is the output of the sigmoid function, representing The score of each eigenvector z _t , ranging between 0 and 1, o _t is the output of L1 normalization, so that ∑ _t o _t =1, a is an eigenvector after B eigenvectors are aggregated.

在本实施例中，步骤S2具体包括以下步骤：In this embodiment, step S2 specifically includes the following steps:

j为第j个检测到人脸的编号，Jⁱ为第帧检测到的人脸数量，

其中

其中

in

步骤S22：对于每一个人脸的位置D_j ⁱ及其面部关键点坐标

为其分配一个唯一的身份ID_k,k＝1,2,...,Kⁱ，其中k表示第k个跟踪目标的编号，Kⁱ表示在第i帧时跟踪目标的人数，并初始化其对应的跟踪器T_k＝{ID_k,P_k,L_k,E_k,A_k}，其中ID_k表示第k 个跟踪目标的唯一身份标识，P_k表示分配给第k个目标的人脸位置坐标，L_k表示第k个目标的面部关键点坐标，E_k表示第k个目标的人脸特征列表，A_k表示第k个目标的生命周期，初始化Kⁱ＝Jⁱ，

A _k = 1;

在本实施例中，步骤S3具体包括以下步骤：In this embodiment, step S3 specifically includes the following steps:

respectively represent the speed of (u, v, s, r) in the image coordinate space;

步骤S32：将每个跟踪器T_k中的人脸位置P_k＝(x,y,w,h)转化为

的形式，其中

in the form of

步骤S33：将

进行预测；Step S33: put

make predictions;

步骤S35：对每一个人脸位置

基于其面部关键点

based on its facial key points

在本实施例中，步骤S4具体包括以下步骤：In this embodiment, step S4 specifically includes the following steps:

转化为

transform into

form;

步骤S43：结合

之间的重合程度，

为第i+1帧中第j个人脸特征

the degree of overlap between

is the jth face feature in the i+1th frame

Associated with the kth tracking target;

步骤S46：在匹配结果中，若检测框

人脸关键点位置

生命周期A_k＝A_k+1，以及将对应的人脸特征

添加到特征列表E_k，若检测框

Face key point location

Life cycle A _k =A _k +1, and the corresponding face features

add to the feature list E _k , if the detection frame

If the association fails, create a new tracker;

本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/ 或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flows of the flowcharts and/or the block or blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.

以上所述，仅是本发明的较佳实施例而已，并非是对本发明作其它形式的限制，任何熟悉本专业的技术人员可能利用上述揭示的技术内容加以变更或改型为等同变化的等效实施例。但是凡是未脱离本发明技术方案内容,依据本发明的技术实质对以上实施例所作的任何简单修改、等同变化与改型，仍属于本发明技术方案的保护范围。The above are only preferred embodiments of the present invention, and are not intended to limit the present invention in other forms. Any person skilled in the art may use the technical content disclosed above to make changes or modifications to equivalent changes. Example. However, any simple modifications, equivalent changes and modifications made to the above embodiments according to the technical essence of the present invention without departing from the content of the technical solutions of the present invention still belong to the protection scope of the technical solutions of the present invention.

Claims

1. a multi-face tracking method based on deep apparent feature and self-adaptive aggregation network, is characterized in that: comprise the following steps:

Step S1: using the face recognition data set to train the adaptive aggregation network;

Step S2: According to the initial input video frame, use the convolutional neural network to obtain the position of the face, initialize the face target to be tracked, extract the face features and save;

Step S3: use Kalman filter to predict the position of each face target in the next frame, and locate the position of the face again in the next frame, and extract features for the detected face;

Step S4: Use the adaptive aggregation network trained in step S1 to aggregate the face feature set in the tracking track of each tracked face target, and dynamically generate a face deep apparent feature fused with multi-frame information. The predicted position and the fused features are calculated and matched with the face position and its features obtained through detection in the current frame, and the tracking status is updated;

Step S1 specifically includes the following steps:

Step S11: collect public face recognition data sets, and obtain pictures and names of relevant persons;

Step S12: Integrate pictures of common characters in multiple datasets by using a fusion strategy, use the pre-trained MTCNN model for face detection and face key point location, and apply similarity transformation for face alignment, and at the same time all the training set The average value of each channel on the training set is subtracted from the images, data preprocessing is completed, and the adaptive aggregation network is trained;

Step S2 specifically includes the following steps:

Step S21: let i represent the number of the ith frame of the input video, initially i=1, use the pre-trained MTCNN model to simultaneously detect the position D ⁱ of all faces and the position C ⁱ of the corresponding facial key points, wherein

j is the number of the j-th detected face, J ⁱ is the number of detected faces in the i-th frame,

in

in

Step S22: For the position of each face

and its facial keypoint coordinates

It is assigned a unique identity ID _k , k=1,2,...,K ⁱ , where k represents the number of the k-th tracking target, K ⁱ represents the number of people tracking the target at the i-th frame, and initializes its The corresponding tracker T _k ={ID _k ,P _k ,L _k ,E _k ,A _k }, where ID _k represents the unique identification of the k-th tracking target, and P _k represents the face assigned to the k-th target Position coordinates, L _k represents the facial key point coordinates of the k-th target, E _k represents the face feature list of the k-th target, A _k represents the life cycle of the k-th target, initialization K ⁱ =J ⁱ ,

A _k = 1;

Step S23: For the position P _k of each face in T _k , crop the image to obtain the corresponding face image, use the corresponding face key point position L _k , apply the similarity transformation to align the face, and obtain the aligned face image. face image;

Step S24: input the aligned face image into the adaptive aggregation network, obtain the corresponding face depth apparent feature, and add it to the feature list E _k of T _k in the tracker;

Step S3 specifically includes the following steps:

Step S31: Represent the state of each tracked face target in the following form:

In the formula, m represents the state of the tracked face target, u and v represent the center coordinates of the tracked face area, s is the area of the face frame, r is the aspect ratio of the face frame,

respectively represent the speed of (u, v, s, r) in the image coordinate space;

Step S32: Convert the face position P _k =(x, y, w, h) in each tracker T _k into

in the form of

Step S33: put

make predictions;

Step S34: in the i+1th frame, the MTCNN model is used to perform face detection and facial key point positioning again to obtain the position D ⁱ⁺¹ of the human face and the facial key point C ⁱ⁺¹ ;

Step S35: for each face position

based on its facial key points

Apply similarity transformation to complete face alignment, and input adaptive aggregation network to extract features to obtain feature set F ⁱ⁺¹ , where F ⁱ⁺¹ represents the feature set of all faces in the i+1th frame;

Step S4 specifically includes the following steps:

Step S41: For the tracker T _k of each face, input the set E _k of all the features in its historical motion trajectory into the adaptive aggregation network to obtain the aggregated feature f _k , where f _k represents the historical motion trajectory of the kth tracking target. An aggregated feature output after fusion of all feature vectors in ;

Step S42: Calculate the position state of the k-th target predicted by the Kalman filter in the i-th frame in the next frame

transform into

form;

Step S43: Combine

G=[g _jk ], j=1,2,...,J ⁱ⁺¹ ,k=1,2,...,K ⁱ ;

In the formula, J ⁱ⁺¹ is the number of faces detected in the i+1th frame, K ⁱ is the number of tracked targets in the i-th frame,

the degree of overlap between

is the jth face feature in the i+1th frame

Step S44: Using the correlation matrix G as the cost matrix, the Hungarian algorithm is used to calculate the matching result, and the face detection frame in the i+1th frame is

Associated with the kth tracking target;

Step S45: Correspond the subscript in the matching result to the item in the association matrix G, and filter all items g _jk less than T _similarity , and delete it from the matching result, where T _similarity is a set hyperparameter, indicating that the matching is successful The minimum similarity threshold of ;

Step S46: In the matching result, if the detection frame is

Face key point location

Life cycle A _k =A _k +1, and the corresponding face features

add to the feature list E _k , if the detection frame

If the association fails, create a new tracker;

Step S47: For each tracker T _k , if its life cycle A _k >T _age , delete the tracker, where T _age is a set hyperparameter, indicating the longest time a tracking target can survive.

2. a kind of multi-face tracking method based on deep apparent feature and self-adaptive aggregation network according to claim 1, is characterized in that: described self-adaptive aggregation network is connected in series by deep feature extraction module and self-adaptive feature aggregation module It accepts one or more face images of the same person as input, and outputs aggregated features. The deep feature extraction module uses 34 layers of ResNet as the backbone network, and the adaptive feature aggregation module contains a feature aggregation layer; Let B represent the number of input samples, {z _t } represent the output feature set of the deep feature extraction module, where t=1,2,...,B represents the input sample number, and the calculation method of the feature aggregation layer is:

a=∑ _t o _t z _t ;

In the formula, q represents the weight of each component of the feature vector z _t , which is a parameter that can be learned. By using the face recognition signal as the supervision signal, the back-propagation and gradient descent methods are used for learning, and v _t is the output of the sigmoid function, representing The score of each eigenvector z _t , ranging between 0 and 1, o _t is the output of L1 normalization, so that ∑ _t o _t =1, a is an eigenvector after B eigenvectors are aggregated.