CN112347967B - A Pedestrian Detection Method Fused with Motion Information in Complex Scenes - Google Patents
- Publication number
- CN112347967B (application CN202011290529.3A)
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- motion information
- target
- network
- pedestrian detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06F18/253—Fusion techniques of extracted features
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
- G06V2201/07—Target detection
- Y02T10/40—Engine management systems
Abstract
Description
Technical Field
The invention belongs to the technical field of target detection, and in particular relates to a pedestrian detection method that fuses motion information in complex scenes.
Background Art
Pedestrians are important targets in video surveillance tasks, and pedestrian detection is one of the fundamental tasks and key technologies of computer vision research. It determines whether pedestrians are present in an image or video sequence and localizes them precisely, and it is widely used in computer applications such as vehicle driver-assistance systems, video surveillance, and robotics.
Limited by storage resources, shooting distance, and many other factors, real-world surveillance video often suffers from poor image quality, pedestrian targets that occupy a small fraction of the frame, and occluded pedestrians, so pedestrian detection in complex scenes still exhibits serious missed and false detections in practice. With the development of machine learning and computer vision, deep-learning-based object detection has been widely applied to pedestrian detection tasks and has achieved remarkable results, providing ideas for pedestrian detection algorithms in complex scenes.
In 2012, Lijun Guo et al., in "Pedestrian Detection Method of Integrated Motion Information and Appearance Features", proposed a pedestrian detection method that combines motion information with appearance features for complex scenes. Motion information is incorporated into an image-sequence object segmentation algorithm, and more accurate segmentation results improve the detection accuracy of candidate windows. However, the accuracy of this algorithm is far below that of R-CNN-family pedestrian detection networks, its computational cost is high, and its performance in complex scenes is unsatisfactory.
In 2013, Zhang Zhiying, in "Research and Implementation of Pedestrian Detection Based on Target Motion Information and HOG Features", designed a pedestrian detection classifier that fuses target motion information, combining HOG features with an SVM classifier. However, its detection module is less accurate than R-CNN-family pedestrian detection networks, it runs slowly, and the inter-frame difference method it uses for motion information extraction leaves room for improvement.
In 2016, Jianan Li et al., in "Scale-Aware Fast R-CNN for Pedestrian Detection", proposed a network that integrates a large-size sub-network and a small-size sub-network into a single framework to handle small pedestrian targets in surveillance video. However, it is an image-based pedestrian detection network, and the underlying Fast R-CNN is not accurate enough to achieve good detection results on low-resolution images.
In 2016, Liliang Zhang et al., in "Is Faster R-CNN Doing Well for Pedestrian Detection?", disclosed an improved network that adapts the general-purpose Faster R-CNN detector to pedestrian detection, with a classification module that discriminates small pedestrian targets better. It is still an image-based detection network, however, and performs poorly when image quality is low.
In 2018, Aixin Guo, in "Multi-scale Pedestrian Detection Based on Deep Convolutional Feature Fusion", proposed a multi-scale pedestrian detection scheme based on deep convolutional feature fusion. To address the weak features of small and medium-scale pedestrians, low-level features are combined with high-level semantic features, and a focal loss is introduced for hard-example mining to improve accuracy. Although the scheme improves the detection rate on public datasets, it only applies when image quality is reasonably good.
In 2019, Xia Jinming et al., in "A Pedestrian Detection Algorithm Based on Faster R-CNN", introduced a hard-example mining strategy that picks out samples from complex environments and adjusts their weights so that training is better focused, improving the model's generalization. This, too, is an image-based pedestrian detection scheme; its recall improves slightly, but it remains suitable only for multi-scale pedestrian detection on clear images.
In 2019, Li Junyi et al., in "Video Pedestrian Detection Method Based on YOLO and GMM", proposed fusing motion information into pedestrian detection under complex lighting conditions. However, the YOLO algorithm it adopts is insensitive to small targets, and because it lowers the pedestrian detection threshold and uses motion information to remove false alarms, the method only applies when all pedestrian targets are moving; it performs poorly in scenes with many static pedestrians.
In 2019, Wang Lei, in "Research on High-Precision Pedestrian Detection Algorithms and System Design for Railway Traffic Safety", disclosed a detection framework that uses motion information to assist bounding-box screening. Structural similarity approximates a pedestrian's motion between video frames, and this motion information is combined with the confidence scores of the network's detections to re-evaluate the boxes. The method presumes that most pedestrians move between adjacent frames, so it has drawbacks for stationary pedestrians.
Most current pedestrian detection algorithms for complex scenes target a single problem, such as low resolution or small targets. In real scenes these problems usually occur together, so a detection network aimed at one problem cannot achieve good results in practice.
Existing neural-network-based pedestrian detection methods for complex scenes have the following shortcomings: (1) general methods detect pedestrians from single images and require high-resolution test images in which pedestrian targets occupy a large fraction of the frame; (2) some methods improve the pedestrian detection network for small targets but still operate on single images, and the underlying object detection network is not accurate enough, so detection degrades in low-resolution scenes; (3) the 2019 method of Li Junyi et al. is video-based and fuses motion information during detection, but because it lowers the detection threshold and uses motion information to remove false alarms, it only applies when all pedestrian targets are moving and performs poorly in scenes with many static pedestrians; (4) all of the above methods address a single problem, either small pedestrian targets in complex scenes or low video resolution, whereas surveillance video obtained in practice is limited by various factors and usually exhibits all of these problems at once. Consequently, there is as yet no better-performing pedestrian detection scheme for complex scenes that fuses motion information.
Summary of the Invention
In view of this, the object of the present invention is to provide a pedestrian detection method that fuses motion information in complex scenes, which solves pedestrian detection in low-resolution video and is sensitive to small targets.
A pedestrian detection method fusing motion information in complex scenes comprises the following steps:
Step 1: obtain the original video and process it into a picture sequence;
Step 2: pass the picture sequence through an RPN network to obtain target detection proposal candidate boxes;
Step 3: pass the original video through a moving-target recognition algorithm to obtain moving-target boxes;
Step 4: fuse the moving-target boxes obtained in step 3 with the proposal candidate boxes from step 2 to obtain the full set of proposal candidate boxes;
Step 5: divide all the proposal candidate boxes obtained in step 4 into two groups by size and feed the groups into two neural networks for classification and regression respectively;
Step 6: output the pedestrian detection results of the two neural networks together, yielding a video annotated with pedestrian target boxes.
Preferably, in step 1, the pictures in the picture sequence are first scaled and then input into a convolutional network to obtain a feature map for each picture, after which the target detection proposal candidate boxes are obtained, specifically:
a. classify every pixel in the picture using anchor boxes of nine different sizes, deciding whether it belongs to an object or to the background;
b. regress the anchors to obtain precise classification parameters;
c. sort the anchors by softmax score and keep the 2000 best-classified ones;
d. map the anchors back to the original image;
e. sort the anchors with the NMS algorithm and output the first 256 proposal candidate boxes (sub-steps c to e are sketched below).
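A minimal sketch of sub-steps c to e follows, using torchvision's NMS operator; the function name, tensor layout, and IoU threshold are illustrative assumptions rather than details fixed by the patent.

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes, scores, pre_nms_top_n=2000, post_nms_top_n=256, iou_thresh=0.7):
    """Keep the 2000 best-scoring anchors, then NMS down to 256 proposals.

    boxes:  (N, 4) anchor boxes in (x1, y1, x2, y2) image coordinates
    scores: (N,)   foreground (objectness) softmax scores
    """
    # c. sort anchors by softmax score and keep the 2000 best-classified ones
    order = scores.argsort(descending=True)[:pre_nms_top_n]
    boxes, scores = boxes[order], scores[order]
    # e. non-maximum suppression, then output the first 256 proposal boxes
    keep = nms(boxes, scores, iou_thresh)[:post_nms_top_n]
    return boxes[keep], scores[keep]
```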
Preferably, steps 2 and 3 are executed in parallel.
Preferably, in step 3, the GMM algorithm performs moving-target recognition on the original video sequence to obtain the moving-target boxes.
Preferably, in step 4, the fusion is performed with the non-maximum suppression (NMS) algorithm.
Preferably, in step 5, all the proposal candidate boxes obtained in step 4 are sorted by area; the boxes whose areas rank in the top 50% form the first group, and those in the bottom 50% form the second group.
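This split reduces to a sort on box areas; the sketch below assumes (x1, y1, x2, y2) boxes and an exact median split, which the text above implies but does not spell out.

```python
import torch

def split_by_area(boxes):
    """Split proposal boxes into large/small groups at the median area.

    boxes: (N, 4) tensor in (x1, y1, x2, y2) format
    """
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = areas.argsort(descending=True)
    half = len(order) // 2
    large = boxes[order[:half]]   # top 50% by area, for the large-size sub-network
    small = boxes[order[half:]]   # bottom 50% by area, for the small-size sub-network
    return large, small
```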
Preferably, in step 5, within each neural network, the feature map regions of the proposal candidate boxes are first size-normalized by the ROI pooling layer of the Faster R-CNN network; these feature maps are then fed into the fully connected and softmax layers of the Faster R-CNN network to compute the category of each proposed target, the network having been pre-trained on a dataset containing only person labels; at the same time, bounding-box regression is applied again to obtain the position offset of each proposed target.
Preferably, in step 5, the two neural networks have identical structures.
The present invention has the following beneficial effects:
The object of the present invention is to overcome the shortcomings of existing pedestrian detection methods for complex scenes by providing a pedestrian detection method that fuses motion information. The method consists of two branches, a moving-target recognition network and a pedestrian detection network. The video is fed into the network to obtain moving-target boxes and pedestrian detection proposal candidate boxes respectively; the two kinds of candidate boxes are fused according to confidence by the non-maximum suppression algorithm and then sorted by box area, with larger boxes fed into a large-size sub-network and smaller boxes into a small-size sub-network for separate classification and regression, after which the outputs are merged. The method provides a pedestrian detection framework that suits low-resolution video and is sensitive to small targets, solving pedestrian detection in complex scenes. For video with low resolution and small pedestrians in the frame, the invention achieves a higher detection rate than other algorithms. The pedestrian detection network of the invention uses motion information to reduce missed detections and detects both dynamic and static pedestrian targets well, whereas the usual existing approach lowers the detection threshold and uses motion information to remove false alarms, which detects static pedestrians poorly. Compared with existing general methods, the pedestrian detection network of the invention improves upon a general-purpose object detection architecture that is more accurate and more sensitive to small targets, and it separates sub-networks by pedestrian target size, so a higher recognition rate can be achieved when pedestrians are far from the surveillance device.
Brief Description of the Drawings
Fig. 1 is a flowchart of the method of the present invention.
Fig. 2 illustrates the recall and precision of the method of the present invention.
Detailed Description of the Embodiments
The present invention is described in detail below with reference to the accompanying drawings and an embodiment.
The present invention proposes a pedestrian detection method based on the Faster R-CNN network that fuses motion information and suits the detection of both dynamic and static pedestrian targets. The method uses background subtraction to obtain target motion information and an RPN network to obtain target proposal candidate boxes, fuses the motion information with the RPN output, feeds the candidate boxes into different classification networks according to their size, and finally obtains pedestrian target positions. It suits pedestrian detection tasks in complex scenes (low video resolution, small pedestrian pixel height, and the like).
As shown in Fig. 1, the pedestrian detection method fusing motion information in complex scenes of the present invention comprises the following steps:
Step 1: Obtain the original video.
The original video is footage captured by a surveillance camera in a natural scene and containing several pedestrian targets. Limited by storage resources, shooting distance, and other factors, its resolution is low, the pixel height of the pedestrian targets is small, and pedestrians may be occluded. The video is then fed into two parallel branches to obtain the pedestrian target boxes.
Step 2: Obtain target detection proposal candidate boxes through the RPN network.
The original video is fed into the first branch and processed into a picture sequence. After M×N scaling, the pictures are input into a convolutional network to obtain a feature map for each picture; these feature maps are input into the RPN network to obtain detection proposal candidate boxes, giving each candidate box's coordinates (x*, y*, w*, h*) in the feature map, where (x*, y*) is the top-left corner of the candidate box and w* and h* are its width and height.
Step 3: Obtain moving-target boxes through the moving-target recognition module.
The original video is fed into the second branch, and the GMM motion detection algorithm obtains each moving target's coordinates (x, y, w, h) in the current frame, where (x, y) is the top-left corner of the moving-target box and w and h are its width and height. These coordinates are M×N scaled and projected onto the feature layer to obtain the transformed coordinates (x*, y*, w*, h*).
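The projection onto the feature layer amounts to rescaling the box by the image resize factors and the backbone stride. The stride of 16 below assumes a VGG16-style backbone, which the patent does not specify, so treat it as an illustrative parameter.

```python
def project_box_to_feature_layer(x, y, w, h, scale_x, scale_y, feat_stride=16):
    """Map a motion box from original-frame coordinates onto the feature map.

    scale_x, scale_y: the M x N resize factors applied to the input frame
    feat_stride:      cumulative stride of the backbone (assumed to be 16)
    """
    x_star = x * scale_x / feat_stride
    y_star = y * scale_y / feat_stride
    w_star = w * scale_x / feat_stride
    h_star = h * scale_y / feat_stride
    return x_star, y_star, w_star, h_star
```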
Step 4: Fuse the target boxes obtained in step 3 with the candidate boxes from step 2 using the non-maximum suppression algorithm to obtain the full set of proposal candidate boxes.
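The fusion can be sketched as a single NMS pass over both box sources. The fixed confidence assigned to the GMM boxes is an assumption made so the two sources can be ranked on a common scale; the patent only states that the fusion uses confidence-based non-maximum suppression.

```python
import torch
from torchvision.ops import nms

def fuse_proposals(rpn_boxes, rpn_scores, motion_boxes, motion_score=0.9, iou_thresh=0.7):
    """Merge RPN proposals with GMM motion boxes through a single NMS pass."""
    boxes = torch.cat([rpn_boxes, motion_boxes], dim=0)
    scores = torch.cat([rpn_scores,
                        torch.full((len(motion_boxes),), motion_score,
                                   dtype=rpn_scores.dtype, device=rpn_scores.device)], dim=0)
    keep = nms(boxes, scores, iou_thresh)  # suppress duplicates, keep higher-confidence boxes
    return boxes[keep], scores[keep]
```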
Step 5: Divide the proposal candidate boxes by size and feed them into different neural networks for classification and regression.
All the proposal candidate boxes obtained are sorted by area; the larger half is fed into the large-size sub-network and the smaller half into the small-size sub-network. The two sub-networks have identical structures and differ only in the size of the candidate boxes they process. Within each sub-network the candidate boxes and feature maps are first pooled by the ROI layer, then fed into the subsequent network, which classifies them to select pedestrian targets and regresses the candidate-box positions to obtain the final target boxes.
Step 6: Output the pedestrian detection results of both sub-networks together, yielding a video annotated with pedestrian target boxes.
Embodiment:
(1) Obtain the original video.
The original video is footage captured by a surveillance camera in a natural scene and containing several pedestrian targets; limited by storage resources, shooting distance, and other factors, its resolution is low and the pixel height of the pedestrian targets is small. Considering datasets that meet these requirements, the CUHK Square dataset of the Chinese University of Hong Kong is chosen.
(2) Obtain target detection proposal candidate boxes through the RPN network.
This stage comprises the following steps:
1) process the video into a picture sequence;
2) scale the pictures and feed them one by one into the convolutional network to obtain feature maps;
3) feed the feature maps into the RPN network:
a. classify every pixel in the picture using anchor boxes of nine different sizes, deciding whether it belongs to an object or to the background (the nine-anchor construction is sketched after this list);
b. regress the anchors to obtain precise parameters;
c. sort the anchors by softmax score and keep the best 2000;
d. map the anchors back to the original image;
e. sort the anchors with the NMS algorithm and output the first 256.
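The nine anchor sizes in sub-step a follow the standard Faster R-CNN construction of three scales times three aspect ratios; the sketch below shows that construction, with the base size, scales, and ratios as assumed defaults rather than values given in the patent.

```python
import numpy as np

def make_anchors(base=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Nine anchor boxes (3 aspect ratios x 3 scales), centered at the origin."""
    anchors = []
    for r in ratios:                        # r is the height/width aspect ratio
        for s in scales:
            size = base * s                 # anchor side length at a 1:1 ratio
            w = size / np.sqrt(r)           # preserve the area while changing the ratio
            h = size * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)                # (9, 4) boxes in (x1, y1, x2, y2)
```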
This yields the target proposal candidate boxes of the first branch.
(3) Obtain moving-target boxes through the moving-target recognition module.
This stage comprises the following steps:
1) perform moving-target recognition on the video with the GMM algorithm (sub-steps a to d are sketched after this list):
a. initialize the background model, the mean, the standard deviation, and the difference threshold;
b. set the threshold parameter and decide, according to whether the mean lies within the threshold range, whether the current pixel belongs to the foreground or the background;
c. update the parameters so the background model keeps learning;
d. repeat steps b and c until the algorithm stops, obtaining the position of every moving target in each frame.
2) scale the moving-target coordinates and project them onto the feature layer.
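Sub-steps a to d can be approximated with OpenCV's Gaussian-mixture background subtractor, as sketched below; MOG2 stands in for the patent's own GMM update loop, and the history length, variance threshold, and minimum blob area are assumed values.

```python
import cv2

def motion_boxes(video_path, min_area=100):
    """Yield per-frame (x, y, w, h) boxes of moving targets from a GMM background model."""
    cap = cv2.VideoCapture(video_path)
    # a. initialize the background model and its thresholds
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # b./c. foreground-background decision plus online background update
        mask = subtractor.apply(frame)
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadow pixels
        mask = cv2.medianBlur(mask, 5)                              # suppress speckle noise
        # [-2] works with both the OpenCV 3.4 and 4.x return conventions
        contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
        # d. one box per sufficiently large moving blob in the current frame
        yield [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
    cap.release()
```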
(4) Fuse the target boxes obtained in step (3) with the candidate boxes from (2) using the non-maximum suppression algorithm to obtain the full set of proposal candidate boxes.
(5) Divide the proposal candidate boxes by size and feed them into different neural networks for classification and regression.
The proposal candidate boxes obtained in step (4) are sorted by area; the top 50% by area are fed into the large-size sub-network and the bottom 50% into the small-size sub-network. Each sub-network takes as input the original feature map and the proposal candidate boxes of its size range output by the previous step, which are sent to the ROI pooling layer for size normalization to compute each proposal's feature map. These feature maps are fed into the fully connected and softmax layers to compute the category of each proposed target; the sub-networks are pre-trained on a dataset containing only person labels, so only the person class and its probability vector are output. At the same time, bounding-box regression is applied again to obtain each proposed target's position offset, which is used to regress a more accurate pedestrian detection box.
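The per-sub-network processing can be sketched with torchvision's ROI pooling. The head module, the 7x7 output size, and the spatial scale below are assumptions in the spirit of Faster R-CNN rather than the patent's exact layers.

```python
import torch
from torchvision.ops import roi_pool

def classify_proposals(feature_map, proposal_boxes, head, spatial_scale=1.0 / 16):
    """ROI-pool each proposal, then score it as person/background and regress offsets.

    feature_map:    (1, C, H, W) backbone output
    proposal_boxes: (K, 4) boxes in image coordinates (x1, y1, x2, y2)
    head:           assumed fc + softmax module returning class scores and box deltas
    """
    rois = roi_pool(feature_map, [proposal_boxes],
                    output_size=(7, 7), spatial_scale=spatial_scale)  # size normalization
    scores, deltas = head(rois.flatten(start_dim=1))  # person probability + position offsets
    return scores, deltas
```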
The present invention also verifies the method's effect through simulation.
The simulation runs under Ubuntu 18.04, CUDA 10.0, cuDNN 7.6, OpenCV 3.4, and PyTorch; it improves upon a Faster R-CNN model built on the PyTorch framework and is implemented in Python.
Testing multiple groups of videos with the 80-class COCO-trained weights released by the Faster R-CNN authors yields an AP (Average Precision) of 72.61%, while the optimized Faster R-CNN network of the present invention reaches an AP of 77.63%, an average-precision gain of 5.02%. Compared with the current state-of-the-art algorithm, the YOLO-and-GMM-based video pedestrian detection method proposed by Li Junyi et al. in 2019, whose AP is 74.82%, the present algorithm improves precision by 2.81%.
Two metrics evaluate the experimental results: Recall and Precision. As shown in Fig. 2, Recall is the ratio of the number of successfully detected pedestrians P_d,t to the total number of pedestrians P_d,t + P_n,t; Precision is the ratio of the number of successfully detected pedestrians P_d,t to the total number of detections P_d,t + P_d,f.
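In code the two metrics reduce to simple count ratios; the counts in the comment are assumed, illustrative numbers chosen only to be consistent with the results reported below.

```python
def recall_precision(p_dt, p_nt, p_df):
    """Recall = P_d,t / (P_d,t + P_n,t); Precision = P_d,t / (P_d,t + P_d,f)."""
    return p_dt / (p_dt + p_nt), p_dt / (p_dt + p_df)

# e.g. 91 detected pedestrians, 9 missed, 16 false detections (assumed counts)
# gives recall 0.91 and precision of about 0.85
```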
After tests on multiple groups of complex-scene videos, the recall and precision of the current state-of-the-art algorithm are 0.77 and 0.51, respectively, while the recall and precision of the present algorithm are 0.91 and 0.85.
In summary, the above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011290529.3A CN112347967B (en) | 2020-11-18 | 2020-11-18 | A Pedestrian Detection Method Fused with Motion Information in Complex Scenes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011290529.3A CN112347967B (en) | 2020-11-18 | 2020-11-18 | A Pedestrian Detection Method Fused with Motion Information in Complex Scenes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112347967A CN112347967A (en) | 2021-02-09 |
CN112347967B true CN112347967B (en) | 2023-04-07 |
Family
ID=74364080
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011290529.3A Active CN112347967B (en) | 2020-11-18 | 2020-11-18 | A Pedestrian Detection Method Fused with Motion Information in Complex Scenes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112347967B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112883906B (en) * | 2021-03-15 | 2021-09-28 | 珠海安联锐视科技股份有限公司 | Personnel state analysis method based on target detection |
CN114550062B (en) * | 2022-02-25 | 2024-12-10 | 京东科技信息技术有限公司 | Method, device, electronic device and storage medium for determining moving objects in images |
2020
- 2020-11-18: CN application CN202011290529.3A granted as patent CN112347967B (en), legal status: active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108564097A (en) * | 2017-12-05 | 2018-09-21 | South China University of Technology | A multi-scale target detection method based on deep convolutional neural networks |
CN109598290A (en) * | 2018-11-22 | 2019-04-09 | Shanghai Jiao Tong University | An image small-target detection method based on combined hierarchical detection |
Non-Patent Citations (2)
Title |
---|
Scale-Aware Fast R-CNN for Pedestrian Detection; Jianan Li et al.; IEEE Transactions on Multimedia; 2018-04-30; vol. 20, no. 4; pp. 985-996 *
Video Pedestrian Detection Method Based on YOLO and GMM; Li Junyi et al.; Journal of China Academy of Electronics and Information Technology; 2019-03-31; vol. 14, no. 3; pp. 265-271 *
Also Published As
Publication number | Publication date |
---|---|
CN112347967A (en) | 2021-02-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |