
CN112257527B - Mobile phone detection method based on multi-target fusion and space-time video sequence - Google Patents

Mobile phone detection method based on multi-target fusion and space-time video sequence

Info

Publication number: CN112257527B (application CN202011079614.5A)
Authority: CN (China)
Prior art keywords: frame, mobile phone, anchor, detection, video image
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112257527A
Inventors: 龚勋, 王琛中, 王立
Current Assignee: Southwest Jiaotong University
Original Assignee: Southwest Jiaotong University
Application filed by Southwest Jiaotong University
Priority to CN202011079614.5A (filed 2020-10-10)
Publication of CN112257527A: 2021-01-22
Publication of CN112257527B (grant): 2022-09-02

Classifications

    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Neural networks; Combinations of networks
    • G06V40/10 Recognition of human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • Y02D30/70 Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a mobile phone detection method based on multi-target fusion and spatio-temporal video sequences. The method comprises: training an improved YOLO model to obtain a detection model, and running the detection model on input video frames to obtain a first-frame prediction; decoding the prediction, discarding boxes whose score falls below a preset value, and performing NMS with a DIoU threshold; when the decoding result of a frame contains only a phone box, suppressing that phone box; taking the suppressed result as the target template and the incoming video frame as the candidate search region, feeding both into a fully connected Siamese network, and marking the phone in the frame with the box whose score-map similarity is highest; and, once the set number of frames has been tracked, repeating the above steps until the video input ends. Built on a lightweight network from the one-stage family of detection algorithms, the invention refines the network structure and the training and detection procedures, achieving higher detection accuracy without reducing detection speed.

Description

Mobile phone detection method based on multi-target fusion and spatio-temporal video sequences

Technical Field

The invention relates to the technical field of image processing, and in particular to a mobile phone detection method based on multi-target fusion and spatio-temporal video sequences.

Background Art

Detection accuracy and speed have always been the core issues in object detection. To obtain more accurate detections, practitioners usually choose a heavyweight, high-precision detection algorithm, which severely limits the system's inference speed on mobile devices.

Chinese invention patent application No. 202010048048.5 discloses an intelligent monitoring method, device and readable medium for detecting and preventing photography with mobile phones. An intelligent monitoring system performs machine learning on the appearance of a large number of mobile phones; camera probes are installed at sites where photography must be prevented and communicate with the monitoring system in real time; the cameras stream captured images to the system, which identifies whether a mobile phone is present; if one is, the system judges from the footage whether it is being used to take pictures, and if so outputs alarm information in real time so that staff can intervene promptly. That approach performs preliminary detection with a Darknet53-backbone detector and then monitors with skeleton generation, action recognition and similar methods; other methods use similar algorithms for coarse localization followed by a whole-to-part search. All such schemes leave the detection system far from real-time inference on mobile devices.

Summary of the Invention

The purpose of the present invention is to overcome the shortcomings of the prior art by providing a mobile phone detection method based on multi-target fusion and spatio-temporal video sequences, addressing the deficiencies of existing detection methods.

The object of the present invention is achieved through the following technical solution: a mobile phone detection method based on multi-target fusion and spatio-temporal video sequences, the method comprising:

training an improved YOLO model to obtain a detection model, and running the detection model on input video frames to obtain a first-frame prediction;

decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing NMS (non-maximum suppression) with a DIoU threshold, and, when the decoding result of a frame contains only a phone box, suppressing that phone box;

taking the suppressed result as the target template and the input video frame as the candidate search region, feeding both into a fully connected Siamese network, and marking the phone in the frame with the box whose score-map similarity is highest;

if the set number of frames has been tracked, repeating the above steps until the video input ends.

Further, the method also comprises: if the set number of frames has not yet been tracked, repeating the step of taking the suppressed result as the target template, inputting the video frame as the candidate search region, feeding both into the fully connected Siamese network, and marking the phone in the frame with the box whose score-map similarity is highest.
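
To make the detect-then-track alternation concrete, the following is a minimal Python sketch of the loop. It is an illustration only: run_detector, suppress_phone_boxes and siamese_track are hypothetical stand-ins (passed in as callables) for the detection model, the suppression rule and the fully connected Siamese network described above, and track_span stands for the "set number of frames", whose value the patent does not fix.

    def process_video(frames, run_detector, suppress_phone_boxes, siamese_track,
                      track_span=10):
        """Detect-then-track loop: re-detect, then track for track_span frames.
        track_span=10 is an assumed value for the 'set number of frames'."""
        results = []
        template = None   # target template produced by the detection phase
        tracked = 0       # frames tracked since the last detection
        for frame in frames:
            if template is None or tracked >= track_span:
                # Detection phase: decode, score-threshold, DIoU-NMS, then
                # suppress phone boxes lacking person/hand/camera support.
                boxes = suppress_phone_boxes(run_detector(frame))
                template = boxes
                tracked = 0
                results.append(boxes)
            else:
                # Tracking phase: match the template over the new frame and
                # keep the highest-scoring location on the score map.
                results.append(siamese_track(template, frame))
                tracked += 1
        return results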

Further, the method also comprises a step of obtaining a training set and a test set before the step of training the improved YOLO model to obtain a detection model and running it on input video frames to obtain a first-frame prediction.

Further, the step of obtaining the training set and the test set comprises: splitting recorded video into frames, labeling the resulting pictures, extracting a subset of pictures at frame intervals to build a data set, and dividing the data set into a training set and a test set in a fixed ratio.

Further, decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing NMS with a DIoU threshold, and suppressing the phone box when the decoding result of a frame contains only a phone box comprises:

decoding the first-frame prediction according to the formulas bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw·e^tw, bh = ph·e^th, conf = sigmoid(raw_conf), and prob = sigmoid(raw_prob);

removing boxes whose confidence or class probability fails the requirement with a score threshold of 0.4, and performing NMS with a DIoU threshold of 0.1 (a code sketch of this step follows the list);

for the decoding result of a given frame, if a phone box appears without any human-body box, hand box or camera box, removing the phone prediction boxes from that image, thereby suppressing the phone box.
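
A minimal NumPy sketch of the score filtering and DIoU-based NMS described above; the [x1, y1, x2, y2] box layout is an assumption, since the patent does not fix a box format. DIoU subtracts a normalized center-distance penalty from the ordinary IoU before the overlap test.

    import numpy as np

    def diou(a, b):
        """DIoU between one box a and an array of boxes b, as [x1, y1, x2, y2]."""
        x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
        x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area_a + area_b - inter + 1e-9)
        # Squared distance between box centers
        d2 = ((a[0] + a[2]) - (b[:, 0] + b[:, 2])) ** 2 / 4 \
           + ((a[1] + a[3]) - (b[:, 1] + b[:, 3])) ** 2 / 4
        # Squared diagonal of the smallest enclosing box
        cw = np.maximum(a[2], b[:, 2]) - np.minimum(a[0], b[:, 0])
        ch = np.maximum(a[3], b[:, 3]) - np.minimum(a[1], b[:, 1])
        return iou - d2 / (cw ** 2 + ch ** 2 + 1e-9)

    def diou_nms(boxes, scores, score_thr=0.4, diou_thr=0.1):
        """Keep boxes above score_thr, then suppress by DIoU overlap."""
        mask = scores >= score_thr
        boxes, scores = boxes[mask], scores[mask]
        order = np.argsort(-scores)
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            if order.size == 1:
                break
            rest = order[1:]
            order = rest[diou(boxes[i], boxes[rest]) <= diou_thr]
        return boxes[keep], scores[keep]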

Further, the improvements to the YOLO model include the following:

adding to yolov3-tiny an s branch for detecting small objects, improving the detection of small objects such as cameras;

on the basis of the model structure of the previous step, adding SPP (Spatial Pyramid Pooling), SAM (Spatial Attention Module) and CAM (Channel Attention Module) modules together with residual connections, improving the feature extraction capability.
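
As one illustration of these additions, a minimal PyTorch sketch of the SPP block follows; the patent gives no layer sizes, so the 5/9/13 pooling windows below follow common YOLO practice and are an assumption, and the SAM/CAM attention modules are omitted.

    import torch
    import torch.nn as nn

    class SPP(nn.Module):
        """Spatial Pyramid Pooling: parallel max-pools at several window
        sizes, concatenated with the input along the channel axis."""
        def __init__(self, kernels=(5, 9, 13)):  # assumed kernel sizes
            super().__init__()
            self.pools = nn.ModuleList(
                nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
                for k in kernels)

        def forward(self, x):
            return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

    # A 256-channel feature map becomes 256 * 4 = 1024 channels:
    print(SPP()(torch.randn(1, 256, 13, 13)).shape)  # [1, 1024, 13, 13]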

The invention has the following advantages: the mobile phone detection method based on multi-target fusion and spatio-temporal video sequences builds on a lightweight detection network from the one-stage family of detection algorithms and makes fine-grained modifications to the network structure and to the training and detection procedures, obtaining higher detection accuracy without reducing detection speed. At the same time, a tracking algorithm follows the detected targets, handling hard samples with heavy occlusion or tilted viewing angles while reducing the system's resource consumption, which greatly improves the overall inference speed on mobile devices.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of the method of the present invention.

Detailed Description of the Embodiments

To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. The components of the embodiments, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments provided in the accompanying drawings is not intended to limit the claimed scope of the application but merely represents selected embodiments. All other embodiments obtained by those skilled in the art based on these embodiments without creative work fall within the protection scope of the present application. The present invention is further described below with reference to the accompanying drawings.

As shown in FIG. 1, the present invention relates to a mobile phone detection method based on multi-target fusion and spatio-temporal video sequences, which specifically comprises the following:

S1. Split video recorded by cameras in the actual application scene into frames, and randomly extract a subset of pictures at frame intervals to build a data set. Use the LabelImg annotation tool to label the mobile phone, human body, hand and camera in each image, and divide the data set into a training set and a test set in a fixed ratio.
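
A minimal sketch of this data preparation step, assuming OpenCV for frame extraction; the sampling interval and the 9:1 split ratio are assumed values, since the patent only says "a certain proportion", and annotation itself is done separately in LabelImg.

    import cv2
    import random

    def build_dataset(video_path, every_n=5, train_ratio=0.9):
        """Split a recorded video into frames, keep every n-th frame,
        then divide the kept frames into training and test sets."""
        cap = cv2.VideoCapture(video_path)
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                frames.append(frame)
            idx += 1
        cap.release()
        random.shuffle(frames)
        split = int(len(frames) * train_ratio)
        return frames[:split], frames[split:]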

S2. Train the detection model with the improved yolov3 network. The training input is the training-set images with their labels; the network outputs the predicted offsets tx, ty, tw, th, the raw confidence and the raw class probabilities.

Further, during training, a focal loss is used for the confidence loss. Considering that the positive/negative sample imbalance of the yolov3 network is far milder than that of RetinaNet, α is set to 0.4. The confidence loss is computed as follows:

Lfocal = -αt · (1 - pt)^γ · log(pt)
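
A minimal PyTorch sketch of this confidence loss; α = 0.4 follows the text above, while γ = 2.0 is the usual focal-loss default and is an assumption here.

    import torch

    def focal_confidence_loss(raw_conf, targets, alpha=0.4, gamma=2.0):
        """Focal loss on the objectness confidence. targets is 1 for an
        object, 0 for background; raw_conf are the raw network logits."""
        p = torch.sigmoid(raw_conf)
        p_t = torch.where(targets == 1, p, 1 - p)          # prob of true class
        a_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))   # alpha_t weighting
        return (-a_t * (1 - p_t) ** gamma * torch.log(p_t + 1e-9)).mean()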

S3. Run the detection model to obtain the prediction for the first frame.

S4. Decode the prediction according to the following formulas, remove boxes with low confidence or class probability using a score threshold of 0.40, and perform NMS with a DIoU threshold of 0.1:

bx = sigmoid(tx) + cx

by = sigmoid(ty) + cy

bw = pw · e^tw

bh = ph · e^th

conf = sigmoid(raw_conf)

prob = sigmoid(raw_prob)

where bx, by, bh, bw denote the center coordinates and the height and width of the predicted box; ph and pw denote the height and width of the prior box; tx and ty are the predicted offsets of the object center from the top-left corner of the grid cell; tw and th are the predicted offsets of the object relative to the prior box; cx and cy are the coordinates of the grid cell's top-left corner; and score = conf (confidence) × prob (class probability).
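
These formulas translate directly into code; a minimal NumPy sketch follows (grid-relative coordinates, with the multiplication by the stride to get pixel units omitted).

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def decode(tx, ty, tw, th, raw_conf, raw_prob, cx, cy, pw, ph):
        """Decode one raw prediction into a box, following the formulas above."""
        bx = sigmoid(tx) + cx        # box center x within the grid
        by = sigmoid(ty) + cy        # box center y within the grid
        bw = pw * np.exp(tw)         # width scaled from the prior box
        bh = ph * np.exp(th)         # height scaled from the prior box
        conf = sigmoid(raw_conf)     # objectness confidence
        prob = sigmoid(raw_prob)     # class probability
        return bx, by, bw, bh, conf * prob   # last value is the score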

S5. For the decoding result of a given frame, if a phone box appears without any human-body box, hand box or camera box, suppress the phone box.
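
A minimal sketch of this suppression rule; the (class_name, box, score) tuple format and the class names are assumptions, since the patent does not fix a detection format.

    def suppress_phone_boxes(detections):
        """Drop phone boxes from a frame whose detections contain no
        human-body, hand or camera box."""
        support = {"person", "hand", "camera"}   # assumed class names
        if any(cls in support for cls, _, _ in detections):
            return detections
        return [d for d in detections if d[0] != "phone"]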

S6. Take the suppressed result as the target template and the video frame as the candidate search region, and feed both into the fully connected Siamese network; template matching yields the similarity score map.

S7. Select the result with the highest similarity and draw the bounding box around the phone in the video frame.
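
Steps S6-S7 amount to correlating the template's embedding over the search region's embedding and taking the argmax of the resulting score map. A minimal PyTorch sketch, assuming both inputs are feature maps already produced by the shared Siamese branch (the embedding network itself is omitted):

    import torch
    import torch.nn.functional as F

    def best_match(template_feat, search_feat):
        """template_feat: (C, h, w); search_feat: (C, H, W), H >= h, W >= w.
        Returns the score map and the (row, col) of its maximum."""
        scores = F.conv2d(search_feat.unsqueeze(0),       # (1, C, H, W)
                          template_feat.unsqueeze(0))     # (1, C, h, w) kernel
        scores = scores.squeeze()                          # (H-h+1, W-w+1)
        row, col = divmod(int(scores.argmax()), scores.shape[1])
        return scores, (row, col)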

S8. Judge whether the set number of frames has been tracked; if not, repeat steps S6-S8; if so, go to step S9.

S9. Repeat steps S3-S9 until the video input ends.

In terms of multi-target association, the contributions of the present invention are as follows:

It was found that the GIoU (Generalized Intersection over Union)-based location loss (the location loss used in the present invention) exhibits an imbalance opposite to that of the squared-difference location loss. To address this, the average label-box size and average location loss of the s, m and l branches are gathered statistically and, combined with the proportion of boxes in each branch, a negative exponential function (a·e^(-b/x)) is used as the base function to fit and correct the imbalance, solving the size-dependent imbalance of the GIoU-based location loss.

Under the premise that, with enough data, the average location losses of the branches should be nearly equal, the average label-box size and average location loss of the s, m and l branches are gathered during the first warm-up epoch (the warm-up period, i.e. the early phase of training when the learning rate is small) and, combined with the proportion of boxes in each branch, the negative exponential function (a·e^(-b/x)) is fitted as the base function to correct the imbalance, adjusting the per-branch location-loss weights for later iterations. A sketch of this fit is given below.
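
Since ln(a·e^(-b/x)) = ln a - b/x is linear in 1/x, the fit reduces to a least-squares line. A minimal NumPy sketch with hypothetical warm-up statistics; the box sizes, losses and the equalizing reweighting below are illustrative assumptions, not values from the patent, and the patent additionally folds in the box-count ratio of each branch.

    import numpy as np

    # Hypothetical first-warm-up-epoch statistics for the s, m, l branches:
    mean_box_size = np.array([14.0, 52.0, 170.0])   # average label-box size
    mean_loc_loss = np.array([0.62, 0.41, 0.23])    # average GIoU location loss

    # Fit ln(y) = ln(a) - b/x by least squares in the variable 1/x.
    slope, ln_a = np.polyfit(1.0 / mean_box_size, np.log(mean_loc_loss), deg=1)
    a, b = np.exp(ln_a), -slope

    # Reweight each branch so the fitted losses are equalized (one plausible
    # reading of the correction described above).
    fitted = a * np.exp(-b / mean_box_size)
    branch_weights = fitted.mean() / fitted
    print(branch_weights)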

The label-rewriting problem in YOLO was identified: the anchor box assigned to one object may be overwritten by a later object, so the overwritten assignment is never trained. The specific improvement proceeds as follows (a sketch follows this list):

if an anchor box has already been labeled by an original object, judge whether the original object has a unique box;

if the original object has a unique box, judge whether the current object can be assigned another anchor; if one exists, the current object cancels its claim on this anchor box; otherwise the current object searches downward for the next anchor box with the highest IoU and assigns to that;

if the original object has no unique box, judge whether this anchor is the current object's highest-IoU anchor and a non-highest-IoU anchor of the original object, in which case the original assignment is overwritten; judge whether this anchor is a non-highest-IoU anchor of the current object and the highest-IoU anchor of the original object, in which case judge whether the current object can be assigned another anchor: if one exists, the current object cancels its claim on this anchor box, otherwise the original assignment is overwritten; and, if the anchor is a non-highest-IoU anchor for both the current object and the original object, the one with the lower IoU is overwritten.
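
The following Python sketch gives one plausible reading of these rules; the helper inputs (iou, a per-object mapping from anchors to IoU values, orig_has_only_this_anchor, and next_best_free_anchor) are hypothetical, and the exact tie-breaking in the original text admits other readings.

    def resolve_label_rewrite(anchor, orig, new, iou,
                              orig_has_only_this_anchor, next_best_free_anchor):
        """Decide who keeps `anchor` when object `new` wants an anchor already
        labeled by object `orig`. Returns (owner, reassignment-or-None)."""
        def is_best(obj):
            return anchor == max(iou[obj], key=iou[obj].get)

        if orig_has_only_this_anchor:
            # Original object would lose its only training sample: the new
            # object yields and takes its next-best unassigned anchor instead.
            return orig, (new, next_best_free_anchor(new))
        if is_best(new) and not is_best(orig):
            return new, None                   # overwrite the original label
        if is_best(orig) and not is_best(new):
            alt = next_best_free_anchor(new)
            return (orig, (new, alt)) if alt is not None else (new, None)
        # Non-best anchor for both objects: the lower IoU loses the anchor.
        return (new, None) if iou[new][anchor] > iou[orig][anchor] else (orig, None)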

Considering that the phone and the other auxiliary detection targets need a primary/auxiliary distinction, all losses of the phone class are multiplied by a priority coefficient; in the present invention this coefficient is 1.10.

The threshold produced by ATSS (Adaptive Training Sample Selection) is bounded: when it is smaller than a certain value, the training samples it would select are considered too low-quality, so the threshold-based selection is abandoned and only the candidate training sample with the highest IoU is selected. In the present invention this value is 0.10. A sketch of both rules follows.
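
Both rules are short in practice; a minimal sketch, where the class name "phone" and the list-based candidate format are assumptions:

    def loss_weight(class_name, priority=1.10):
        """Scale every loss term of the primary phone class by 1.10."""
        return priority if class_name == "phone" else 1.0

    def select_positives(candidate_ious, atss_threshold, floor=0.10):
        """ATSS selection with the floor described above: below 0.10 the
        adaptive threshold is discarded and only the highest-IoU candidate
        among the samples to be selected is kept."""
        if atss_threshold < floor:
            return [max(range(len(candidate_ious)),
                        key=candidate_ious.__getitem__)]
        return [i for i, v in enumerate(candidate_ious) if v >= atss_threshold]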

The multiple target objects are associated at essentially no extra computational cost, reducing the computational resource consumption of cognition-style detection schemes.

In terms of spatio-temporal information fusion, the present invention exploits contextual information along both the temporal and the spatial dimensions, significantly alleviating the occlusion and drift problems during tracking.

The above are only preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be altered within the scope of the concepts described herein through the above teachings or the skill or knowledge of the relevant field. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims.

Claims (6)

1. A mobile phone detection method based on multi-target fusion and spatio-temporal video sequences, characterized in that the method comprises the following steps:
A1, training an improved YOLO model to obtain a detection model, wherein the YOLO model is improved specifically as follows:
if an anchor box has already been labeled by an original object, judging whether the original object has a unique box;
if the original object has a unique box, judging whether the current object can be assigned another anchor; if so, the current object cancels its claim on this anchor box; otherwise the current object searches for the next anchor box with the highest IoU and assigns to that;
if the original object has no unique box, judging whether this anchor is the current object's highest-IoU anchor and a non-highest-IoU anchor of the original object, in which case the original assignment is overwritten; judging whether this anchor is a non-highest-IoU anchor of the current object and the highest-IoU anchor of the original object, in which case judging whether the current object can be assigned another anchor: if so, the current object cancels its claim on this anchor box, otherwise the original assignment is overwritten; and, if the anchor is a non-highest-IoU anchor for both the current object and the original object, the one with the lower IoU is overwritten;
A2, inputting a video image frame and running the detection model to obtain a first-frame prediction;
A3, decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing NMS (non-maximum suppression) with a DIoU threshold, and suppressing the phone box when the decoding result of a frame contains only a phone box;
A4, taking the suppressed result as the target template, inputting a video image frame as the candidate search region, feeding both into a fully connected Siamese network, and selecting the result with the highest score-map similarity to mark the phone in the video image frame;
A5, if the set number of frames has been tracked, repeating steps A2-A4 until the video image input ends.
2. The mobile phone detection method based on multi-target fusion and spatio-temporal video sequences according to claim 1, characterized in that: if the set number of frames has not yet been tracked, the steps of taking the suppressed result as the target template, inputting a video image frame as the candidate search region, feeding both into the fully connected Siamese network, and selecting the result with the highest score-map similarity to mark the phones in the video image frame are repeated.
3. The mobile phone detection method based on multi-target fusion and spatio-temporal video sequences according to claim 1, characterized in that: the method further comprises a step of acquiring a training set and a test set before the step of training the improved yolov3 model to obtain a detection model and inputting video image frames to run the detection model to obtain a first-frame prediction.
4. The mobile phone detection method based on multi-target fusion and spatio-temporal video sequences according to claim 3, characterized in that: the step of obtaining the training set and test set comprises: splitting the recorded video into frames, labeling the resulting video pictures, extracting a subset of pictures at frame intervals to build a data set, and dividing the data set into a training set and a test set in a fixed ratio.
5. The mobile phone detection method based on multi-target fusion and spatio-temporal video sequences according to claim 1, characterized in that decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing NMS with a DIoU threshold, and suppressing the phone box when the decoding result of a frame contains only a phone box comprises the following steps:
decoding the first-frame prediction according to the formulas bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw·e^tw, bh = ph·e^th, conf = sigmoid(raw_conf) and prob = sigmoid(raw_prob), wherein bx, by, bh and bw respectively denote the center coordinates and the height and width of the predicted box, ph and pw respectively denote the height and width of the prior box, tx and ty denote the predicted offsets of the object center from the top-left corner of the grid cell, tw and th denote the predicted offsets of the object relative to the prior box, cx and cy denote the coordinates of the grid cell's top-left corner, conf is the confidence, and prob is the class probability;
removing the boxes whose confidence or class probability does not meet the requirement with a score threshold of 0.4, and performing NMS with a DIoU threshold of 0.1;
and, for the decoding result of the frame, if a phone box appears without any human-body box, hand box or camera box, removing the phone prediction boxes from the corresponding image, thereby suppressing the phone box.
6. The mobile phone detection method based on multi-target fusion and spatio-temporal video sequences according to claim 1, characterized in that the improvements to the YOLO model include the following:
adding to yolov3-tiny an s branch for detecting small objects, to improve the detection of small objects such as cameras;
on the basis of the model structure of the previous step, adding the SPP, SAM and CAM modules together with residual connections, improving the feature extraction capability.
CN202011079614.5A 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence Active CN112257527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011079614.5A CN112257527B (en) 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011079614.5A CN112257527B (en) 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence

Publications (2)

Publication Number Publication Date
CN112257527A CN112257527A (en) 2021-01-22
CN112257527B true CN112257527B (en) 2022-09-02

Family

ID=74242754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011079614.5A Active CN112257527B (en) 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence

Country Status (1)

Country Link
CN (1) CN112257527B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967289A (en) * 2021-02-08 2021-06-15 上海西井信息科技有限公司 Security check package matching method, system, equipment and storage medium
CN112733821B (en) * 2021-03-31 2021-07-02 成都西交智汇大数据科技有限公司 Target detection method fusing lightweight attention model
CN113139092B (en) * 2021-04-28 2023-11-03 北京百度网讯科技有限公司 Video searching method and device, electronic equipment and medium


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108614894A (en) * 2018-05-10 2018-10-02 西南交通大学 A kind of face recognition database's constructive method based on maximum spanning tree
WO2020047854A1 (en) * 2018-09-07 2020-03-12 Intel Corporation Detecting objects in video frames using similarity detectors
CN109508710A (en) * 2018-10-23 2019-03-22 东华大学 Based on the unmanned vehicle night-environment cognitive method for improving YOLOv3 network
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A pedestrian detection method in orchard based on YOLOv3 algorithm
CN110472467A (en) * 2019-04-08 2019-11-19 江西理工大学 The detection method for transport hub critical object based on YOLO v3
CN110443210A (en) * 2019-08-08 2019-11-12 北京百度网讯科技有限公司 A kind of pedestrian tracting method, device and terminal
CN110619309A (en) * 2019-09-19 2019-12-27 天津天地基业科技有限公司 Embedded platform face detection method based on octave convolution sum YOLOv3
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 A method and device for visual multi-target tracking based on deep learning
AU2020100705A4 (en) * 2020-05-05 2020-06-18 Chang, Jiaying Miss A helmet detection method with lightweight backbone based on yolov3 network
CN111753767A (en) * 2020-06-29 2020-10-09 广东小天才科技有限公司 Method and device for automatically correcting operation, electronic equipment and storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
A Temporal Sequence Dual-Branch Network for Classifying Hybrid Ultrasound Data of Breast Cancer;Ziqi Yang 等;《IEEE Access》;20200427;第8卷;82688-82699 *
Pedestrian Alignment Network for Large-scale Person Re-Identification;Zhedong Zheng 等;《IEEE Transactions on Circuits and Systems for Video Technology》;20181004;第29卷(第10期);3037-3045 *
Speed-Up of Object Detection Neural Network with GPU;Takuya Fukagai 等;《2018 25th IEEE International Conference on Image Processing (ICIP)》;20180906;301-305 *
YOLO v3-Tiny: Object Detection and Recognition using one stage improved model;Pranav Adarsh 等;《2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)》;20200423;687-694 *
基于双阈值-非极大值抑制的Faster R-CNN改进算法;侯志强 等;《光电工程》;20191231;第46卷(第12期);82-92 *
基于增强型Tiny-YOLOV3模型的野鸡识别方法;易诗 等;《农业工程学报》;20200731;第36卷(第13期);141-147 *
面向车间人员宏观行为数字孪生模型快速构建的小目标智能检测方法;刘庭煜 等;《计算机集成制造系统》;20190615;第25卷(第6期);1463-1473 *

Also Published As

Publication number Publication date
CN112257527A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
US11783491B2 (en) Object tracking method and apparatus, storage medium, and electronic device
CN112257527B (en) Mobile phone detection method based on multi-target fusion and space-time video sequence
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
WO2019141741A1 (en) License plate reader using optical character recognition on plural detected regions
US20100086213A1 (en) Image recognition apparatus and image recognition method
CN104615986B (en) The method that pedestrian detection is carried out to the video image of scene changes using multi-detector
CN111914761A (en) Thermal infrared face recognition method and system
CN108322724B (en) Image solid matching method and binocular vision equipment
CN103971386A (en) Method for foreground detection in dynamic background scenario
CN115082855A (en) Pedestrian occlusion detection method based on improved YOLOX algorithm
CN113901911B (en) Image recognition method, image recognition device, model training method, model training device, electronic equipment and storage medium
CN113139896A (en) Target detection system and method based on super-resolution reconstruction
CN105243356A (en) Method of building pedestrian detection model and device and pedestrian detection method
CN114022837A (en) Station left article detection method and device, electronic equipment and storage medium
CN117974792B (en) Ship target detection positioning method based on vision and AIS data cooperative training
CN110598698A (en) Natural scene text detection method and system based on adaptive regional suggestion network
CN111738036A (en) Image processing method, device, equipment and storage medium
CN112287906A (en) Template matching tracking method and system based on depth feature fusion
CN112069997A (en) A method and device for autonomous landing target extraction of unmanned aerial vehicles based on DenseHR-Net
WO2023070955A1 (en) Method and apparatus for detecting tiny target in port operation area on basis of computer vision
CN116740528A (en) A method and system for target detection in side scan sonar images based on shadow features
CN115719368A (en) Multi-target ship tracking method and system
CN111626241A (en) Face detection method and device
CN113362390B (en) Rapid circular target positioning video processing method based on ellipse detection
CN115346051A (en) Optical remote sensing image detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant