
CN118097112B - Deep learning-based fusion double-background article carry-over loss detection method - Google Patents

Deep learning-based fusion double-background article carry-over loss detection method Download PDF

Info

Publication number
CN118097112B
CN118097112B (application CN202410297353.6A)
Authority
CN
China
Prior art keywords
background
frame
picture
target
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410297353.6A
Other languages
Chinese (zh)
Other versions
CN118097112A (en)
Inventor
刘星
唐自兴
陈苗苗
杨运红
江发钦
杨亮亮
李志洋
申雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Raysharp Technology Co ltd
Original Assignee
Zhuhai Raysharp Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Raysharp Technology Co ltd filed Critical Zhuhai Raysharp Technology Co ltd
Priority to CN202410297353.6A
Publication of CN118097112A
Application granted
Publication of CN118097112B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/82: Arrangements for image or video recognition or understanding using neural networks
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a deep-learning-based method for detecting left-behind and lost articles by fusing dual backgrounds. The method comprises the following steps: S1, generating a left-behind/lost article data set; S2, training and testing a detection model; S3, applying the model on the device side; and so on. By fusing dual-background modeling with a target detection network, a data set is generated and a target detection model is trained. The trained network model directly outputs the target position together with whether the article was left behind or lost, which overcomes the poor foreground detection of traditional background modeling and removes the limitation that only specific kinds of articles can be detected.

Description

Deep learning-based fusion double-background article carry-over loss detection method
Technical Field
The invention relates to the technical field of security monitoring, and in particular to a deep-learning-based method that fuses dual backgrounds to detect left-behind and lost articles.
Background
Detection of left-behind and lost articles is an important branch of the security field. Left-behind detection is mainly used in public places: after an article has been abandoned for a period of time, its position is indicated and an alarm is raised. Lost-article detection is mainly used to check whether valuables have been removed and to raise a timely alarm.
The prior art falls broadly into two categories. The first is the traditional approach based on background modeling: the monitored scene is modeled with methods such as Gaussian mixture models or dual-background models, the foreground is detected against the background model, the precise target position is then determined with morphological operations, and finally a decision is made as to whether the article was left behind or lost. The second is the deep-learning approach based on a target detection convolutional neural network: each frame is passed through the detection network, which directly outputs the target classes and their coordinates in the image, and suspicious targets are then tracked to decide whether an article was left behind or lost.
Both approaches currently have drawbacks. In the first, the detected foreground depends heavily on the quality of the background model; real scenes are usually complex, false and missed alarms occur easily, and building and updating the background is cumbersome. Moreover, the detected foreground cannot directly tell whether an article was lost or left behind; even with additional heuristics, such as comparing the connected region of the current-frame foreground with the corresponding region in the background, or comparing the color of the foreground region with that of its surroundings, the misjudgment rate remains high. In the second approach, the detection network is limited by its training data set, so only specific articles such as luggage and backpacks can be detected rather than arbitrary objects, and detection quality again depends heavily on the data set. In addition, whether an article was left behind or lost is decided by tracking, so the result depends on the quality of the tracking algorithm, and false or missed alarms occur easily in scenes with motion blur, occlusion, or target ID switches.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a deep-learning-based method for detecting left-behind and lost articles by fusing dual backgrounds. Dual-background modeling is fused with a target detection network to generate a data set and train a target detection model; the trained network model directly outputs the target position together with whether the article was left behind or lost, which overcomes the poor foreground detection of traditional background modeling and removes the limitation that only specific articles can be detected.
To achieve this, the invention provides a deep-learning-based method for detecting left-behind and lost articles by fusing dual backgrounds, which specifically comprises the following steps:
S1, generating the left-behind/lost article data set
S11, randomly selecting a number of background images and segmentation-target images from the COCO data set, and putting the image ids and the corresponding annotation information into lists so that an image can be looked up by its id;
S12, selecting a background image from the list, randomly selecting several segmentation targets from the list, randomly choosing a coordinate position on the background image as the center point of each segmentation target, and then pasting the targets onto the background using mask and image operations;
S13, simulating article movement: max_frame_num specifies how many frames of data are generated for one background image; stop and static are then randomly generated within the range of max_frame_num; the target is moved by randomly generated per-frame pixel offsets, and the initial position init_boxs and the stopping position static_boxs of the target are recorded during the movement;
S14, labeling "lost" and "left-behind" within a certain range of frames after the target starts moving or comes to rest;
S15, converting the currently processed RGB picture into a grayscale image, initializing the fast background and the slow background with the first frame's background image with the segmentation targets attached, updating the backgrounds according to their update coefficients, and fusing the current frame, the fast background and the slow background into a three-channel picture for storage;
S16, repeating steps S11 to S15 for each background image in the list, finally generating a data set of over sixty thousand pictures with corresponding annotation files, where the annotations contain the two classes, left-behind and lost, and the target rectangle positions;
S2, training and testing the detection model
S21, selecting the yolov framework as the training backbone, pruning and modifying the network layer structure so that the network depth is doubled and the network width is halved, adding RepVGG modules, and replacing the SiLU activation layer, which the device side does not support, with the supported LeakyReLU;
S22, dividing the generated data set into a training set and a validation set at a ratio of 8:1, modifying the network and data-set configuration files so that the detection classes are left-behind and lost, and training 300 epochs on a multi-GPU server;
S23, building a test set by collecting videos containing left-behind and lost articles, converting the original videos frame by frame into three-channel fusion images of the current frame and the fast and slow backgrounds using the same procedure and parameters as for data-set generation, testing the trained model on these images, adjusting the data-set generation parameters according to the test results, regenerating the data set and retraining, and iterating in this way to keep improving detection until a suitable parameter combination is found and a left-behind/lost article detection model with the expected performance is obtained;
S3, device-side application
S31, obtaining the current YUV frame from the camera video stream, scaling it to the input size required by the model, converting it to a grayscale image and casting it to float;
S32, updating the fast and slow backgrounds: for the first frame both backgrounds are initialized directly from the current frame's grayscale image, and afterwards they are updated with their update coefficients; the current frame and the fast and slow backgrounds are then fused into a three-channel picture and converted to uint8;
S33, feeding the fused picture into the board-side model for inference, and obtaining the detection results, including target coordinates, classes and confidences, through post-processing;
S34, filtering by manually set conditions, discarding detections whose confidence is below the set threshold or whose center point does not lie within the rule area;
S35, raising the corresponding alarm according to whether the detected class is left-behind or lost; if neither is present, the scene is in a normal state.
Preferably, in step S11, the annotation information corresponding to each image id contains the class, position and mask of the segmentation targets in that picture.
Preferably, in step S12, before pasting with mask and image operations, the segmentation target may be scaled appropriately based on its width and height and the aspect ratio of the background image, so that it is neither too large nor too small relative to the background.
Preferably, in step S13, max_frame_num is set to 600, and stop and static are randomly generated within the range of max_frame_num, where static indicates for how many frames the article first stays still before it starts moving, and stop indicates the frame at which the article stops.
Preferably, in step S14, the labeled frame range for moving and staying is set to 40 to 110 frames: once a segmentation target starts moving away from its initial position init_boxs, that position is labeled as a lost target for frames 40 to 110, and once it comes to rest at the fixed position static_boxs after moving, that position is labeled as a left-behind target for frames 40 to 110.
Preferably, in step S15, the background update formula is B = (1-k)×B + k×A, where A is the current frame, B is the background, and k is the update coefficient with a value range of 0 to 1.
Preferably, in step S13, to reduce false alarms caused by environmental changes, preprocessing may be added when generating the data-set pictures: using the albumentations library, preprocessing steps including brightness change, color jitter, added noise, blurring, random rain, random fog and random shadow are applied to the current frame, each with a probability of 0.5 so that they combine randomly and increase data diversity, and each frame is preprocessed at all with a probability of 0.2.
The deep-learning-based dual-background-fusion method for detecting left-behind and lost articles has the following beneficial effects:
1) Dual-background modeling is fused with a target detection network; a data set is generated, a target detection model is trained, and the trained network model directly outputs the target position together with whether the article was left behind or lost. This overcomes the poor foreground detection of traditional background modeling; moreover, whereas an ordinary target detection network can only detect specific articles, the network of this method can detect any article, and its output states precisely whether the article was left behind or lost, eliminating the separate judgment step required by both existing approaches.
2) A left-behind/lost article detection data set is constructed to train the target detection model, and left-behind or lost articles can be detected directly by feeding the model the three-channel fusion of the current frame, the fast background and the slow background. Experiments show that the method detects well, can detect arbitrary articles in most scenes, and, thanks to the preprocessing added when generating the data set, adapts well to special scenes such as lighting changes. The pipeline is simple: from an input picture the result, including the article position and whether it was left behind or lost, is obtained directly and accurately, with no extra step for judging left-behind versus lost.
Drawings
Fig. 1 is a flow chart of the steps of the present invention.
Fig. 2 is a schematic diagram of the article states.
Fig. 3 is a flow chart of generating the left-behind/lost article data set in the present invention.
Fig. 4 is a schematic diagram of the iterative training used for detection model training and testing in the present invention.
Fig. 5 is a flow chart of left-behind/lost article detection in the device-side application of the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The embodiments described are evidently only some, not all, of the possible embodiments of the invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
Embodiment: a deep-learning-based dual-background-fusion method for detecting left-behind and lost articles.
In the invention, the data fed into the network model has three channels: the first channel is the grayscale image of the current frame, the second is the fast background, and the third is the slow background. Both backgrounds are updated by blending the previous background with the current frame in a fixed proportion, with the fast background using a larger update rate than the slow background. The article state can be divided into three cases, illustrated in Fig. 2, where each circle represents the article region; the darker the circle, the more visible the article in that channel, pure black meaning the article is fully present and pure white meaning it has completely disappeared:
(1) Normal state: either the article has been stationary the whole time, so the current frame shows no difference from the corresponding regions of the fast and slow backgrounds and the article is fully present in all three; or the article is moving or pausing only briefly, in which case it is fully present in the current frame, faintly present in the fast background and almost absent from the slow background, so the current frame differs noticeably from the fast background and even more from the slow background.
(2) Left-behind state: the article has stayed in one place for a while after moving there; it is fully present in the current frame and the fast background and partially present in the slow background, so the current frame closely matches the fast background but still differs somewhat from the slow background.
(3) Lost state: the article left its original resting place some time ago; it has completely disappeared from the current frame and the fast background and has partially faded from the slow background, so the current frame closely matches the fast background but still differs somewhat from the slow background.
Based on this principle, the deep-learning-based dual-background-fusion method for detecting left-behind and lost articles consists of three main parts: 1. producing the left-behind/lost article data set; 2. training and testing the detection model; 3. device-side application. The detailed process is as follows:
1. Producing the left-behind/lost article data set:
This part mainly uses the open-source COCO data set to paste segmentation targets onto backgrounds, generate left-behind/lost labels and store the picture data. The flow chart for generating the data set is shown in Fig. 3, and the detailed steps are as follows:
(1) Collecting background images and segmentation targets. A number of background images and segmentation-target images are randomly selected from the COCO data set, and the image ids and corresponding annotation information are put into lists so that a picture can be looked up by its image id. The annotations include the class, position and mask of the segmentation targets in each picture, so a segmentation target can be obtained from its annotation and the corresponding picture.
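As a rough illustration of step (1), the sketch below uses pycocotools to index background images and segmentation targets by image id. This is only an assumed implementation: the patent states only that the COCO data set is used, and the annotation-file path and sampling counts shown here are placeholders.

```python
import random
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")  # assumed annotation file

# Randomly pick background image ids and candidate segmentation-target annotations.
all_img_ids = coco.getImgIds()
background_ids = random.sample(all_img_ids, 200)                 # backgrounds (count is a placeholder)
ann_ids = coco.getAnnIds(iscrowd=False)
target_anns = coco.loadAnns(random.sample(ann_ids, 1000))        # segmentation targets (placeholder count)

# Lists keyed by image id, so an image and its labels can be looked up by id,
# as described in step (1); each target entry keeps class, bbox and mask.
backgrounds = {i: coco.loadImgs(i)[0] for i in background_ids}
targets = [
    {"image_id": a["image_id"],
     "category_id": a["category_id"],
     "bbox": a["bbox"],
     "mask": coco.annToMask(a)}       # binary mask of the segmented object
    for a in target_anns
]
```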
(2) Pasting segmentation targets onto the background. A background image is selected from the list, several segmentation targets are randomly chosen from the list, and a coordinate position on the background is randomly chosen as the center point of each target. The target is scaled appropriately based on its width and height and the aspect ratio of the background, so that it is neither too large nor too small relative to the background, and is then pasted with mask and image operations.
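A minimal compositing sketch for step (2) is given below, assuming OpenCV/NumPy; the relative-size bounds (min_rel, max_rel) and the boundary clamping are assumptions, since the patent only requires that the pasted target be neither too large nor too small relative to the background.

```python
import cv2
import numpy as np

def paste_target(background, target_rgb, target_mask, center, min_rel=0.05, max_rel=0.3):
    """Paste a segmented object onto the background at center (cx, cy),
    scaling it relative to the background and compositing with mask operations."""
    bh, bw = background.shape[:2]
    ys, xs = np.where(target_mask > 0)
    obj = target_rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    m = target_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Keep the object's larger side between min_rel and max_rel of the background size.
    rel = max(obj.shape[1] / bw, obj.shape[0] / bh)
    scale = 1.0
    if rel > max_rel:
        scale = max_rel / rel
    elif rel < min_rel:
        scale = min_rel / rel
    if scale != 1.0:
        obj = cv2.resize(obj, None, fx=scale, fy=scale)
        m = cv2.resize(m, (obj.shape[1], obj.shape[0]), interpolation=cv2.INTER_NEAREST)

    oh, ow = obj.shape[:2]
    x = int(np.clip(center[0] - ow // 2, 0, bw - ow))   # clamp so the box stays inside
    y = int(np.clip(center[1] - oh // 2, 0, bh - oh))

    out = background.copy()
    roi = out[y:y + oh, x:x + ow]
    roi[m > 0] = obj[m > 0]              # mask-and-image compositing
    return out, (x, y, ow, oh)           # pasted image and the target rectangle
```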
(3) Moving the segmentation targets, i.e., simulating the movement of an article. First the total number of frames generated for one background image is fixed; in this embodiment max_frame_num is set to 600. Then stop and static are randomly generated within the range of max_frame_num, where static indicates for how many frames the target first stays still (possibly 0) before it starts moving, and stop indicates the frame at which the target stops, which must be greater than static. The target then moves according to the randomly generated vec[0] and vec[1], the movement speeds in the x and y directions, i.e., the pixel offsets per frame, with positive values meaning movement in the positive direction and negative values the opposite. The initial position init_boxs and the stopping position static_boxs of the target are recorded, including the center-point coordinates and the size, and the center points of init_boxs and static_boxs must lie within the background image.
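The motion simulation of step (3) could look roughly like the sketch below; the per-frame speed range of ±5 pixels and the reduction of init_boxs/static_boxs to center points are assumptions, not values given in the patent.

```python
import random

def simulate_motion(bw, bh, max_frame_num=600):
    """Pick static (frames the object stays put at the start, possibly 0) and
    stop (frame at which it stops, stop > static), plus per-frame offsets
    vec[0], vec[1]; return per-frame centers with init_boxs / static_boxs."""
    static = random.randint(0, max_frame_num - 2)
    stop = random.randint(static + 1, max_frame_num - 1)
    vec = (random.randint(-5, 5), random.randint(-5, 5))   # assumed speed range, px/frame

    cx, cy = random.randint(0, bw - 1), random.randint(0, bh - 1)
    init_boxs = (cx, cy)                 # initial position (center point only, for brevity)
    centers = []
    for f in range(max_frame_num):
        if static < f <= stop:           # moving phase between static and stop
            cx = min(max(cx + vec[0], 0), bw - 1)
            cy = min(max(cy + vec[1], 0), bh - 1)
        centers.append((cx, cy))
    static_boxs = (cx, cy)               # position where the object finally stops
    return centers, init_boxs, static_boxs
```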
(4) Labeling lost and left-behind, i.e., applying the lost and left-behind labels within a certain frame range of the target's moving and staying. In this embodiment the labeled range is set to 40 to 110 frames: once a segmentation target has moved beyond a certain distance, its initial position init_boxs is labeled as a lost target for frames 40 to 110, and after it comes to rest the fixed position static_boxs is labeled as a left-behind target for frames 40 to 110. The range of 40 to 110 frames corresponds to fast and slow background update rates of 0.04 and 0.01; these parameters were set from test results and experience. At 40 frames the similarity between the current frame and the fast background in the target region is about 67% and that with the slow background about 33%, and the similarity with the fast background reaches 100% by the time the similarity with the slow background reaches 67%; this gradient of differences forms a distinctive feature that the network can recognize.
(5) Fusing and saving the pictures. The current frame and the fast and slow backgrounds are single-channel grayscale images, so the currently processed RGB picture is first converted to grayscale. The fast and slow backgrounds are initialized from the first frame's background image with the segmentation targets attached, and are then updated as B = (1-k)×B + k×A, where A is the current frame, B is the background, and k is the update coefficient in the range 0 to 1; the coefficients of the fast and slow backgrounds are set to 0.04 and 0.01. The current frame and the fast and slow backgrounds are then fused into a three-channel picture and saved.
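The background update and three-channel fusion of step (5) can be written compactly as below, assuming OpenCV/NumPy; only the formula B = (1-k)×B + k×A and the coefficients 0.04/0.01 come from the description, everything else is an illustrative assumption.

```python
import cv2
import numpy as np

K_FAST, K_SLOW = 0.04, 0.01   # fast/slow background update coefficients from the description

def update_backgrounds(gray, fast_bg, slow_bg):
    """Dual-background update B = (1-k)*B + k*A followed by fusing the current
    frame, fast background and slow background into one three-channel picture."""
    if fast_bg is None:                           # first frame initialises both backgrounds
        fast_bg = gray.astype(np.float32)
        slow_bg = gray.astype(np.float32)
    else:
        fast_bg = (1.0 - K_FAST) * fast_bg + K_FAST * gray
        slow_bg = (1.0 - K_SLOW) * slow_bg + K_SLOW * gray
    fused = cv2.merge([gray.astype(np.uint8),
                       fast_bg.astype(np.uint8),
                       slow_bg.astype(np.uint8)])
    return fused, fast_bg, slow_bg

# Usage per frame (fast_bg = slow_bg = None before the first frame):
#   gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY).astype(np.float32)
#   fused, fast_bg, slow_bg = update_backgrounds(gray, fast_bg, slow_bg)
#   cv2.imwrite("000001.png", fused)
```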
(6) The above steps are repeated for each background image in the list, finally producing a data set of more than sixty thousand pictures with corresponding annotation files, where the annotations contain the two classes, left-behind and lost, and the target rectangle positions.
In addition, to reduce false alarms caused by environmental changes such as illumination, preprocessing can be added when generating the data-set pictures. Using the albumentations library, preprocessing steps including brightness change, color jitter, added noise, blurring, random rain, random fog and random shadow are applied to the current frame, each with a probability of 0.5 so that they combine randomly and increase data diversity; a further probability, for example 1/5, then decides whether a given frame is preprocessed at all.
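A possible albumentations pipeline matching this description is sketched below; the specific transform classes chosen here stand in for the listed effects, and only the probabilities 0.5 and 0.2 (1/5) come from the text.

```python
import random
import albumentations as A

# Each augmentation fires with probability 0.5, so the transforms combine randomly.
augment = A.Compose([
    A.RandomBrightnessContrast(p=0.5),   # brightness change
    A.ColorJitter(p=0.5),                # color dithering
    A.GaussNoise(p=0.5),                 # added noise
    A.Blur(p=0.5),                       # blurring
    A.RandomRain(p=0.5),                 # random rain
    A.RandomFog(p=0.5),                  # random fog
    A.RandomShadow(p=0.5),               # random shadow
])

def maybe_preprocess(current_frame, frame_prob=0.2):
    """Apply the augmentation pipeline to the current frame with probability
    frame_prob (0.2, i.e. roughly 1 frame in 5), as described above."""
    if random.random() < frame_prob:
        return augment(image=current_frame)["image"]
    return current_frame
```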
2. Detection model training and testing
Because the device side has limited runtime memory and strict timing requirements, a lightweight network is needed, and layers not supported by the device-side model cannot be used. The yolov framework is therefore selected as the training backbone and its layer structure is pruned and modified: the network depth is doubled and the network width is halved, which keeps the parameter count essentially unchanged while reducing device-side runtime memory; RepVGG modules are added to improve accuracy; and the SiLU activation layer, which the device side does not support, is replaced with the supported LeakyReLU. The generated data set is split into a training set and a validation set at a ratio of 8:1, the network and data-set configuration files are modified, the number of detection classes is set to 2, namely left-behind and lost, multi-scale training is selected, a suitable value is chosen so that GPU memory is used as fully as possible, the remaining parameters keep their defaults, and 300 epochs are trained on a multi-GPU server.
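The depth/width changes and the RepVGG insertion are made in the network configuration files; as one small concrete piece, a PyTorch helper along the following lines could perform the SiLU-to-LeakyReLU swap described above (the negative-slope value is an assumption).

```python
import torch.nn as nn

def replace_silu_with_leakyrelu(model: nn.Module, negative_slope: float = 0.1):
    """Recursively swap every SiLU activation for LeakyReLU so the exported
    model contains only layers supported by the device-side inference engine."""
    for name, child in model.named_children():
        if isinstance(child, nn.SiLU):
            setattr(model, name, nn.LeakyReLU(negative_slope, inplace=True))
        else:
            replace_silu_with_leakyrelu(child, negative_slope)
    return model
```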
A test set is built by collecting videos containing left-behind and lost articles, covering as many different scenes as possible, including special scenes such as illumination changes. The original videos are converted frame by frame into three-channel fusion images of the current frame and the fast and slow backgrounds, using the same procedure and parameters as for data-set generation. After each training run the model is tested on this set; according to the test results, the data-set generation parameters, such as the fast and slow background update coefficients, the frame range labeled as left-behind or lost, and the probability of picture preprocessing, are adjusted, the data set is regenerated and the model retrained. Iterating in this way keeps improving detection until a suitable parameter combination is found and a left-behind/lost article detection model with the expected performance is obtained. The iterative training scheme is shown in Fig. 4.
3. Device-side application
The trained PC-side model cannot be used directly on the device; it is converted with the corresponding conversion tool into a board-side model that can run on the device. The model is converted as an RGB model, which is more convenient than an NV12 model: the fusion of the current frame and the fast and slow backgrounds into three channels is already an RGB-style image that can be fed to an RGB model directly, whereas an NV12 model would require converting the image to NV12 format first. Considering performance, the model input width and height, i.e., the size of the processed picture, can also be reduced appropriately.
The flow chart of left-behind/lost article detection is shown in Fig. 5, and the detailed steps are as follows. The current YUV frame is obtained from the camera video stream, scaled to the input size required by the model, converted to a grayscale image and cast to float. The fast and slow backgrounds are updated: for the first frame both are initialized directly from the current frame's grayscale image, and afterwards they are updated with their update coefficients. The current frame and the fast and slow backgrounds are fused into a three-channel picture and converted to uint8, which is fed into the board-side model for inference; the detection results, including target coordinates, class and confidence, are obtained through post-processing. Conditional filtering such as a confidence threshold and a rule area is applied according to manually set parameters, discarding detections whose confidence is below the threshold or whose center point lies outside the rule area. Finally, the corresponding alarm is raised according to whether the detected class is left-behind or lost; if neither is present, the scene is in a normal state.
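A condensed sketch of this device-side loop follows; the infer callable and the frame source are hypothetical stand-ins for the board-side SDK and video pipeline, and the default threshold and rule area are placeholders.

```python
import numpy as np

def detect_left_lost(gray_frames, infer, conf_thresh=0.4,
                     rule_area=(0, 0, 10**9, 10**9), k_fast=0.04, k_slow=0.01):
    """Device-side loop sketch: gray_frames yields already-resized grayscale
    frames (float32); infer(fused) is assumed to return post-processed
    (box, class, confidence) tuples from the board-side model."""
    fast_bg = slow_bg = None
    x0, y0, x1, y1 = rule_area
    alarms = []
    for gray in gray_frames:
        if fast_bg is None:                          # first frame initialises both backgrounds
            fast_bg, slow_bg = gray.copy(), gray.copy()
        else:
            fast_bg = (1 - k_fast) * fast_bg + k_fast * gray
            slow_bg = (1 - k_slow) * slow_bg + k_slow * gray
        fused = np.stack([gray, fast_bg, slow_bg], axis=-1).astype(np.uint8)

        for (bx0, by0, bx1, by1), cls, conf in infer(fused):
            cx, cy = (bx0 + bx1) / 2, (by0 + by1) / 2
            if conf < conf_thresh or not (x0 <= cx <= x1 and y0 <= cy <= y1):
                continue                             # confidence / rule-area filtering
            alarms.append((cls, (bx0, by0, bx1, by1)))   # "left-behind" or "lost" alarm
    return alarms
```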
In addition, the time needed to detect a left-behind or lost article and the duration of the alarm can be controlled through the number of frames processed per second and the fast and slow background update coefficients. With update coefficients of 0.04 and 0.01, 10 frames processed per second, and the data set labeled for 40 to 110 frames after the article moves or stays, an alarm is ideally raised after about 40/10 = 4 s and lasts (110-40)/10 = 7 s. With the update coefficients unchanged, reducing the processing rate from 10 to 5 frames per second means an alarm needs roughly 40/5 = 8 s and lasts (110-40)/5 = 14 s. With the frame rate unchanged and the update coefficients doubled to 0.08 and 0.02, an alarm needs roughly 4/2 = 2 s and lasts 7/2 = 3.5 s. A sensitivity parameter tied to the processing frame rate and the update coefficients can likewise be used to control these times.
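The arithmetic above can be captured in a small helper, assuming, as the description suggests, that scaling the update coefficients scales the effective frame span proportionally.

```python
def alarm_timing(fps, coeff_scale=1.0, label_start=40, label_end=110):
    """Rough timing estimate: labels span frames 40-110 at the reference update
    coefficients (0.04 / 0.01); scaling the coefficients by coeff_scale shrinks
    the frame span proportionally (an approximation)."""
    time_to_alarm = (label_start / coeff_scale) / fps
    duration = ((label_end - label_start) / coeff_scale) / fps
    return time_to_alarm, duration

# Examples matching the description (approximately):
# alarm_timing(10)                 -> (4.0, 7.0)  seconds
# alarm_timing(5)                  -> (8.0, 14.0) seconds
# alarm_timing(10, coeff_scale=2)  -> (2.0, 3.5)  seconds
```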
Compared with the prior art, the method constructs a left-behind/lost article detection data set and trains a target detection model; left-behind or lost articles are detected directly by feeding the model the three-channel fusion of the current frame, the fast background and the slow background. This overcomes the poor foreground detection of traditional background modeling, and whereas an ordinary target detection network can only detect specific articles, the network of this method can detect any article, and its output states precisely whether the article was left behind or lost, eliminating the separate judgment step of the prior art. Experiments show that the method detects well, can detect arbitrary articles in most scenes, and, thanks to the preprocessing added when generating the data set, adapts well to special scenes such as lighting changes. The pipeline is simple, and the results, including the article position and whether it was left behind or lost, are obtained directly and accurately from the input picture, with high accuracy and no false or missed alarms.
The foregoing is a preferred embodiment of the present invention, but the invention is not limited to this embodiment or to the disclosure of the drawings; any equivalents and modifications made without departing from the spirit of the disclosure fall within the scope of protection of the present invention.

Claims (7)

1. A deep-learning-based method for detecting left-behind and lost articles by fusing dual backgrounds, characterized by comprising the following steps:
S1, generating the left-behind/lost article data set
S11, randomly selecting a number of background images and segmentation-target images from the COCO data set, and putting the image ids and the corresponding annotation information into lists so that an image can be looked up by its id;
S12, selecting a background image from the list, randomly selecting several segmentation targets from the list, randomly choosing a coordinate position on the background image as the center point of each segmentation target, and then pasting the targets onto the background using mask and image operations;
S13, simulating article movement: max_frame_num specifies how many frames of data are generated for one background image; stop and static are then randomly generated within the range of max_frame_num; the target is moved by randomly generated per-frame pixel offsets, and the initial position init_boxs and the stopping position static_boxs of the target are recorded during the movement;
S14, labeling "lost" and "left-behind" within a certain range of frames after the target starts moving or comes to rest;
S15, converting the currently processed RGB picture into a grayscale image, initializing the fast background and the slow background with the first frame's background image with the segmentation targets attached, updating the backgrounds according to their update coefficients, and fusing the current frame, the fast background and the slow background into a three-channel picture for storage;
S16, repeating steps S11 to S15 for each background image in the list, finally generating a data set of over sixty thousand pictures with corresponding annotation files, where the annotations contain the two classes, left-behind and lost, and the target rectangle positions;
S2, training and testing the detection model
S21, selecting the yolov framework as the training backbone, pruning and modifying the network layer structure so that the network depth is doubled and the network width is halved, adding RepVGG modules, and replacing the SiLU activation layer, which the device side does not support, with the supported LeakyReLU;
S22, dividing the generated data set into a training set and a validation set at a ratio of 8:1, modifying the network and data-set configuration files so that the detection classes are left-behind and lost, and training 300 epochs on a multi-GPU server;
S23, building a test set by collecting videos containing left-behind and lost articles, converting the original videos frame by frame into three-channel fusion images of the current frame and the fast and slow backgrounds using the same procedure and parameters as for data-set generation, testing the trained model on these images, adjusting the data-set generation parameters according to the test results, regenerating the data set and retraining, and iterating in this way to keep improving detection until a suitable parameter combination is found and a left-behind/lost article detection model with the expected performance is obtained;
S3, device-side application
S31, obtaining the current YUV frame from the camera video stream, scaling it to the input size required by the model, converting it to a grayscale image and casting it to float;
S32, updating the fast and slow backgrounds: for the first frame both backgrounds are initialized directly from the current frame's grayscale image, and afterwards they are updated with their update coefficients; the current frame and the fast and slow backgrounds are then fused into a three-channel picture and converted to uint8;
S33, feeding the fused picture into the board-side model for inference, and obtaining the detection results, including target coordinates, classes and confidences, through post-processing;
S34, filtering by manually set conditions, discarding detections whose confidence is below the set threshold or whose center point does not lie within the rule area;
S35, raising the corresponding alarm according to whether the detected class is left-behind or lost; if neither is present, the scene is in a normal state.
2. The deep-learning-based dual-background-fusion method for detecting left-behind and lost articles according to claim 1, characterized in that in step S11, the annotation information corresponding to each image id contains the class, position and mask of the segmentation targets in that picture.
3. The deep-learning-based dual-background-fusion method for detecting left-behind and lost articles according to claim 1, characterized in that in step S12, before pasting with mask and image operations, the segmentation target may be scaled appropriately based on its width and height and the aspect ratio of the background image, so that it is neither too large nor too small relative to the background.
4. The deep-learning-based dual-background-fusion method for detecting left-behind and lost articles according to claim 1, characterized in that in step S13, max_frame_num is set to 600, and stop and static are randomly generated within the range of max_frame_num, where static indicates for how many frames the article first stays still before it starts moving, and stop indicates the frame at which the article stops.
5. The deep-learning-based dual-background-fusion method for detecting left-behind and lost articles according to claim 1, characterized in that in step S14, the labeled frame range for moving and staying is set to 40 to 110 frames: once a segmentation target starts moving away from its initial position init_boxs, that position is labeled as a lost target for frames 40 to 110, and once it comes to rest at the fixed position static_boxs after moving, that position is labeled as a left-behind target for frames 40 to 110.
6. The deep-learning-based dual-background-fusion method for detecting left-behind and lost articles according to claim 1, characterized in that in step S15, the background update formula is B = (1-k)×B + k×A, where A is the current frame, B is the background, and k is the update coefficient with a value range of 0 to 1.
7. The deep-learning-based dual-background-fusion method for detecting left-behind and lost articles according to claim 1, characterized in that in step S13, to reduce false alarms caused by environmental changes, preprocessing can be added when generating the data-set pictures: using the albumentations library, preprocessing steps including brightness change, color jitter, added noise, blurring, random rain, random fog and random shadow are applied to the current frame, each with a probability of 0.5 so that they combine randomly and increase data diversity, and each frame is preprocessed at all with a probability of 0.2.
CN202410297353.6A 2024-03-15 2024-03-15 Deep learning-based fusion double-background article carry-over loss detection method Active CN118097112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410297353.6A CN118097112B (en) 2024-03-15 2024-03-15 Deep learning-based fusion double-background article carry-over loss detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410297353.6A CN118097112B (en) 2024-03-15 2024-03-15 Deep learning-based fusion double-background article carry-over loss detection method

Publications (2)

Publication Number Publication Date
CN118097112A CN118097112A (en) 2024-05-28
CN118097112B true CN118097112B (en) 2024-08-16

Family

ID=91151461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410297353.6A Active CN118097112B (en) 2024-03-15 2024-03-15 Deep learning-based fusion double-background article carry-over loss detection method

Country Status (1)

Country Link
CN (1) CN118097112B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296677A (en) * 2016-08-03 2017-01-04 浙江理工大学 A kind of remnant object detection method of double mask context updates based on double-background model
CN111062273A (en) * 2019-12-02 2020-04-24 青岛联合创智科技有限公司 Tracing detection and alarm method for left-over articles

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012133433A (en) * 2010-12-20 2012-07-12 Panasonic Corp Left/carried-away object detection apparatus and left/carried-away object detection method
CN114022468B (en) * 2021-11-12 2022-05-13 珠海安联锐视科技股份有限公司 Method for detecting article left-over and lost in security monitoring

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296677A (en) * 2016-08-03 2017-01-04 浙江理工大学 A kind of remnant object detection method of double mask context updates based on double-background model
CN111062273A (en) * 2019-12-02 2020-04-24 青岛联合创智科技有限公司 Tracing detection and alarm method for left-over articles

Also Published As

Publication number Publication date
CN118097112A (en) 2024-05-28


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant