CN114550036A

CN114550036A - Rapid search method and system for optimal cascade configuration of video target detection

Info

Publication number: CN114550036A
Application number: CN202210147889.0A
Authority: CN
Inventors: 谭光; 李楠
Original assignee: Sun Yat Sen University; Sun Yat Sen University Shenzhen Campus
Current assignee: Sun Yat Sen University; Sun Yat Sen University Shenzhen Campus
Priority date: 2022-02-17
Filing date: 2022-02-17
Publication date: 2022-05-27
Anticipated expiration: 2042-02-17
Also published as: CN114550036B

Abstract

The invention discloses a fast search method and system for the optimal cascade configuration of video target detection. The method comprises the following steps: acquiring a video data set and performing feature calculation on it to obtain a scene feature combination; performing a frame filtering operation on the video data set based on inter-frame similarity; based on an optimized configuration search algorithm, obtaining the lowest-cost cascade configuration scheme that meets the precision requirement for the video data set, and constructing a labeled training set; training a cascade scheme mapper on the training set to obtain a trained cascade scheme mapper; and acquiring a video to be detected, searching for the optimal configuration with the trained cascade scheme mapper, and completing the target detection task. The system comprises a feature calculation module, a filtering module, a search module, a training module and a detection module. With the method and system, the optimal cascade scheme can be obtained automatically and efficiently according to the video scene, and target detection is completed. The invention can be widely applied in the field of target detection.

Description

Rapid search method and system for optimal cascade configuration of video target detection

Technical Field

The invention relates to the field of target detection, and in particular to a fast search method and system for the optimal cascade configuration of video target detection.

Background

With the development of computer vision technology, target detection models based on deep neural networks produce increasingly accurate results. However, applying them to large-scale video data sets remains a significant challenge, mainly because video target detection tasks are executed on mobile devices such as cameras, which cannot bear the high computational cost. It is therefore important to study how to reduce the computational cost while still meeting the detection precision requirement.

Disclosure of Invention

In order to solve the above technical problems, an object of the present invention is to provide a method and a system for quickly searching for an optimal cascade configuration for video target detection, which can automatically and efficiently obtain an optimal cascade scheme according to a video scene to complete target detection.

The first technical scheme adopted by the invention is as follows: a fast search method for video target detection optimal cascade configuration comprises the following steps:

acquiring a video data set and performing feature calculation on the video data set to obtain a scene feature combination;

performing frame filtering operation on the video data set based on the inter-frame similarity to obtain a video data set with redundant frames filtered out;

based on an optimized configuration search algorithm, obtaining the lowest-cost cascade configuration scheme that meets the precision requirement for the video data set, and constructing a labeled training set in combination with the scene feature combination;

training a cascade scheme mapper based on a training set to obtain a trained cascade scheme mapper;

and acquiring a video to be detected, searching for the optimal configuration based on the trained cascade scheme mapper, and completing a target detection task.

Further, before the step of acquiring the video data set and performing feature calculation on the video data set to obtain the scene feature combination, the method further includes:

counting the running cost of all the configuration schemes.

Further, the step of obtaining a video data set and performing feature calculation on the video data set to obtain a scene feature combination specifically includes:

acquiring a video data set, wherein the video data set comprises detection data sets, online videos and self-shot videos;

performing scene analysis on the video data set, calculating the features of each frame in the video data set, preprocessing the features, and extracting the scene feature combination;

the scene feature combination comprises the number of detection targets, the speed of the detection targets, the displacement of the detection targets, the scene offset and the CNN feature.

Further, the preprocessing comprises standardization by the 0-1 standardization method (mean 0, standard deviation 1) and removal of extreme outliers.

Further, the step of performing a frame filtering operation on the video data set based on the inter-frame similarity to obtain a video data set with redundant frames filtered out specifically includes:

acquiring candidate inter-frame difference algorithms;

comparing the computational cost, scene adaptability and filtering threshold interval of the inter-frame difference algorithms;

and selecting an inter-frame difference algorithm to calculate the inter-frame similarity, and filtering the video frames to obtain filtered video frame data.

Further, the step of obtaining, based on the optimized configuration search algorithm, the lowest-cost cascade configuration scheme that meets the precision requirement for the video data set, and constructing the labeled training set in combination with the scene feature combination, specifically comprises the following steps:

dividing videos in the video data set into preset lengths in sequence to obtain a segment set;

carrying out filtering strategy and cascade configuration analysis on each fragment in the fragment set to obtain a corresponding combination scheme;

and generating video data with labels according to the scene feature combination and the combination scheme corresponding to the video data set, and constructing to obtain a training set with labels.

Further, the step of training the mapper for the cascading scheme based on the training set to obtain the mapper for the trained cascading scheme includes:

based on the training set, training the cascade scheme mapper with the scene feature combination as input and the combination scheme as output;

and drawing an accuracy rate change curve in the training process, and debugging the cascade scheme mapper until the accuracy rate is judged to reach a preset value, so as to obtain the cascade scheme mapper after training.

Further, the step of obtaining the video to be detected, searching for the optimal configuration based on the trained cascade scheme mapper, and completing the target detection task specifically includes:

building and running the trained cascade scheme mapper on NCNN;

inputting a video to be detected on line, and performing feature extraction processing on the video to be detected to obtain the features of a scene to be detected;

outputting an optimal combination according to the characteristics of the scene to be detected, wherein the optimal combination comprises cascade configuration and a filtering strategy;

and filtering the video to be detected according to the optimal combination, and finishing a video target detection task by combining cascade configuration.

The second technical scheme adopted by the invention is as follows: a fast search system for optimal cascade configuration of video object detection, comprising:

the characteristic calculation module is used for acquiring a video data set and performing characteristic calculation on the video data set to obtain a scene characteristic combination;

the filtering module executes frame filtering operation on the video data set based on the interframe similarity to obtain the video data set with redundant frames filtered out;

the search module is used for acquiring an optimal cascade configuration scheme which meets the precision requirement and has the lowest cost in the video data set based on an optimized configuration search algorithm, and combining scene characteristics to construct a training set with labels;

the training module is used for training the cascade scheme mapper based on the training set to obtain a trained cascade scheme mapper;

and the detection module is used for acquiring the video to be detected, searching the optimal configuration based on the trained cascade scheme mapper and completing the target detection task.

The beneficial effects of the method and system are as follows: the invention can efficiently select an effective filtering strategy and cascade configuration according to the scene features. After the frames are filtered, the lightweight configuration in the cascade configuration of the current scene is tried first, and the high-precision configuration is used only if the requirement is not met, so that the whole target detection process is efficient and precise at the lowest cost.

Drawings

FIG. 1 is a flow chart of the steps of a fast search method for optimal cascade configuration of video target detection according to the present invention;

FIG. 2 is a block diagram of a fast search system for the optimal cascade configuration of video target detection according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

As shown in FIG. 1, the present invention provides a fast search method for the optimal cascade configuration of video target detection, which includes the following steps:

S0, counting the running cost of all the configuration schemes.
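Step S0 can be pictured as building a cost table over the configuration space before any search starts. The following is a minimal sketch, assuming a configuration is a (frame rate, resolution, detection model) triple, the three parameters the description names for a configuration. The knob values and the `relative_cost` formula are hypothetical placeholders; in practice the cost of each configuration would be measured on the target device.

```python
import itertools

# Hypothetical knob values; the source names the knobs but not their values.
FRAME_RATES = [5, 10, 30]          # frames per second
RESOLUTIONS = [360, 720]           # frame height in pixels
MODELS = ["light", "heavy"]        # placeholder detector names

# Assumed per-pixel weight of each model, standing in for a measured cost.
MODEL_WEIGHT = {"light": 1.0, "heavy": 4.0}

def relative_cost(fps, resolution, model):
    """Stand-in cost: frames per second x pixels per frame x model weight."""
    return fps * resolution * resolution * MODEL_WEIGHT[model]

def build_cost_table():
    """Enumerate every configuration once and record its running cost."""
    table = {}
    for config in itertools.product(FRAME_RATES, RESOLUTIONS, MODELS):
        table[config] = relative_cost(*config)
    return table

cost_table = build_cost_table()
cheapest = min(cost_table, key=cost_table.get)
```

Because the table is computed once, later steps can rank candidate configurations by cost with a dictionary lookup instead of re-profiling them.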

S1, acquiring a video data set and performing feature calculation on the video data set to obtain a scene feature combination;

S1.1, acquiring a video data set, wherein the video data set comprises detection data sets, online videos and self-shot videos;

Specifically, data covering a large number of common target detection scenes can be acquired by downloading mainstream target detection data sets (such as KITTI, VOC and COCO), collecting online videos, or shooting videos oneself.

S1.2, performing scene analysis on the video data set, calculating the features of each frame in the video data set, preprocessing the features, and extracting the scene feature combination;

The scene feature combination comprises the number of detection targets, the speed of the detection targets, the displacement of the detection targets, the scene offset and CNN features, and the preprocessing comprises standardization by the 0-1 standardization method and removal of extreme outliers.

Specifically, after standardization each feature has a mean of 0 and a standard deviation of 1, which effectively removes the influence of differing units and scales between feature values.
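The preprocessing can be illustrated on a single feature column. This is a minimal sketch, assuming that "extreme outliers" are values more than `k` standard deviations from the mean; the cutoff `k = 3` and the sample values are assumptions, since the source does not state how outliers are identified.

```python
import statistics

def remove_extreme_outliers(values, k=3.0):
    """Drop values farther than k population standard deviations from the mean.

    The cutoff k is an assumed parameter, not taken from the source.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= k * std]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical per-frame target counts with one corrupted reading.
target_counts = [3, 4, 5, 4, 3, 5, 4, 4, 3, 5, 4, 200]
kept = remove_extreme_outliers(target_counts)
z = standardize(kept)
```

Removing outliers before standardizing matters here: a single extreme value inflates the standard deviation and would compress all the ordinary values toward zero.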

S2, performing frame filtering operation on the video data set based on the interframe similarity to obtain the video data set with redundant frames filtered out;

S2.1, acquiring candidate inter-frame difference algorithms;

S2.2, comparing the computational cost, scene adaptability and filtering threshold interval of the inter-frame difference algorithms;

Specifically, the methods commonly used in industry frame filtering strategies to measure inter-frame difference, such as mean squared error (MSE), mean absolute error (MAE), peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are surveyed; the computational cost of the different difference algorithms and their adaptability to scenes are compared; and the filtering threshold interval of each inter-frame difference method is determined.

S2.3, selecting an inter-frame difference algorithm to filter the video frames to obtain filtered video frame data.

S3, obtaining an optimal cascade configuration scheme which meets the precision requirement and has the lowest cost in the video data set based on an optimized configuration search algorithm, and combining scene characteristics to construct a training set with labels;

Specifically, because the subsequent neural network training is supervised learning, it presupposes high-quality labeled data. An optimized search is therefore adopted that reduces the time complexity of searching the cascade space from exponential, O(m^n), to O(m·n), so that the optimal cascade scheme for a large number of scenes is obtained efficiently and high-quality labeled training data are generated for the subsequent neural-network-based training of the cascade scheme mapper.

S3.1, sequentially dividing videos in the video data set into preset lengths to obtain a segment set;

Specifically, the video data are divided sequentially into pieces of length 16 s (each called a Window). Each Window is in turn divided into shorter pieces of length 4 s (each called a Segment).
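The Window and Segment division can be sketched as a pair of nested loops over time. This is an illustrative sketch; the function name and the handling of a shorter final Window are assumptions, as the source does not say what happens when the video length is not a multiple of 16 s.

```python
def split_into_windows(duration_s, window_s=16.0, segment_s=4.0):
    """Split [0, duration_s) into Windows, each a list of (start, end) Segments.

    A trailing partial Window or Segment is kept as-is (an assumed choice).
    """
    windows = []
    window_start = 0.0
    while window_start < duration_s:
        window_end = min(window_start + window_s, duration_s)
        segments = []
        segment_start = window_start
        while segment_start < window_end:
            segment_end = min(segment_start + segment_s, window_end)
            segments.append((segment_start, segment_end))
            segment_start = segment_end
        windows.append(segments)
        window_start = window_end
    return windows

# A 40-second video yields two full Windows and one half-length Window.
windows = split_into_windows(40.0)
```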

S3.2, performing filtering strategy and cascade configuration analysis on each fragment in the fragment set to obtain a corresponding combination scheme;

Specifically, the first Segment in each Window undergoes a comprehensive analysis of filtering strategies and cascade configurations to obtain the optimal combination scheme, and the top-K combinations of filtering strategy and cascade configuration are recorded (that is, the K schemes, each consisting of a filtering strategy and a two-stage cascade configuration, that meet the precision requirement at the lowest computational cost).

For each remaining Segment in the same Window, the comprehensive analysis is not repeated; the optimal cascade configuration and filtering strategy are selected only from the top-K schemes, which greatly reduces the combination search space and yields the optimal filtering strategy and two-stage cascade configuration for that Segment.

If no combination satisfying the precision requirement can be obtained from the top-K combinations propagated from the first Segment, the scene has changed substantially, and the top-K combinations must be re-analyzed and updated.
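The top-K propagation within one Window can be sketched as below. The sketch is hedged: `evaluate` (accuracy of a scheme on a Segment) and `cost` (running cost of a scheme) are hypothetical callbacks supplied by the caller, and the toy schemes are plain integers rather than real filter and cascade combinations.

```python
def profile_window(segments, all_schemes, evaluate, cost, target, k=3):
    """Pick the cheapest scheme meeting `target` accuracy for each Segment.

    The first Segment is searched exhaustively and its K cheapest valid
    schemes are remembered. Later Segments only test those K schemes; if
    none is accurate enough, the scene is assumed to have changed and the
    exhaustive search is repeated to refresh the top-K list.
    """
    def full_search(segment):
        valid = [s for s in all_schemes if evaluate(segment, s) >= target]
        return sorted(valid, key=cost)[:k]

    chosen = []
    top_k = []
    for index, segment in enumerate(segments):
        candidates = []
        if index > 0:
            candidates = [s for s in top_k if evaluate(segment, s) >= target]
        if not candidates:
            top_k = full_search(segment)
            candidates = top_k
        chosen.append(min(candidates, key=cost) if candidates else None)
    return chosen

# Toy setup: scheme s costs s and is accurate only if s >= segment difficulty.
schemes = [1, 2, 3, 4, 5]
segment_difficulty = [2, 2, 3, 5]
chosen = profile_window(
    segment_difficulty,
    schemes,
    evaluate=lambda segment, s: 1.0 if s >= segment else 0.0,
    cost=lambda s: s,
    target=0.9,
    k=2,
)
```

In the toy run, the second and third Segments are resolved from the two remembered schemes alone, while the fourth Segment is too hard for either of them and triggers a fresh exhaustive search.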

S3.3, generating labeled video data from the scene feature combination and the combination scheme corresponding to the video data set, and constructing the training set.
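Constructing the labeled training set amounts to pairing each Segment's scene features with the scheme selected for it. The following is a minimal sketch, assuming each distinct scheme is treated as one class label; the scheme tuples and feature values are hypothetical.

```python
def build_training_set(segment_features, segment_schemes):
    """Pair feature vectors with integer class labels, one label per scheme.

    Segments for which no valid scheme was found (None) are skipped.
    Returns the samples and the scheme-to-label mapping.
    """
    label_of = {}
    samples = []
    for features, scheme in zip(segment_features, segment_schemes):
        if scheme is None:
            continue
        label = label_of.setdefault(scheme, len(label_of))
        samples.append((list(features), label))
    return samples, label_of

# Hypothetical schemes: (filter measure, light model, heavy model).
features = [[0.2, -1.0], [0.3, -0.9], [1.5, 0.8], [0.0, 0.0]]
schemes = [("mse", "light", "heavy"), ("mse", "light", "heavy"),
           ("mae", "light", "heavy"), None]
samples, label_of = build_training_set(features, schemes)
```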

S4, training the cascade scheme mapper based on the training set to obtain the trained cascade scheme mapper;

S4.1, based on the training set, training the cascade scheme mapper with the scene feature combination as input and the combination scheme as output;

In addition, a test set may also be constructed for verification.

S4.2, plotting the accuracy curve during training and tuning the cascade scheme mapper until the accuracy is judged to reach a preset value, so as to obtain the trained cascade scheme mapper.

Specifically, the tuning operations are: varying the learning rate, observing the convergence curve of the objective function value and the validation-set accuracy, and selecting a suitable learning rate; varying the number of hidden layers and the number of hidden units in each layer, observing how the accuracy changes, and selecting a suitable number of layers and units; and varying the number of training epochs while observing the training-set and validation-set accuracy curves, and recording the most suitable number of epochs.
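The tuning loop can be illustrated with a deliberately small stand-in model. The sketch below trains a single-layer softmax classifier (no hidden layers, unlike the mapper described above) on toy data, records the validation accuracy after every epoch, and sweeps the learning rate. The data, the candidate learning rates and the epoch budget are all assumptions chosen for illustration.

```python
import math

def scores(weights, x):
    """Linear score of each class for feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    s = scores(weights, x)
    return s.index(max(s))

def accuracy(weights, data):
    return sum(1 for x, y in data if predict(weights, x) == y) / len(data)

def train(train_set, val_set, n_classes, lr, epochs):
    """Plain SGD on cross-entropy; returns weights and the validation curve."""
    weights = [[0.0] * len(train_set[0][0]) for _ in range(n_classes)]
    curve = []
    for _ in range(epochs):
        for x, y in train_set:
            probs = softmax(scores(weights, x))
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)
                weights[c] = [w - lr * grad * v for w, v in zip(weights[c], x)]
        curve.append(accuracy(weights, val_set))
    return weights, curve

def make_set(count_values, speed_values):
    # Features: [bias, standardized target count, standardized target speed].
    # Toy rule: scheme 0 for sparse scenes, scheme 1 for crowded scenes.
    return [([1.0, a, b], 0 if a < 0 else 1)
            for a in count_values for b in speed_values]

train_set = make_set([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0], [-1.0, 0.0, 1.0])
val_set = make_set([-1.75, -0.75, 0.75, 1.75], [-0.5, 0.5])

best = None
for lr in (0.01, 0.1, 1.0):
    _, curve = train(train_set, val_set, n_classes=2, lr=lr, epochs=20)
    best_epoch = curve.index(max(curve)) + 1   # earliest epoch at peak accuracy
    if best is None or max(curve) > best[0]:
        best = (max(curve), lr, best_epoch)
best_accuracy, best_lr, best_epochs = best
```

The same pattern extends to the other tuning knobs named above: wrap the training call in another loop over layer counts or hidden-unit counts and keep whichever setting gives the best validation curve.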

S5, acquiring the video to be detected, searching for the optimal configuration based on the trained cascade scheme mapper, and completing the target detection task.

S5.1, building and running the trained cascade scheme mapper on NCNN;

Specifically, NCNN is a high-performance neural network forward-inference framework heavily optimized for mobile devices, and it can complete the video target detection task efficiently. With it, a target detection model based on a deep neural network can be applied to large-scale video data on a mobile device with limited computing resources, at reduced computational overhead.

S5.2, inputting the video to be detected online, and performing feature extraction on it to obtain the features of the scene to be detected;

S5.3, outputting an optimal combination according to the features of the scene to be detected, wherein the optimal combination comprises a cascade configuration and a filtering strategy;

S5.4, filtering the video to be detected according to the optimal combination, and completing the video target detection task with the cascade configuration.

Specifically, video data are input and the features that describe how the scene changes are computed. Because the selected features are key visual features, they can be computed quickly. The feature values are fed to the combination mapper, which outputs the optimal combination, that is, the optimal cascade configuration and the optimal filtering strategy. The corresponding frame filtering strategy is then executed: frames with a low degree of difference are filtered out directly (the result of the previous frame is reused, so no model is scheduled). Next, the configuration to run is chosen from the two-stage cascade configuration: the lightweight configuration is tried first, and the heavyweight configuration is used only when the precision does not meet the requirement. Finally, the video target detection task is completed with the chosen configuration according to its parameter values: frame rate, resolution and target detection model.
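The online loop, with frame reuse and a light-then-heavy fallback, can be sketched as follows. The sketch makes one notable assumption: the source does not say how "precision does not meet the requirement" is judged at run time, so the detector's own confidence score is used here as a proxy. The model callbacks and thresholds are hypothetical.

```python
def detect_video(frames, light_model, heavy_model, frame_diff,
                 diff_threshold, confidence_threshold):
    """Run filtered, two-stage cascade detection over a frame sequence.

    Each model callback takes a frame and returns (detections, confidence).
    Returns the per-frame results and the number of heavyweight calls.
    """
    results = []
    last_frame, last_result = None, None
    heavy_calls = 0
    for frame in frames:
        # Redundant frame: reuse the previous result, run no model.
        if last_frame is not None and frame_diff(last_frame, frame) <= diff_threshold:
            results.append(last_result)
            continue
        detections, confidence = light_model(frame)
        # Fall back to the heavyweight stage only when the light result is weak.
        if confidence < confidence_threshold:
            detections, confidence = heavy_model(frame)
            heavy_calls += 1
        last_frame, last_result = frame, detections
        results.append(detections)
    return results, heavy_calls

# Toy frames are integers; the light model is unsure on "hard" frames (>= 50).
results, heavy_calls = detect_video(
    frames=[10, 10, 11, 60, 60, 12],
    light_model=lambda f: (f"light:{f}", 0.9 if f < 50 else 0.3),
    heavy_model=lambda f: (f"heavy:{f}", 0.95),
    frame_diff=lambda a, b: abs(a - b),
    diff_threshold=2,
    confidence_threshold=0.5,
)
```

In the toy run, three of the six frames are answered by reuse and the heavyweight model is invoked once, which is the cost saving the cascade is designed to deliver.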

The invention optimizes the mainstream neural-network-based video target detection pipeline. It adapts to scene changes by periodically adjusting the optimal cascade combination scheme used for target detection, keeping precision and computational cost in balance. The method analyzes scene changes, computes scene features, and feeds them to a cascade combination mapper trained as a neural network, which directly outputs the optimal cascade combination; in this way the lowest-cost cascade combination that meets the precision requirement is selected efficiently from a huge cascade combination space. The frame filtering model automatically selects a suitable filtering strategy and threshold for the scene, and can dynamically adjust the filtering decision according to the scene feature type, the query precision and the time-varying correlation among video contents. In addition, the selected inter-frame difference measure is fast to compute, so the frame filtering model does not introduce high overhead. The cascade configuration is likewise adjusted dynamically according to the scene.

In the offline stage, the optimized search algorithm greatly reduces the cascade combination search space, so the high-quality labeled data used to train the neural-network-based cascade combination mapper can be obtained efficiently and the search is fast. This step is performed offline so that the cost of analyzing the optimal cascade combination does not offset the gains from switching cascade combinations. In addition, by building a neural network, the relation between the scene and the optimal cascade combination is learned, so the method can adapt to most real-life scenes. In the online stage, the fast search for the optimal cascade combination of video target detection is completed from the input scene feature information, which improves prediction precision and significantly reduces the resource consumption of video detection, achieving low cost, high efficiency and high accuracy.

As shown in fig. 2, a fast search system for an optimal cascade configuration of video object detection includes:

Further, as a preferred embodiment, the system further comprises:

and the pre-calculation module is used for counting the running cost of all the configuration schemes.

The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.

A fast search device for video target detection optimal cascade configuration comprises:

at least one processor;

at least one memory for storing at least one program;

when executed by the at least one processor, the at least one program causes the at least one processor to implement the fast search method for the optimal cascade configuration of video target detection as described above.

The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.

A storage medium storing processor-executable instructions, wherein the processor-executable instructions, when executed by a processor, implement the fast search method for the optimal cascade configuration of video target detection as described above.

The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A fast search method for video target detection optimal cascade configuration is characterized by comprising the following steps:

2. The method of claim 1, wherein before the steps of obtaining the video data set and performing feature calculation on the video data set to obtain the scene feature combination, the method further comprises:

counting the running cost of all the configuration schemes.

3. The method according to claim 2, wherein the step of obtaining the video data set and performing feature calculation on the video data set to obtain the scene feature combination specifically comprises:

4. The fast search method for optimal cascade configuration of video object detection as claimed in claim 3, wherein the preprocessing comprises normalization and outlier removal processing using 0-1 normalization.

5. The method of claim 4, wherein the step of performing a frame filtering operation on the video data set based on the inter-frame similarity to obtain a video data set with redundant frames filtered includes:

acquiring an interframe difference algorithm;

6. The method according to claim 5, wherein the step of obtaining the optimal cascade configuration scheme with the lowest cost and meeting the accuracy requirement in the video data set based on the optimized configuration search algorithm and constructing the training set with the tags by combining the scene features specifically comprises:

and generating video data with labels according to the scene feature combination and combination scheme corresponding to the video data set, and constructing to obtain a training set with labels.

7. The method as claimed in claim 6, wherein the step of training the mapper based on the training set to obtain a trained mapper comprises:

based on the training set, training the cascade scheme mapper with the scene feature combination as input and the combination scheme as output;

and plotting the accuracy curve during training and tuning the cascade scheme mapper until the accuracy is judged to reach a preset value, so as to obtain the trained cascade scheme mapper.

8. The method as claimed in claim 7, wherein the step of obtaining the video to be detected and searching for the optimal configuration based on the trained mapper of the cascading scheme to complete the target detection task includes:

building and running the trained cascade scheme mapper on NCNN;

9. A fast search system for optimal cascading configuration of video object detection, comprising:

the searching module is used for acquiring an optimal cascade configuration scheme which meets the precision requirement and has the lowest cost in the video data set based on an optimized configuration searching algorithm, and combining scene characteristics to construct a training set with labels;