
CN109903339A - Video group person localization and detection method based on multi-dimensional fusion features - Google Patents

Video group person localization and detection method based on multi-dimensional fusion features Download PDF

Info

Publication number
CN109903339A
CN109903339A (application CN201910235608.5A)
Authority
CN
China
Prior art keywords
video
feature
detection
personage
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910235608.5A
Other languages
Chinese (zh)
Other versions
CN109903339B (en)
Inventor
陈志�
掌静
岳文静
周传
陈璐
刘玲
任杰
周松颖
江婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910235608.5A priority Critical patent/CN109903339B/en
Publication of CN109903339A publication Critical patent/CN109903339A/en
Application granted granted Critical
Publication of CN109903339B publication Critical patent/CN109903339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The present invention discloses a video group person localization and detection method based on multi-dimensional fusion features. The method first extracts multi-level video feature maps and establishes top-down and bottom-up bidirectional feature-processing channels to fully mine the semantic information of the video. It then fuses the multi-level feature maps to obtain multi-dimensional fusion features, extracts candidate targets from the video, and finally performs candidate-position regression and category classification in parallel to complete video group person localization and detection. By fusing multi-level features the invention obtains rich video semantic information, and by performing multi-task prediction in parallel it effectively improves the speed of group person localization and detection, achieving good accuracy and practicability.

Description

Video group person localization and detection method based on multi-dimensional fusion features
Technical field
The present invention relates to the intersection of computer vision, pattern recognition, and related fields, and in particular to a video group person localization and detection method based on multi-dimensional fusion features.
Background technique
With the development of video acquisition and image processing technology, video group person localization and detection has become a popular research direction in computer vision. It has a wide range of applications and is also the basis of higher-level computer vision problems, such as dense crowd monitoring and social semantic analysis.
The task of video group person localization and detection is not difficult for the human eye, which perceives and localizes target persons mainly through blocks of different colors. What a computer processes, however, is an RGB matrix, and segmenting the regions occupied by a group of people in a scene while reducing the influence of the background on localization is far from trivial.
The development of video group person localization algorithms has gone through several leaps: bounding-box regression, the rise of deep neural networks, multi-reference-window designs, hard-example mining, and multi-scale multi-port detection. According to their core, these algorithms can be divided into two types: localization algorithms based on traditional hand-crafted features and localization algorithms based on deep learning. Before 2013, person detection in video or images was mainly based on traditional hand-crafted features and was limited by feature description and computing capability. Computer vision researchers designed diverse detection algorithms to compensate for the limited representational power of hand-designed features, and used elaborate calculation methods to accelerate detection models and reduce space-time consumption. Several representative hand-crafted feature detectors emerged in this period: the Viola-Jones detector, the HOG detector, and the deformable part model detector.
With the rise of deep neural networks, detection models based on deep learning overcame the limited feature-description ability of traditional hand-crafted methods by automatically learning feature representations, containing thousands of parameters, from big data; for a new application scenario, new effective feature representations can be obtained quickly through training. Deep-learning-based detection models fall broadly into two directions: region-proposal-based and end-to-end. A region-proposal-based model first selects a large number of candidate boxes from the image to be detected, which may contain the targets to be detected; it then extracts features from each candidate box to obtain feature vectors, classifies the feature vectors to obtain category information, and finally performs position regression to obtain the corresponding coordinates. An end-to-end model discards candidate-box extraction and completes feature extraction, box regression, and classification directly within one convolutional network.
Group person behavior has the characteristics of commonality and diversity; it is a set of interactive interpersonal behaviors and interactions between people and the environment. During group behavior, people easily occlude each other or are occluded by objects, and factors such as illumination variation interfere with video imaging. Because of these disturbing factors, existing deep-learning-based detection models may fail to locate person positions accurately during detection, and may even miss persons entirely.
Summary of the invention
Purpose of the invention: in a group scene, multiple persons exist simultaneously, so efficient localization and detection of the group requires an accurate feature description of each person. Existing deep-learning detection models usually use only single top-level video features as the detection basis; although top-level features contain rich video semantics, the regressed person positions are relatively coarse. In recent years, some detection models have fused multi-level video features to improve detection accuracy, but these models use only a unidirectional fusion structure when merging bottom-level features. As a result, each level's feature map contains only the information of the current and higher levels and cannot reflect the mapping of all levels, preventing the detection results from being optimal. To overcome these shortcomings of the prior art, the present invention proposes a video group person localization and detection method based on multi-dimensional fusion features. The method extracts multi-level video features and fuses them through a bidirectional processing channel to form multi-dimensional fusion features, effectively exploiting the feature information of all levels and obtaining rich video semantics, so that person features in the video are described more comprehensively. Multi-task prediction is performed in parallel, effectively improving the speed of group person localization and detection with good accuracy and practicability.
Technical solution: to achieve the above object, the present invention proposes the following technical solution:
A video group person localization and detection method based on multi-dimensional fusion features comprises steps (1) to (8), executed in sequence:
(1) Input a video as a training sample, with the object categories and positions in the video known; normalize the size of the video frame by frame, uniformly scaling each frame to H × W, where H denotes the video frame height and W the video frame width;
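Step (1) can be sketched as follows. This is a minimal numpy illustration of the frame-size normalization, using nearest-neighbour resampling as a stand-in for whatever scaling an implementation actually uses; `resize_frame` is a hypothetical helper, not part of the patent.

```python
import numpy as np

def resize_frame(frame: np.ndarray, h: int = 720, w: int = 1280) -> np.ndarray:
    """Nearest-neighbour resize of an RGB frame to H x W (illustration only)."""
    src_h, src_w = frame.shape[:2]
    rows = np.arange(h) * src_h // h   # source row index for each output row
    cols = np.arange(w) * src_w // w   # source column index for each output column
    return frame[rows[:, None], cols[None, :]]

# Normalize a toy "video" of three 480x640 frames to the uniform H x W size.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
normalized = [resize_frame(f) for f in frames]
assert all(f.shape == (720, 1280, 3) for f in normalized)
```

In practice a library resizer (with interpolation) would replace this helper; the point is only that every frame leaves this step with the same H × W shape.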
(2) Use the InceptionV3 model to extract features frame by frame from the video processed by step (1), obtaining the image features of each level of the video and forming the multi-level video feature map F' = {F'_i | i = 1, 2, …, numF}, where F'_i denotes the i-th layer image feature, numF the total number of extracted feature layers, F'_1 the bottom-level image feature, and F'_numF the top-level image feature;
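The multi-level extraction of step (2) can be mimicked without the actual InceptionV3 weights. The sketch below builds a four-level pyramid (numF = 4) by repeated 2×2 mean pooling as a stand-in for the network's intermediate feature maps; `feature_pyramid` is a hypothetical helper, and the halving-per-level layout is an assumption, not the real InceptionV3 geometry.

```python
import numpy as np

def feature_pyramid(frame: np.ndarray, num_levels: int = 4):
    """Stand-in for InceptionV3 feature extraction: each level halves the
    spatial size via 2x2 mean pooling. levels[0] plays the role of F'_1
    (bottom) and levels[-1] the role of F'_numF (top)."""
    levels = []
    x = frame.astype(np.float32)
    for _ in range(num_levels):
        h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2   # crop to even size
        x = x[:h, :w].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))
        levels.append(x)
    return levels

F_prime = feature_pyramid(np.random.rand(720, 1280, 3))
assert [f.shape[:2] for f in F_prime] == [(360, 640), (180, 320), (90, 160), (45, 80)]
```

A real implementation would tap intermediate activations of a pretrained InceptionV3; only the multi-resolution structure matters for the fusion that follows.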
(3) Perform a feature fusion operation on the extracted multi-level video feature map F', comprising steps (3-1) to (3-3) executed in order:
(3-1) Add a fusion channel from F'_numF to F'_1 and perform top-down feature fusion on the multi-level feature map F', obtaining the top-down video feature map F^{top-down}. The fusion method is: starting from the top-level image feature F'_numF, traverse each layer image feature F'_i downward, applying to F'_i a convolution with kernel size conv1 and stride stride1 followed by an upSample1-times upsampling to obtain F_i^{top-down}; finally obtain F^{top-down} = {F_i^{top-down} | i = 1, 2, …, numF};
(3-2) Add a fusion channel from F_1^{top-down} to F_numF^{top-down} and perform bottom-up feature fusion on F^{top-down}, obtaining the bottom-up video feature map F^{bottom-up} = {F_i^{bottom-up} | i = 1, 2, …, numF}, where F_i^{bottom-up} denotes the i-th layer image feature of F^{bottom-up}. The fusion method is:
a. Initialize i = 1 and set F_1^{bottom-up} = F_1^{top-down};
b. Apply to F_i^{bottom-up} a convolution with kernel size conv2 and stride stride2, obtaining a result R_i, and compute F_{i+1}^{bottom-up} = R_i + F_{i+1}^{top-down};
c. Update i = i + 1;
d. Repeat steps b to c until i > numF; after the loop ends, obtain:
F^{bottom-up} = {F_i^{bottom-up} | i = 1, 2, …, numF}
(3-3) Apply to each layer image feature F_i^{bottom-up} of the bottom-up video feature map F^{bottom-up} a convolution with kernel size conv3 and stride stride3, denoting the result F_i; all obtained F_i constitute the multi-dimensional fusion feature map F = {F_i | i = 1, 2, …, numF};
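Steps (3-1) to (3-3) together form one bidirectional fusion pass. The numpy sketch below models the 1×1 convolutions (conv1, conv3) as identity and the 3×3 stride-2 convolution (conv2) as 2×2 mean pooling, so it illustrates only the data flow, top-down upsample-and-add followed by bottom-up downsample-and-add, not the learned operations; all helper names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (stand-in for upSample1 = 2)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(x):
    """2x2 mean pooling (stand-in for the conv2 = 3, stride2 = 2 convolution)."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def fuse(F_prime):
    """Bidirectional fusion: top-down pass (3-1), then bottom-up pass (3-2)."""
    n = len(F_prime)
    # (3-1): top-down, from F'_numF toward F'_1
    td = [None] * n
    td[-1] = F_prime[-1]
    for i in range(n - 2, -1, -1):
        td[i] = F_prime[i] + upsample2x(td[i + 1])
    # (3-2): bottom-up, from the finest top-down map upward
    bu = [td[0]]
    for i in range(n - 1):
        bu.append(downsample2x(bu[i]) + td[i + 1])
    return bu  # (3-3) would apply a final 1x1 convolution per level

# Toy pyramid with numF = 4 levels, halving per level.
F_prime = [np.ones((64 // 2 ** i, 64 // 2 ** i, 4)) for i in range(4)]
F = fuse(F_prime)
assert [f.shape[0] for f in F] == [64, 32, 16, 8]
```

Each output level thus mixes information from every level of the pyramid, which is the point of adding the second (bottom-up) channel.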
(4) Input the multi-dimensional fusion feature map F into the region proposal network and output K detected targets, obtaining the target position set Box = {Box_j | j = 1, 2, …, K} and the corresponding person probability set Person = {Person_j | j = 1, 2, …, K}, where Box_j denotes the position of the j-th detected target and Person_j the probability that the j-th detected target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the more likely the detected target is a person;
(5) Classify the detected targets according to Person. Let the true class labels of the K detected targets be PPerson = {PPerson_j | j = 1, 2, …, K}, and compute the group person classification loss function Loss_cls = −(1/K) Σ_{j=1}^{K} [PPerson_j·log(Person_j) + (1 − PPerson_j)·log(1 − Person_j)], where PPerson_j denotes the true class of the j-th detected target and takes the value 0 or 1: PPerson_j = 0 indicates the detected target is not a person, and PPerson_j = 1 indicates it is a person;
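With PPerson_j ∈ {0, 1} and Person_j ∈ [0, 1], a natural form of Loss_cls is the binary cross-entropy averaged over the K targets; the sketch below assumes that form (the original formula is an image in the patent and may differ in detail).

```python
import numpy as np

def cls_loss(person_prob, person_label, eps=1e-7):
    """Binary cross-entropy over the K detected targets: an assumed form of
    Loss_cls, consistent with 0/1 labels PPerson_j and probabilities Person_j."""
    p = np.clip(np.asarray(person_prob, dtype=float), eps, 1 - eps)
    y = np.asarray(person_label, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Confident, mostly-correct predictions incur a smaller loss than guessing 0.5.
loss = cls_loss([0.9, 0.1, 0.8], [1, 0, 1])
assert loss < cls_loss([0.5, 0.5, 0.5], [1, 0, 1])
```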
(6) Regress the target positions according to Box and Person. Let the true positions of the K detected targets be:
BBox = {BBox_j | j = 1, 2, …, K}
Compute the group person position loss function Loss_loc from the predicted positions Box and the true positions BBox, where BBox_j denotes the true position of the j-th detected target;
(7) Compute the group person localization loss value Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region proposal network is trained: output the region proposal network parameters and execute step (8). If Loss > Loss_max, update each layer parameter θ of the region proposal network as θ ← θ − α·∂Loss/∂θ, then return to step (4) and perform person detection again. Loss_max is the preset maximum loss value for crowd localization, λ is the balance factor between the position-regression and person-classification tasks, α is the learning rate of stochastic gradient descent, and ∂Loss/∂θ denotes the partial derivative of the group person localization loss function;
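The loop of step (7) combines the two losses and takes one gradient step. A minimal sketch of the update rule, assuming plain stochastic gradient descent with the balance factor λ and learning rate α defined above (the gradient itself would come from backpropagation, which is out of scope here):

```python
import numpy as np

def total_loss(loss_cls, loss_loc, lam=1.0):
    """Loss = Loss_cls + lambda * Loss_loc, with the task-balance factor lambda."""
    return loss_cls + lam * loss_loc

def sgd_step(theta, grad, alpha=1e-4):
    """One SGD update: theta <- theta - alpha * dLoss/dtheta."""
    return theta - alpha * np.asarray(grad)

theta = np.array([0.5, -0.2])
theta2 = sgd_step(theta, grad=np.array([2.0, -1.0]), alpha=0.1)
assert np.allclose(theta2, [0.3, -0.1])
assert total_loss(0.25, 0.25, lam=1.0) == 0.5
```

Training then simply repeats: forward pass, compute Loss, stop if Loss ≤ Loss_max, otherwise apply `sgd_step` to every parameter and retry.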
(8) Acquire a new video to be detected; apply size normalization, feature extraction, and feature fusion to it in turn, obtaining its multi-dimensional fusion feature map F_new; input F_new into the region proposal network trained in step (7) to obtain the group person localization and detection results in the new video.
Further, in step (1), H = 720 and W = 1280.
Further, in step (2), numF = 4.
Further, in step (3), conv1 = 1, stride1 = 1, upSample1 = 2, conv2 = 3, stride2 = 2, conv3 = 1, stride3 = 1.
Further, in step (4), K = 12; in step (7), Loss_max = 0.5, λ = 1, α = 0.0001.
Beneficial effects: compared with the prior art, the above technical solution has the following technical effects:
The present invention extracts multi-level video representations, performs bidirectional feature processing, and fuses the multi-level feature maps to obtain multi-dimensional fusion features; it extracts video candidate targets and processes candidate-position regression and category classification in parallel, completing video group person localization and detection. By fusing multi-level features the invention obtains rich video semantic information, and by performing multi-task prediction in parallel it effectively improves the speed of group person localization and detection, with good accuracy and practicability. Specifically:
(1) The present invention establishes top-down and bottom-up dual feature-processing channels, fully mining the semantic information of the video and improving the utilization of hierarchical features.
(2) The present invention fuses multi-dimensional video features, organically combining position-accurate bottom-level features with semantically rich top-level features, which improves detection accuracy.
(3) The present invention processes multiple prediction tasks in parallel and sets a task balance factor, which helps build an optimal detection model according to scene characteristics.
Description of the drawings
Fig. 1 is the flow of the video group person localization and detection method based on multi-dimensional fusion features;
Fig. 2 is the structure of the region proposal network used in the present invention;
Fig. 3 compares the detection accuracy of different methods.
Specific embodiment
The technical solution of the present invention is described in further detail below with reference to the drawings and specific embodiments:
Embodiment 1: Fig. 1 is the flow chart of the video group person localization and detection method based on multi-dimensional fusion features proposed in this embodiment, which specifically includes the following steps:
One, preprocessing: input a video as a training sample, with the object categories and positions in the video known; normalize the size frame by frame, uniformly scaling each frame to H × W, where H denotes the video frame height and W the video frame width. This step amounts to preprocessing and benefits subsequent detection; in this embodiment, H = 720, W = 1280.
Two, feature extraction: use the InceptionV3 model to extract features frame by frame from the video processed in step (1), obtaining the image features of each level of the video and forming the multi-level video feature map F' = {F'_i | i = 1, 2, …, numF}, where F'_i denotes the i-th layer image feature, numF the total number of extracted feature layers, F'_1 the bottom-level image feature, and F'_numF the top-level image feature; in this embodiment, numF = 4.
Bottom-level features carry accurate target position information and can regress detailed location data, but they characterize little semantic information, have a large data volume, and consume considerable space and time to process. Top-level features contain rich semantics, but after multiple levels of processing the target positions are relatively coarse and the regressed target semantics are not fine-grained, which easily causes misjudgment in group scenes. The features of each level thus have their own advantages and disadvantages. To extract accurate group person position information in a group scene, the InceptionV3 model is used to extract multi-level image features from the video and form the multi-level feature map. InceptionV3 is chosen in this step because this feature extraction model is not only functionally strong but also has powerful computing performance, which facilitates later processing.
Three, feature fusion: perform a feature fusion operation on the extracted multi-level video feature map F', comprising steps (3-1) to (3-3) executed in order:
(3-1) Add a fusion channel from F'_numF to F'_1 and perform top-down feature fusion on the multi-level feature map F', obtaining the top-down video feature map F^{top-down}. The fusion method is: starting from the top-level image feature F'_numF, traverse each layer image feature F'_i downward, applying to F'_i a convolution with kernel size conv1 and stride stride1 followed by an upSample1-times upsampling to obtain F_i^{top-down}; finally obtain F^{top-down} = {F_i^{top-down} | i = 1, 2, …, numF};
(3-2) Add a fusion channel from F_1^{top-down} to F_numF^{top-down} and perform bottom-up feature fusion on F^{top-down}, obtaining the bottom-up video feature map F^{bottom-up} = {F_i^{bottom-up} | i = 1, 2, …, numF}, where F_i^{bottom-up} denotes the i-th layer image feature of F^{bottom-up}. The fusion method is:
a. Initialize i = 1 and set F_1^{bottom-up} = F_1^{top-down};
b. Apply to F_i^{bottom-up} a convolution with kernel size conv2 and stride stride2, obtaining a result R_i, and compute F_{i+1}^{bottom-up} = R_i + F_{i+1}^{top-down};
c. Update i = i + 1;
d. Repeat steps b to c until i > numF; after the loop ends, obtain F^{bottom-up} = {F_i^{bottom-up} | i = 1, 2, …, numF};
(3-3) Apply to each layer image feature F_i^{bottom-up} of the bottom-up video feature map F^{bottom-up} a convolution with kernel size conv3 and stride stride3, denoting the result F_i; all obtained F_i constitute the multi-dimensional fusion feature map F = {F_i | i = 1, 2, …, numF}.
In step three, conv1 = 1, stride1 = 1, upSample1 = 2, conv2 = 3, stride2 = 2, conv3 = 1, stride3 = 1.
Multi-layer feature fusion is not a simple addition: one must first consider whether the sizes of the hierarchical features are consistent, and then whether the fusion is reasonable, since fusion can instead reduce detection performance. The present invention improves on existing feature fusion methods: in the top-down structure, each layer contains the feature information of the current and higher layers, so each layer's optimal scale can be used directly for detection. To reflect the mapping of all hierarchical features and achieve optimal detection, a bottom-up channel is specially added, which reversely connects the top-down processing results and makes more efficient use of bottom-level position information; finally a convolution is applied to each fusion result to eliminate the aliasing effect of upsampling.
Four, region proposal network training:
The region proposal network is a common target detection network whose main functional modules are shown in Fig. 2. It first generates k rectangular windows for each pixel of the sliding window to adapt to targets of different sizes, then feeds the position of each rectangular window and the corresponding image features into the network, and performs classification-layer and regression-layer operations for each rectangular window. The classification layer mainly estimates the probability that a person exists in the current rectangular window; its parameters include the person weight parameter W_P and the background interference parameter W_E. The regression layer mainly obtains the coordinates of the current rectangular window in the full-scale image; its parameters include the window coordinate and width-height offset weight parameters W_x, W_y, W_h, and W_w. Throughout the training of the region proposal network, the settings and adjustment of all parameters are shared.
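The k rectangular windows per sliding position correspond to the anchor boxes of a standard region proposal network. A sketch of anchor generation for one position, assuming the usual scale/aspect-ratio parametrization (the concrete scales and ratios below are illustrative, not taken from the patent):

```python
import numpy as np

def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors (cx, cy, w, h) for one
    sliding-window position. Each anchor keeps area scale**2 while varying
    its aspect ratio w/h = ratio."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            boxes.append((cx, cy, w, h))
    return boxes

anchors = make_anchors(100, 100)
assert len(anchors) == 9  # k = 3 scales x 3 ratios
```

Each anchor is then scored by the classification layer and adjusted by the regression layer's offset parameters.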
The training process of the region proposal network is as follows:
(4-1) Input the multi-dimensional fusion feature map F into the region proposal network and output K detected targets, here K = 12, thereby obtaining the target position set Box = {Box_j | j = 1, 2, …, 12} and the corresponding person probability set Person = {Person_j | j = 1, 2, …, 12}, where Box_j denotes the position of the j-th detected target and Person_j the probability that the j-th detected target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the more likely the detected target is a person;
(4-2) Classify the detected targets according to Person. Let the true class labels of the 12 detected targets be PPerson = {PPerson_j | j = 1, 2, …, 12}, and compute the group person classification loss function Loss_cls = −(1/12) Σ_{j=1}^{12} [PPerson_j·log(Person_j) + (1 − PPerson_j)·log(1 − Person_j)], where PPerson_j denotes the true class of the j-th detected target and takes the value 0 or 1: PPerson_j = 0 indicates the detected target is not a person, and PPerson_j = 1 indicates it is a person;
(4-3) Regress the target positions according to Box and Person. Let the true positions of the 12 detected targets be BBox = {BBox_j | j = 1, 2, …, 12}, and compute the group person position loss function Loss_loc from the predicted positions Box and the true positions BBox, where BBox_j denotes the true position of the j-th detected target;
(4-4) Compute the group person localization loss value Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region proposal network is trained: output the region proposal network parameters and execute step (8). If Loss > Loss_max, update each layer parameter θ of the region proposal network as θ ← θ − α·∂Loss/∂θ, then return to step (4) and perform person detection again. Loss_max is the preset maximum loss value for crowd localization, λ is the balance factor between the position-regression and person-classification tasks, α is the learning rate of stochastic gradient descent, and ∂Loss/∂θ denotes the partial derivative of the group person localization loss function; in this embodiment, Loss_max = 0.5, λ = 1, α = 0.0001.
Five, detecting the video to be detected with the trained region proposal network:
Acquire a new video to be detected; apply size normalization, feature extraction, and feature fusion to it in turn, obtaining its multi-dimensional fusion feature map F_new; input F_new into the region proposal network trained in step (7) to obtain the group person localization and detection results in the new video. Using the region proposal network for target detection takes into account that group scenes contain many persons and complex tasks: position regression and category classification are performed in parallel, improving detection efficiency. During category classification, because the detection target is clearly a person, only two classes are used, person and non-person, reducing the time wasted on detecting other categories, and the true classification results are incorporated to improve classification accuracy. During position regression, to simplify the computation, only person-class target positions are regressed, refining the regression task. During overall training, a task balance factor is added; according to the scene type, the optimal task ratio is adjusted to complete video group person localization and detection.
Six, experimental simulation
To test the performance of the method, the currently common object detection methods Faster-RCNN, FPN, and Mask-RCNN are selected for comparison; the evaluation criterion is detection accuracy under different IoU thresholds and different target sizes. IoU is the intersection-over-union ratio of the detection result and the ground truth, IoU ∈ [0, 1]; the higher the IoU value, the closer the detection result is to the ground truth. In testing, IoU ≥ 0.5 is denoted AP_50 and IoU ≥ 0.75 is denoted AP_75. For evaluation, targets are divided into three size categories, small, medium, and large, denoted AP_S, AP_M, and AP_L respectively. Fig. 3 gives the detection accuracy comparison of the present invention with the comparison methods Faster-RCNN, FPN, and Mask-RCNN. The experimental results show that, compared with Faster-RCNN, which uses only single top-level features, the three methods using multi-level fusion features achieve higher detection accuracy, illustrating that multi-level fusion features have stronger representational power than single top-level features. FPN and Mask-RCNN use only a one-way structure for fusion during feature processing, whereas the present invention uses a bidirectional processing channel and obtains more accurate detection; the experimental results also show that the present method achieves better detection accuracy under different IoU thresholds and target sizes.
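IoU as defined above is straightforward to compute for axis-aligned boxes; a minimal sketch, with boxes given by their corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)          # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0        # identical boxes
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-12
```

A detection counts toward AP_50 when its IoU with a ground-truth box is at least 0.5, and toward AP_75 at 0.75.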
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (5)

1. a kind of video group personage's position finding and detection method based on multidimensional fusion feature, which is characterized in that executed including sequence The step of (1) to (8):
(1) video as training sample is inputted, the kind of object and position in video are returned it is known that carrying out size frame by frame to video The size of each frame video frame is uniformly scaled H × W size by one processing, and H indicates video frame height, and W indicates that video frame is wide Degree;
(2) video is obtained frame by frame to video carries out feature extraction by step (1) treated using InceptionV3 model The characteristics of image of each level forms multi-layer video features figure F ', F '={ Fi' | i=1,2 ..., numF }, Fi' indicate i-th Tomographic image feature, numF indicate the total number of plies of video image characteristic extracted, F1' indicate underlying image feature, F 'numFIndicate top Tomographic image feature;
(3) the multi-layer video features figure F ' carry out Fusion Features operation to being drawn into, includes the steps that successively executing (3-1) extremely (3-4):
(3-1) increases by one from F 'numFTo F1' fusion channel, it is downward from top-level feature to multi-layer video features figure F ' progress Fusion Features, obtain top-down video features figure Ftop-down;The method of Fusion Features are as follows: since top layer images feature F′numFStart, traverses each tomographic image feature F downwardsi', to Fi' carry out convolution kernel successively as conv1, step-length stride1's Convolution operation and upSample1Up-sampling operation again, obtainsFinally obtain Ftop-down={ Fi top-down| i=1, 2 ..., numF };
(3-2) increases by one from F1 top-downIt arrivesFusion channel, to Ftop-downCarry out the spy upward from low-level image feature Sign fusion, obtains bottom-up video features figure Fbottom-up, Fbottom-up={ Fi bottom-up| i=1,2 ..., numF }, Fi bottom-upIndicate bottom-up video features figure Fbottom-upThe i-th tomographic image feature;The method of Fusion Features are as follows:
A. i=1 is initialized;
B. F is calculatedi bottom-up=Fi top-down, to Fi bottom-upProgress convolution kernel is conv2, step-length stride2Convolution behaviour Make, obtains resultIt calculates
C. i=i+1 is updated;
D. circulation executes step b to c, until i > numF is obtained after circulation terminates:
Fbottom-up={ Fi bottom-up| i=1,2 ..., numF }
(3-3) is to bottom-up video features figure Fbottom-upIn each tomographic image feature Fi bottom-upCarrying out convolution kernel is conv3, step-length stride3Convolution operation, obtained result is denoted as Fi, obtained all FiConstitute multidimensional fusion feature figure F, F={ Fi| i=1,2 ..., numF };
(4) by multidimensional fusion feature figure F input area candidate network, K detection target is exported, obtains target position set Box ={ Boxj| j=1,2 ..., K } and corresponding personage's Making by Probability Sets Person={ Personj| j=1,2 ..., K }, the Boxj Indicate the position of j-th of detection target, PersonjIndicate that j-th of detection target is the probability of personage, Personj∈ [0,1], PersonjThe bigger expression detection target of value be personage a possibility that it is bigger;
(5) classified according to Person to detection target, the real border frame position that K detection target is arranged is PPerson ={ PPersonj| j=1,2 ..., K }, calculate group personage classification loss function Losscls, calculation formula isWherein, PPersonjIndicate the true classification of j-th of detection target, PPersonj Value is 0 or 1, PPersonj=0 indicates that the detection target is not personage, PPersonj=1 indicates that the detection target is personage;
(6) according to Box and Person regressive object position, the actual position of K detection target is set are as follows:
BBox={ BBoxj| j=1,2 ..., K }
Calculate group's character positions loss function are as follows:
Wherein, BBoxjIndicate the actual position of j-th of detection target;
(7) group personage detection and localization penalty values Loss, calculation formula Loss=Loss are calculatedcls+λLosslocIf Loss≤ Lossmax, then region candidate network is trained finishes, output area candidate network parameter, executes step (8);If Loss > Lossmax, then each layer of update area candidate network of parameterThen return step (4), re-start Person detecting;LossmaxIt is preset crowd's detection and localization maximum loss value, λ is the balance of position recurrence and human classification task The factor, α are the learning rates of stochastic gradient descent method,Indicate the partial derivative of group personage detection and localization loss function;
(8) Reacquire a video to be detected, and apply normalization, feature extraction, and feature fusion to it in turn to obtain its multi-dimensional fusion feature map F_new; input F_new into the region proposal network trained in step (7) to obtain the group person localization detection result for the new video.
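The step-(8) inference chain can be expressed as a simple composition; every function name below is a hypothetical placeholder for the corresponding claim step, not an identifier from the patent:

```python
def detect_group_persons(video, rpn, normalize, extract_features, fuse_features):
    """Step-(8) inference chain. The four callables are hypothetical
    stand-ins for steps (1)-(4); none of their names come from the patent."""
    frames = normalize(video)            # step (1): scale frames to H x W
    levels = extract_features(frames)    # step (2): numF pyramid levels
    f_new = fuse_features(levels)        # step (3): fusion feature map F_new
    return rpn(f_new)                    # step (4): Box and Person sets

# Smoke test with trivial stand-ins: one detection at probability 0.9.
boxes, probs = detect_group_persons(
    video="frames",
    rpn=lambda f: ([[0, 0, 10, 10]], [0.9]),
    normalize=lambda v: v,
    extract_features=lambda f: [f],
    fuse_features=lambda ls: ls[0],
)
```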
2. The video group person localization detection method based on multi-dimensional fusion features according to claim 1, characterized in that in step (1), H = 720 and W = 1280.
3. The video group person localization detection method based on multi-dimensional fusion features according to claim 1, characterized in that in step (2), numF = 4.
4. The video group person localization detection method based on multi-dimensional fusion features according to claim 1, characterized in that in step (3), conv1 = 1, stride1 = 1, upSample1 = 2, conv2 = 3, stride2 = 2, conv3 = 1, stride3 = 1.
5. The video group person localization detection method based on multi-dimensional fusion features according to claim 1, characterized in that in step (4), K = 12, and in step (7), Loss_max = 0.5, λ = 1, α = 0.0001.
CN201910235608.5A 2019-03-26 2019-03-26 Video group figure positioning detection method based on multi-dimensional fusion features Active CN109903339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910235608.5A CN109903339B (en) 2019-03-26 2019-03-26 Video group figure positioning detection method based on multi-dimensional fusion features

Publications (2)

Publication Number Publication Date
CN109903339A true CN109903339A (en) 2019-06-18
CN109903339B CN109903339B (en) 2021-03-05

Family

ID=66953909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910235608.5A Active CN109903339B (en) 2019-03-26 2019-03-26 Video group figure positioning detection method based on multi-dimensional fusion features

Country Status (1)

Country Link
CN (1) CN109903339B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140307917A1 (en) * 2013-04-12 2014-10-16 Toyota Motor Engineering & Manufacturing North America, Inc. Robust feature fusion for multi-view object tracking
CN107341471A * 2017-07-04 2017-11-10 Nanjing University of Posts and Telecommunications Human behavior recognition method based on double-layer conditional random fields
CN108038867A * 2017-12-22 2018-05-15 Hunan Yuanxin Optoelectronic Technology Co., Ltd. Fire detection and localization method based on multi-feature fusion and stereo vision
CN108229319A * 2017-11-29 2018-06-29 Nanjing University Ship video detection method based on fusion of frame differencing and convolutional neural networks
CN108399435A * 2018-03-21 2018-08-14 Nanjing University of Posts and Telecommunications Video classification method based on audio features
CN108846446A * 2018-07-04 2018-11-20 Academy of Broadcasting Science, SAPPRFT Object detection method based on multi-path dense feature fusion with fully convolutional networks
CN108898078A * 2018-06-15 2018-11-27 University of Shanghai for Science and Technology Multi-scale deconvolution neural network method for real-time traffic sign detection and recognition
CN109472298A * 2018-10-19 2019-03-15 Tianjin University Deep binary feature pyramid enhancement network for small-scale object detection
CN109508686A * 2018-11-26 2019-03-22 Nanjing University of Posts and Telecommunications Human behavior recognition method based on hierarchical feature subspace learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAN FEIGANG等: "Person Re-Identification Based on Multi-Level and Multi-Feature Fusion", 《2017 INTERNATIONAL CONFERENCE ON SMART CITY AND SYSTEMS ENGINEERING (ICSCSE)》 *
LI, HE: "Research on Tracking Algorithm Based on Convolutional Neural Network Feature Sharing and Object Detection", China Master's Theses Full-text Database *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675391A * 2019-09-27 2020-01-10 Lenovo (Beijing) Co., Ltd. Image processing method, apparatus, computing device, and medium
CN110675391B * 2019-09-27 2022-11-18 Lenovo (Beijing) Co., Ltd. Image processing method, apparatus, computing device, and medium
CN111488834A * 2020-04-13 2020-08-04 Henan Polytechnic University Crowd counting method based on multi-level feature fusion
CN111488834B * 2020-04-13 2023-07-04 Henan Polytechnic University Crowd counting method based on multi-level feature fusion
CN111491180A * 2020-06-24 2020-08-04 Tencent Technology (Shenzhen) Co., Ltd. Method and device for determining key frames
CN113610056A * 2021-08-31 2021-11-05 Dilu Technology Co., Ltd. Obstacle detection method and device, electronic device, and storage medium
CN113610056B * 2021-08-31 2024-06-07 Dilu Technology Co., Ltd. Obstacle detection method and device, electronic device, and storage medium
CN114255384A * 2021-12-14 2022-03-29 Guangdong Bright Dream Robotics Co., Ltd. Method and device for detecting the number of people, electronic device, and storage medium
CN114494999A * 2022-01-18 2022-05-13 Southwest Jiaotong University Dual-branch joint dense object prediction method and system
CN114494999B * 2022-01-18 2022-11-15 Southwest Jiaotong University Dual-branch joint dense object prediction method and system

Also Published As

Publication number Publication date
CN109903339B (en) 2021-03-05

Similar Documents

Publication Publication Date Title
CN109903339A (en) A kind of video group personage's position finding and detection method based on multidimensional fusion feature
Jia et al. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot
CN110598029B (en) Fine-grained image classification method based on attention transfer mechanism
CN111898406B (en) Face detection method based on focus loss and multitask cascade
CN110472627A (en) One kind SAR image recognition methods end to end, device and storage medium
CN108805070A (en) A kind of deep learning pedestrian detection method based on built-in terminal
CN110458844A (en) A kind of semantic segmentation method of low illumination scene
CN110084173A (en) Number of people detection method and device
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN109961034A (en) Video object detection method based on convolution gating cycle neural unit
CN109583425A (en) A kind of integrated recognition methods of the remote sensing images ship based on deep learning
CN107818302A (en) Non-rigid multi-scale object detection method based on convolutional neural network
CN108010049A (en) Split the method in human hand region in stop-motion animation using full convolutional neural networks
CN106951867A (en) Face identification method, device, system and equipment based on convolutional neural networks
CN109559300A (en) Image processing method, electronic equipment and computer readable storage medium
CN108961675A (en) Fall detection method based on convolutional neural networks
CN106649487A (en) Image retrieval method based on interest target
CN110490177A (en) A kind of human-face detector training method and device
CN109508360A (en) A kind of polynary flow data space-time autocorrelation analysis method of geography based on cellular automata
CN107945153A (en) A kind of road surface crack detection method based on deep learning
CN108447080A (en) Method for tracking target, system and storage medium based on individual-layer data association and convolutional neural networks
CN106599800A (en) Face micro-expression recognition method based on deep learning
CN109635812B (en) The example dividing method and device of image
CN109902558A (en) A kind of human health deep learning prediction technique based on CNN-LSTM
CN107730515A (en) Panoramic picture conspicuousness detection method with eye movement model is increased based on region

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant