CN110610210B - Multi-target detection method - Google Patents
Multi-target detection method
- Publication number
- CN110610210B (application CN201910881579.XA)
- Authority
- CN
- China
- Prior art keywords
- frame
- layer
- convolution
- positioning information
- activation
- Legal status
- Active
Classifications
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/2451—Classification techniques relating to the decision surface linear, e.g. hyperplane
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V2201/07—Target detection
Abstract
The invention discloses a multi-target detection method comprising the following steps: S1, extracting a basic feature map and a context feature map; S2, capturing fuzzy activation regions in the real-time image and taking their coordinate information as the first batch of positioning information; S3, setting the cycle count n = 1; S4, taking the coordinate pairs of the nth batch of positioning information as centers and acquiring, on the basic feature map, a local feature matrix of a fixed area near each center position; S5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module; S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached, then outputting all positioning information; S7, inputting all positioning information into the area suggestion network; S8, looping steps S1 to S7 and summing all errors. The invention can output positioning information through the predefined double-layer cyclic convolution emission module, thereby obtaining the approximate positions of target objects in the image and greatly reducing the amount of computation per feature point.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a method for detecting an image target in the field of computers.
Background
Nowadays, high-speed parallel computing architectures represented by the NVIDIA series are developing rapidly, and their products have gradually evolved, from DirectX to current computing devices such as the GTX 1080 Ti, into more civilian-oriented parallel computing platforms. Under this trend, the various fields that need abundant computing resources have developed quickly, with image processing technology leading the way and driving progress in numerous fields such as intelligent technology, monitoring technology and security technology. In addition, several related hardware technologies in the field of real-time image perception are also developing, such as infrared cameras and the monocular and binocular cameras among peripheral devices; this perception hardware is gradually evolving toward structures that better conform to the human visual system, which facilitates the processing of images by software algorithms. With the dual support of the image perception module and the embedded computing system, how to apply more innovative and ergonomic image analysis technology to mobile intelligent machines has become a highly challenging frontier subject spanning software, hardware and multiple disciplines.
In recent years, owing to the rapid development of hardware systems, many higher-performance real-time image analysis and processing methods have emerged, and an important problem to be solved among them is the real-time detection of multiple targets in an image. At present, many mature multi-target detection methods have appeared in industry. In the field of traditional machine learning, target detection is generally divided into three steps: brute-force extraction of candidate regions, hand-designed feature extraction, and classification using the fast Adaboost algorithm or an SVM algorithm with strong generalization capability.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-target detection method which can output positioning information through a predefined double-layer cyclic convolution emission module, thereby obtaining the approximate position of each target object in the image, greatly reducing the amount of computation per feature point, avoiding the per-position anchoring and computation of the Faster R-CNN method, and making the detection speed better suited to real-time conditions.
The purpose of the invention is realized by the following technical scheme: a multi-target detection method comprises the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image, and inputting the real-time image into a context integration network to obtain a context feature map;
s2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
s3, setting the cycle number n to 1;
s4, taking the coordinate pair of the nth batch of positioning information as a center, acquiring a local feature matrix of a fixed area near the center position on the basic feature map, and filtering and pooling the local feature matrix to obtain a focusing feature;
s5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module, which outputs two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)th batch of positioning information. This batch of positioning information is written into the positioning cache pool, which is maintained globally: after each pass of the loop body, the positioning information output by that pass is injected into the positioning cache pool;
s6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output, together with two error values: the first error value is the error between the predicted category near each position and the labeled category, and the second error value is the error between the positioning information and the coordinates of the labeled target frame;
s7, inputting all positioning information in the positioning cache pool into the area suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frame, and the second error value is the error between the predicted category of the first suggestion candidate boxes and the real category of the target frame;
inputting the first suggestion candidate boxes into a suggestion-target module, which screens and refines the first suggestion candidate box set and outputs the second suggestion candidate boxes, the corresponding category label of each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method, which outputs final interest features of consistent size through a pooling operation; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the predicted category and predicted frame coordinates within each candidate frame corresponding to the final interest features, and generating two error values: the first error is the error between the predicted category and the labeled category, and the second error is the error between the coordinates of the predicted frame and the labeled coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
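To make the control flow of the cyclic part (steps S3 to S6) concrete, the following Python sketch grows a positioning cache pool over a preset number of cycles. Every function, array shape and number in it is a hypothetical placeholder standing in for the networks described above, not the patent's actual implementation.

```python
import numpy as np

def crop_local_features(base_map, centers, size=2):
    """S4: take a fixed-size local feature matrix around each (x, y) center."""
    c, h, w = base_map.shape
    crops = []
    for x, y in centers:
        x0 = int(np.clip(x, 0, w - size))
        y0 = int(np.clip(y, 0, h - size))
        crops.append(base_map[:, y0:y0 + size, x0:x0 + size])
    return np.stack(crops)  # (m, C, size, size)

def emission_module(focus_feat, context_feat):
    """S5 placeholder: returns (class confidences, next batch of centers)."""
    m = focus_feat.shape[0]
    confidence = np.random.rand(m)              # confidence per focused feature
    next_centers = np.random.rand(m, 2) * 36    # predicted (x, y) of batch n + 1
    return confidence, next_centers

base_map = np.random.rand(128, 38, 50)          # S1: basic feature map (toy shape)
context_map = np.random.rand(128, 38, 50)       # S1: context feature map
cache_pool = [np.array([[10.0, 12.0], [30.0, 25.0]])]   # S2: first batch of positions

for n in range(1, 6):                           # S3 / S6: preset number of cycles
    focus = crop_local_features(base_map, cache_pool[-1])           # S4
    confidence, next_centers = emission_module(focus, context_map)  # S5
    cache_pool.append(next_centers)             # inject into the positioning cache pool

all_positions = np.concatenate(cache_pool)      # S6: all positioning information
print(all_positions.shape)                      # e.g. (12, 2)
```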
Further, the context integration network is formed by stacking basic convolution operation units, where a single convolution operation unit is expressed as:

x_j^l = f( Σ_{m∈M_j} x_m^{l-1} * k_{m,j}^l + b_j^l )

where l denotes the lth convolutional layer; j denotes the jth feature map of the current convolutional layer; x_m^{l-1} denotes a feature map of the (l-1)th convolutional layer; k_{m,j}^l denotes the mth convolution kernel of the jth feature map of the lth convolutional layer; M_j denotes the set of all convolution kernels corresponding to the jth feature map; * denotes the convolution operation; b_j^l denotes the bias vector parameter of the jth feature map of the lth convolutional layer; f(·) denotes the activation function.
Further, the step S2 includes the following sub-steps:
s21, inputting the original image into a superposed basic convolution operation unit, wherein two basic convolution operation units and one basic pooling unit are used as a convolution block unit, five convolution block units with the same structure are used for cascading, and after the cascading, a characteristic map of the original image is output;
s22, inputting the feature map into the GAP layer and outputting a one-dimensional vector, wherein each element of the one-dimensional vector is the mean of the feature matrix of one channel of the feature map; computing a weighted sum of all values in the one-dimensional vector and passing it through an activation function layer to obtain the class probabilities;
s23, performing a weighted summation over the output features of the last convolution block and solving the class-based activation map, with the formula:

M_c(x, y) = Σ_k w_k^c · f_k(x, y)

where f_k(x, y) denotes the activation value, at coordinate (x, y), of the kth unit of the feature vector over the last convolution block's output features; w_k^c denotes the weight of each unit k for each class c, i.e. the importance of unit k for class c;
s24, scaling the class-based activation map obtained above to the same size as the original image, comparing the correlation between each activation region and the class, and computing the coordinate point with the highest local correlation:

c_i = max( g(x_0, y_0), g(x_1, y_1), ..., g(x_N, y_N) )

where g(·) denotes the pixel value at a location, (x_i, y_i) is a coordinate point within the local activation region, and c_i denotes the correlation in the ith local activation region;
and outputting the obtained coordinate point with the highest local correlation as a coordinate point set of the first batch of positioning information.
Further, the double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context features as input, continuously explores the optimal positioning through a double-layer cyclic emission network optimized by the back propagation algorithm, and finally outputs a fixed quantity of positioning information; the specific steps are as follows:
s51, using each image as the processing unit, taking the tth batch of positioning information L_t = ((x_0, y_0), (x_1, y_1), ..., (x_m, y_m)), extracting the high-dimensional vectors within the corresponding fixed 2 × 2 range on the basic feature map, and processing them through vector operations into a fixed-dimension localization feature tensor P_t;
S52, inputting the localization feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting an activated localization feature tensor, wherein the formula is as follows:
P_t_active = ReLU(BN(Conv2d(P_t)))
where ReLU(x) = x when x > 0 and ReLU(x) = 0 when x ≤ 0; BN(·) is the Batch-Normalization layer from deep learning, whose main function is to prevent the network from overfitting; Conv2d(·) is a deep-learning network layer whose main function is to extract image features using convolution operations;
The positioning information L_t is likewise input into a convolution layer, a regularization layer and an excitation function layer, and an activated positioning information tensor is output:

L_t_active = ReLU(BN(Conv2d(L_t)))

The two tensors are then multiplied to obtain the focusing feature tensor:

G_t = P_t_active ⊗ L_t_active
s53, in the loop operation, each time step corresponds to one loop unit; the specific implementation is as follows:
s531, at the first time step of the loop, the hidden state of the first-layer convolutional LSTM structure is initialized with a zero vector; otherwise, the focusing feature tensor and the hidden state of the previous time step are input into the first-layer convolutional LSTM encoder e(·), which outputs the encoder's new hidden state:

h_t^e, c_t^e = e(G_t, h_{t-1}^e, c_{t-1}^e)

where h_t^e denotes the new hidden state of the encoder e at time t; c_t^e denotes the new cell state of the encoder e at time t, which comes from a step defined in the existing LSTM network structure and stores hidden information that remains valid in long-term memory; G_t denotes the focusing feature tensor at time t;
The hidden state h_t^e of the convolutional LSTM encoder is input into a cascaded convolution network ec(·) and a linear classifier, and the classification probability of the focus region is output:

V = W_2 · (W_1 · ec(h_t^e) + b_1) + b_2
Prob_i = exp(V_i) / Σ_{c=1}^{C} exp(V_c)

where V denotes the output classification score vector; W_1 denotes the first weight parameter and W_2 the second weight parameter; b_1 denotes the first bias parameter and b_2 the second bias parameter; Prob_i is the probability of a certain class, V_i is the output of the ith unit at the front stage of the classifier, and C denotes the total number of classes;
this concludes the output definition of the first layer;
s532, in the second-layer convolutional LSTM decoder d(·): if this is the initial time step of the loop, the decoder takes the context feature map as its initialization value; otherwise, the decoder takes h_t^e and the previous hidden state of this layer as input and outputs the decoder's new hidden state:

h_t^d, c_t^d = d(h_t^e, h_{t-1}^d, c_{t-1}^d)

The hidden state h_t^d of the convolutional LSTM decoder is input into a linear regressor el(·), which outputs the two-dimensional coordinates of the attention position at the next time step, and these coordinates are stored in the positioning cache pool:

L_{t+1} = el(h_t^d)

this concludes the output definition of the second layer;
s533, within the current time-step loop, the double-layer network combines the hidden information of past time steps with the current information and calculates the local classification error of the image using the cross-entropy method; within the same time-step loop, it likewise calculates the positioning error of the next time step using the mean-squared-error method:

loss_cls = - Σ_c gt_c · log(Prob_c)
loss_loc = (1/m) Σ_i ( y_i - ŷ_i )²

where gt denotes the labeled category, y_i denotes the labeled coordinates, and ŷ_i denotes the predicted value output for the current labeled coordinate; when computing the loss function, the losses of each image at every time step are summed and averaged to give the final loss.
And s534, looping steps S531 to S533, and taking the final loss and all the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
Further, the regional suggestion network takes the basic feature map, the mark value of the target frame and the positioning information of the positioning cache pool as input, improves the RPN method according to the positioning information, which is abbreviated as LRPN, and then outputs the coordinates of a fixed number of suggestion candidate frames and the intra-frame prediction result, and outputs two loss functions;
the specific implementation mode is as follows:
s71, inputting the basic feature map into a convolution network and an activation network, and outputting an activation feature map;
s72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation map; performing a convolution operation with stride 1 and a 1 × 1 convolution kernel on the activation feature map and outputting a score tensor with 2 × A channels, where the channels of the tensor represent the class prediction probability scores within the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map; performing a convolution operation with stride 1 and a 1 × 1 convolution kernel on the LRPN activation feature map and outputting a coordinate offset tensor with 4 × A channels, where the channels of the tensor represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map, used to solve for the optimal predicted coordinates;
s73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, which specifically comprises the following steps:
s731, screening out the corresponding effective anchor frames in the tensor according to the positioning information, and removing effective anchor frames that exceed the image boundary;
s732, sorting the anchor frames and the corresponding scores by score and keeping the first N, where N is a hyper-parameter;
s733, screening with the non-maximum suppression method, and taking the anchor frames corresponding to the first M remaining scores, sorted by size, as the first suggestion candidate frame set;
s734, setting the labels of all anchor frames that do not meet the following conditions to -1:
a. an anchor frame corresponding to the positioning information;
b. anchor frames that do not exceed the image boundaries;
modifying each anchor frame based on the predicted coordinate offset, comparing the modified anchor frame coordinates with the labeled target frames in the image, selecting the F anchor frames with the largest overlap rate above a positive threshold and setting their labels to 1, and selecting the B anchor frames with the largest overlap rate below a negative threshold and setting their labels to 0; both the F value and the B value are set according to the hyper-parameters of the Faster R-CNN method;
s735, removing the anchor frames labeled -1 and solving the loss functions of the remaining anchor frames, obtaining the loss lrpn_cls between the anchor frame category predictions and the category labels and the loss lrpn_bbox between the anchor frame coordinate predictions and the coordinate labels, and outputting the first suggestion candidate frames;
s736, screening and refining the first suggestion candidate box set, i.e., inputting the first suggestion candidate box set into a suggestion-target module, specifically operating as follows:
s7361, traversing the frame coordinates of all the labels for any frame coordinate in the first suggested candidate frame set, selecting the frame coordinate with the largest overlapping rate as a corresponding label frame, if the overlapping rate of the label frame and the candidate frame is greater than a threshold value, considering the candidate frame as a foreground, and if the overlapping rate of the label frame and the candidate frame is less than the threshold value, considering the candidate frame as a background;
s7362, setting a fixed number of foreground and background for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as a second candidate frame set;
s7363, calculating the offset between the coordinates of the second candidate frame and the coordinates of the corresponding label frame, and using the offset and the second candidate frame set as the output of the module.
The invention has the following beneficial effects. A number of cascaded modules are set up: a convolution activation network extracts activation positions based on discriminative features in the image as the initial values input to the double-layer cyclic convolution emission module; features used for training are then extracted by the convolution depth network and the context integration network; the double-layer cyclic convolution emission module yields a positioning information set based on visual attention; the area suggestion network yields the second suggestion candidate frames based on the positioning information; and finally the ROI pooling module and the RCNN module predict the categories and coordinates of the features within the suggestion candidate frames. The advantage of the algorithm is that positioning information can be output by the predefined double-layer cyclic convolution emission module, so that the approximate positions of target objects in the image are obtained, the amount of computation per feature point is greatly reduced, the per-position anchoring and computation of the Faster R-CNN method are avoided, and the detection speed is better suited to real-time conditions.
Drawings
FIG. 1 is a flow chart of a multi-target detection method of the present invention.
Detailed Description
When a visual device such as a monocular camera on a mobile machine acquires real-time images of the machine's surroundings, the embedded computing system needs to perform target detection on the images in time so as to judge the positions and sizes of targets in the current environment and take corresponding action. For this requirement, an accurate and fast multi-target detection method is crucial. In this process, mainstream methods need to process all regions of the image, and each processed region may overlap with other processed regions. In a hierarchical deep-learning structure, the huge number of region suggestions correspondingly increases the number of weight coefficients in the feature expression function, so the invention designs a scheme that, by drawing on the human visual focusing mechanism, improves region processing efficiency and reduces the load on the computing system.
The method sets up a number of cascaded modules: a convolution activation network extracts activation positions based on discriminative features in the image as the initial values input to the double-layer cyclic convolution emission module; features used for training are then extracted by the convolution depth network and the context integration network; the double-layer cyclic convolution emission module yields a positioning information set based on visual attention; the area suggestion network yields the second suggestion candidate frames based on the positioning information; and finally the ROI pooling module and the RCNN module predict the categories and coordinates of the features within the suggestion candidate frames. The advantage of the algorithm is that positioning information can be output by the predefined double-layer cyclic convolution emission module, so that the approximate positions of target objects in the image are obtained, the amount of computation per feature point is greatly reduced, the per-position anchoring and computation of the Faster R-CNN method are avoided, and the detection speed is better suited to real-time conditions. The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in fig. 1, a multi-target detection method includes the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image through a relevant technology such as ResNet series, and simultaneously inputting the real-time image into a context integration network to obtain a context feature map as the initial input of a subsequent module;
when hardware such as a camera acquires an original image with larger spatial resolution, a context synthesis network acts on the original image, the context synthesis network is formed by overlapping basic convolution operation units, wherein the expression of a single convolution operation unit formula is as follows:
wherein,l represents the first layer of the convolutional layer; j represents the jth feature map of the current convolutional layer;a jth feature diagram representing the l-1 th roll base;an mth convolution kernel representing a jth feature map of the ith layer volume base layer; mjRepresenting all convolution kernel sets corresponding to the jth feature map; symbol denotes convolution operation;an offset vector parameter representing a jth characteristic diagram of the ith layer volume base layer; f (-) represents the activation function.
In the context integration network, the first 9 convolution operation units of the related VGG16 network are selected as the architecture; the input channel count, output channel count, convolution kernel size, convolution stride and padding parameter of each convolution operation unit are fixed; the original image with 3 channels is input into the first convolution operation unit, and a context feature map with 128 channels is finally output.
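A minimal sketch of such a context integration network, assuming a VGG16-style stack of nine convolution operation units with illustrative channel widths (the patent fixes these values but does not list them here):

```python
import torch
import torch.nn as nn

def conv_unit(in_ch, out_ch):
    # one basic convolution operation unit: Conv2d followed by the activation f(.)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True))

# 9 units, 3-channel input image, 128-channel context feature map at the end;
# the intermediate widths are assumptions, not the patent's exact configuration.
channels = [3, 64, 64, 128, 128, 128, 128, 128, 128, 128]
context_net = nn.Sequential(*[conv_unit(channels[i], channels[i + 1])
                              for i in range(9)])

image = torch.randn(1, 3, 224, 224)             # real-time image (toy size)
context_feature_map = context_net(image)        # shape (1, 128, 224, 224)
print(context_feature_map.shape)
```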
In this scheme, the purpose of the context integration network is to serve as the initialization basis of the double-layer cyclic convolution emission module, so that the emission module obtains the extracted coarse features in advance, acquires the global information of the image, and accelerates the process of accurately locating the fuzzy positions of the targets.
S2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all the activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
based on an existing unsupervised algorithm scheme CAM, the convolution activation network mainly realizes a process of generating an activation map based on a category unsupervised by a GAP algorithm and outputting target fuzzy positioning information. The method specifically comprises the following substeps:
s21, inputting the original image into a superposed basic convolution operation unit, wherein two basic convolution operation units and one basic pooling unit are used as a convolution block unit, five convolution block units with the same structure are used for cascading, and after the cascading, a characteristic map of the original image is output;
s22, inputting the feature map into the GAP layer and outputting a one-dimensional vector, wherein each element of the one-dimensional vector is the mean of the feature matrix of one channel of the feature map; computing a weighted sum of all values in the one-dimensional vector and passing it through an activation function layer to obtain the class probabilities;
s23, based on the output of the class activation function, the important regions in the original image are marked and visualized by mapping the weights output by the GAP layer onto the output features of the last convolution block; specifically, a weighted summation is performed over the output features of the last convolution block and the class-based activation map is solved, with the formula:

M_c(x, y) = Σ_k w_k^c · f_k(x, y)

where f_k(x, y) denotes the activation value, at coordinate (x, y), of the kth unit of the feature vector over the last convolution block's output features; w_k^c denotes the weight of each unit k for each class c, i.e. the importance of unit k for class c; after the GAP layer, for each unit, the activation values at all coordinate positions are solved and summed.
S24, in the convolution depth activation network, the activation regions are weight-mapped so as to highlight their importance in the original image. The class-based activation map obtained in the above steps is scaled to the same size as the original image, the correlation between each activation region and the class is compared, and the coordinate point with the highest local correlation is calculated:

c_i = max( g(x_0, y_0), g(x_1, y_1), ..., g(x_N, y_N) )

where g(·) denotes the pixel value at a location, (x_i, y_i) is a coordinate point within the local activation region, and c_i denotes the correlation in the ith local activation region;
and outputting the obtained coordinate point with the highest local correlation as a coordinate point set of the first batch of positioning information.
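The following sketch illustrates the CAM-style procedure of steps S21 to S24 with a toy backbone; the network, class count and the single-peak selection at the end are simplifying assumptions rather than the patent's actual convolution activation network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(                       # stand-in for the five conv block units
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
num_classes = 20
classifier = nn.Linear(128, num_classes)        # holds the weights w_k^c used by CAM

image = torch.randn(1, 3, 224, 224)
feat = backbone(image)                          # (1, 128, 56, 56): features f_k(x, y)
gap = feat.mean(dim=(2, 3))                     # GAP layer: one value per channel
logits = classifier(gap)                        # weighted sum -> class scores
c = logits.argmax(dim=1).item()                 # class of interest

# M_c(x, y) = sum_k w_k^c * f_k(x, y), then scale to the original image size
w_c = classifier.weight[c].view(1, -1, 1, 1)    # (1, 128, 1, 1)
cam = (w_c * feat).sum(dim=1, keepdim=True)     # (1, 1, 56, 56)
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)

# coordinate of the highest-activation point; a full implementation would take
# one peak per local activation region as the first batch of positioning information
flat_idx = cam.view(-1).argmax().item()
y0, x0 = divmod(flat_idx, cam.shape[-1])
print((x0, y0))
```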
S3, setting the cycle number n to 1;
s4, this step is the entry of the loop body, and its input comes from the output generated in the previous cycle: the nth batch of positioning information; if the loop body is entered for the first time, the first batch of positioning information from step S2 is used as the centers; otherwise, the coordinate pairs of the nth batch of positioning information are used as the centers; a local feature matrix of a fixed area near each center position is acquired on the basic feature map, and the local feature matrix is filtered and pooled to obtain the focusing feature;
s5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module, which outputs two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)th batch of positioning information. This batch of positioning information is written into the positioning cache pool, which is maintained globally: after each pass of the loop body, the positioning information output by that pass is injected into the positioning cache pool; at this point, the output at the exit of the loop body is the (n+1)th batch of positioning information, which is the input for the next entry into the loop body;
The double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context features as input, continuously explores the optimal positioning through a double-layer cyclic emission network optimized by the back propagation algorithm, and finally outputs a fixed quantity of positioning information; the specific steps are as follows:
s51, using each image as the processing unit, taking the tth batch of positioning information L_t = ((x_0, y_0), (x_1, y_1), ..., (x_m, y_m)), extracting the high-dimensional vectors within the corresponding fixed 2 × 2 range on the basic feature map, and processing them through vector operations into a fixed-dimension localization feature tensor P_t;
S52, inputting the localization feature tensor into a convolution layer, a regularization layer and an excitation function layer, and outputting an activated localization feature tensor, wherein the formula is as follows:
P_t_active = ReLU(BN(Conv2d(P_t)))
where ReLU(x) = x when x > 0 and ReLU(x) = 0 when x ≤ 0; BN(·) is the Batch-Normalization layer from deep learning, whose main function is to prevent the network from overfitting; Conv2d(·) is a deep-learning network layer whose main function is to extract image features using convolution operations;
The positioning information L_t is likewise input into a convolution layer, a regularization layer and an excitation function layer, and an activated positioning information tensor is output:

L_t_active = ReLU(BN(Conv2d(L_t)))

The two tensors are then multiplied to obtain the focusing feature tensor:

G_t = P_t_active ⊗ L_t_active
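A rough sketch of step S52, assuming both P_t and the positioning information L_t have already been arranged as four-dimensional tensors of the same shape (how L_t is reshaped to match P_t is an assumption here, as are the channel counts):

```python
import torch
import torch.nn as nn

class ActivateBranch(nn.Module):
    """Convolution layer + regularization (BN) layer + excitation (ReLU) layer."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))    # ReLU(BN(Conv2d(x)))

m, c = 8, 128                                        # 8 focus points, 128 channels
P_t = torch.randn(m, c, 2, 2)                        # localization feature tensor (2x2 crops)
L_t = torch.randn(m, c, 2, 2)                        # positioning information, embedded to the same shape

branch_p, branch_l = ActivateBranch(c), ActivateBranch(c)
P_t_active = branch_p(P_t)
L_t_active = branch_l(L_t)
G_t = P_t_active * L_t_active                        # element-wise tensor product
print(G_t.shape)                                     # (8, 128, 2, 2)
```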
s53, in the loop operation, each time step corresponds to one loop unit; the specific implementation is as follows:
s531, at the first time step of the loop, the hidden state of the first-layer convolutional LSTM structure is initialized with a zero vector; otherwise, the focusing feature tensor and the hidden state of the previous time step are input into the first-layer convolutional LSTM encoder e(·), which outputs the encoder's new hidden state:

h_t^e, c_t^e = e(G_t, h_{t-1}^e, c_{t-1}^e)

where h_t^e denotes the new hidden state of the encoder e at time t; c_t^e denotes the new cell state of the encoder e at time t, which comes from a step defined in the existing LSTM network structure and stores hidden information that remains valid in long-term memory; G_t denotes the focusing feature tensor at time t;
The hidden state h_t^e of the convolutional LSTM encoder is input into a cascaded convolution network ec(·) and a linear classifier, and the classification probability of the focus region is output:

V = W_2 · (W_1 · ec(h_t^e) + b_1) + b_2
Prob_i = exp(V_i) / Σ_{c=1}^{C} exp(V_c)

where V denotes the output classification score vector; W_1 denotes the first weight parameter and W_2 the second weight parameter; b_1 denotes the first bias parameter and b_2 the second bias parameter; the first formula mainly solves for the feature vector obtained after the classification operation on the current focus region;
Prob_i is the probability of a certain class, V_i is the output of the ith unit at the front stage of the classifier, and C denotes the total number of classes; the second formula mainly maps the feature vector of the previous formula to a classification probability value for a certain class;
this concludes the output definition of the first layer;
s532, in the second-layer convolutional LSTM decoder d(·): if this is the initial time step of the loop, the decoder takes the context feature map as its initialization value; otherwise, the decoder takes h_t^e and the previous hidden state of this layer as input and outputs the decoder's new hidden state:

h_t^d, c_t^d = d(h_t^e, h_{t-1}^d, c_{t-1}^d)

The hidden state h_t^d of the convolutional LSTM decoder is input into a linear regressor el(·), which outputs the two-dimensional coordinates of the attention position at the next time step, and these coordinates are stored in the positioning cache pool:

L_{t+1} = el(h_t^d)

this concludes the output definition of the second layer;
s533, within the current time-step loop, the double-layer network combines the hidden information of past time steps with the current information and calculates the local classification error of the image using the cross-entropy method; within the same time-step loop, it likewise calculates the positioning error of the next time step using the mean-squared-error method:

loss_cls = - Σ_c gt_c · log(Prob_c)
loss_loc = (1/m) Σ_i ( y_i - ŷ_i )²

where gt denotes the labeled category, y_i denotes the labeled coordinates, and ŷ_i denotes the predicted value output for the current labeled coordinate; when computing the loss function, the losses of each image at every time step are summed and averaged to give the final loss.
And s534, looping steps S531 to S533, and taking the final loss and all the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
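The following sketch illustrates the two-layer recurrent emission step (S531 to S532). For brevity, a plain nn.LSTMCell on flattened features stands in for the convolutional LSTM encoder e(·) and decoder d(·); the dimensions, the classifier head for ec(·) and the regressor el(·) are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim, hidden, num_classes = 128 * 2 * 2, 256, 20
encoder = nn.LSTMCell(feat_dim, hidden)              # first layer: e(.)
decoder = nn.LSTMCell(hidden, hidden)                # second layer: d(.)
classifier = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                           nn.Linear(64, num_classes))   # ec(.) + linear classifier
regressor = nn.Linear(hidden, 2)                     # el(.): next (x, y) location

m, T = 8, 5                                          # focus points per batch, time steps
h_e = torch.zeros(m, hidden)                         # zero-initialised encoder state
c_e = torch.zeros(m, hidden)
h_d = torch.zeros(m, hidden)                         # would be the context features in the patent
c_d = torch.zeros(m, hidden)
cache_pool = []

for t in range(T):
    G_t = torch.randn(m, feat_dim)                   # focusing feature tensor (flattened)
    h_e, c_e = encoder(G_t, (h_e, c_e))              # h_t^e, c_t^e = e(G_t, h_{t-1}^e, c_{t-1}^e)
    probs = classifier(h_e).softmax(dim=1)           # Prob_i for the focus region
    h_d, c_d = decoder(h_e, (h_d, c_d))              # h_t^d = d(h_t^e, h_{t-1}^d)
    next_xy = regressor(h_d)                         # L_{t+1}: attention location at t + 1
    cache_pool.append(next_xy)                       # stored in the positioning cache pool

all_locations = torch.cat(cache_pool)                # output of the emission module
print(all_locations.shape)                           # (m * T, 2)
```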
S6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output, together with two error values: the first error value is the error between the predicted category near each position and the labeled category, and the second error value is the error between the positioning information and the coordinates of the labeled target frame;
s7, inputting all positioning information in the positioning cache pool into the area suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frame, and the second error value is the error between the predicted category of the first suggestion candidate boxes and the real category of the target frame;
inputting the first suggestion candidate boxes into a suggestion-target module, which screens and refines the first suggestion candidate box set and outputs the second suggestion candidate boxes, the corresponding category label of each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
the regional suggestion network takes a basic feature map, a target frame mark value and positioning information of a positioning cache pool as input, improves an RPN method according to the positioning information, is abbreviated as LRPN, and then outputs coordinates of a fixed number of suggestion candidate frames and in-frame prediction results and outputs two loss functions;
the specific implementation mode is as follows:
s71, inputting the basic feature map into a convolution network and an activation network, and outputting an activation feature map;
s72, introducing anchor frame rules and setting A anchor frames for each spatial position on the activation map; performing a convolution operation with stride 1 and a 1 × 1 convolution kernel on the activation feature map and outputting a score tensor with 2 × A channels, where the channels of the tensor represent the class prediction probability scores within the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map; performing a convolution operation with stride 1 and a 1 × 1 convolution kernel on the LRPN activation feature map and outputting a coordinate offset tensor with 4 × A channels, where the channels of the tensor represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activation feature map, used to solve for the optimal predicted coordinates;
s73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, which specifically comprises the following steps:
s731, screening out the corresponding effective anchor frames in the tensor according to the positioning information, and removing effective anchor frames that exceed the image boundary;
s732, sorting the anchor frames and the corresponding scores by score and keeping the first N, where N is a hyper-parameter;
s733, screening with the non-maximum suppression method, and taking the anchor frames corresponding to the first M remaining scores, sorted by size, as the first suggestion candidate frame set;
s734, setting the labels of all anchor frames that do not meet the following conditions to -1:
a. an anchor frame corresponding to the positioning information;
b. anchor frames that do not exceed the image boundaries;
modifying each anchor frame based on the predicted coordinate offset, comparing the modified anchor frame coordinates with the labeled target frames in the image, selecting the F anchor frames with the largest overlap rate above a positive threshold and setting their labels to 1, and selecting the B anchor frames with the largest overlap rate below a negative threshold and setting their labels to 0; the F value is set according to the hyper-parameters of the Faster R-CNN method: the ratio of selected anchor frames above the positive threshold to anchor frames below the negative threshold is 1:2 and the total number of anchor frames is 300, so F is set to 100; the B value is likewise set to 200 according to the hyper-parameter settings of the Faster R-CNN method;
s735, removing the anchor frames labeled -1 and solving the loss functions of the remaining anchor frames, obtaining the loss lrpn_cls between the anchor frame category predictions and the category labels and the loss lrpn_bbox between the anchor frame coordinate predictions and the coordinate labels, and outputting the first suggestion candidate frames;
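A small sketch of the anchor labeling in step S734, using the thresholds and the F = 100 / B = 200 counts described above on synthetic boxes; the IoU-based selection shown here is one plausible reading of the overlap-rate rule, not the patent's exact procedure.

```python
import numpy as np

def iou(a, b):
    """Overlap rate of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

rng = np.random.default_rng(0)
anchors = rng.random((300, 4)) * 200
anchors[:, 2:] += anchors[:, :2]                 # ensure x2 > x1 and y2 > y1
gt_boxes = np.array([[30, 30, 120, 110], [150, 40, 220, 160]], dtype=float)

pos_thresh, neg_thresh, F, B = 0.7, 0.3, 100, 200    # thresholds are assumptions
best_iou = np.array([max(iou(a, g) for g in gt_boxes) for a in anchors])
labels = -np.ones(len(anchors), dtype=int)       # -1 = ignored in the loss

order = np.argsort(-best_iou)                    # largest overlap first
pos = [i for i in order if best_iou[i] > pos_thresh][:F]
neg = [i for i in order if best_iou[i] < neg_thresh][:B]
labels[pos], labels[neg] = 1, 0

# S735: only anchors labelled 0 or 1 enter the lrpn_cls / lrpn_bbox losses
keep = labels != -1
print(int(keep.sum()), int((labels == 1).sum()), int((labels == 0).sum()))
```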
s736, screening and refining the first suggestion candidate box set, i.e., inputting the first suggestion candidate box set into a suggestion-target module, specifically operating as follows:
s7361, traversing the frame coordinates of all the labels for any frame coordinate in the first suggested candidate frame set, selecting the frame coordinate with the largest overlapping rate as a corresponding label frame, if the overlapping rate of the label frame and the candidate frame is greater than a threshold value, considering the candidate frame as a foreground, and if the overlapping rate of the label frame and the candidate frame is less than the threshold value, considering the candidate frame as a background;
s7362, setting a fixed number of foreground and background for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as a second candidate frame set;
s7363, calculating the offset of the coordinates of the second candidate frame and the coordinates of the corresponding label frame, and taking the offset and the second candidate frame set as the output of the module;
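A comparable sketch of the suggestion-target module (steps S7361 to S7363), where the overlap threshold, the foreground/background counts and the plain coordinate-difference offsets are illustrative assumptions:

```python
import numpy as np

def iou(a, b):
    """Overlap rate of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

rng = np.random.default_rng(1)
proposals = rng.random((64, 4)) * 200
proposals[:, 2:] += proposals[:, :2]             # first suggestion candidate frames
gt_boxes = np.array([[30, 30, 120, 110], [150, 40, 220, 160]], dtype=float)

fg_thresh, n_fg, n_bg = 0.5, 16, 48
overlaps = np.array([[iou(p, g) for g in gt_boxes] for p in proposals])
best_gt = overlaps.argmax(axis=1)                # S7361: label frame with largest overlap
best_iou = overlaps.max(axis=1)

fg_idx = np.where(best_iou >= fg_thresh)[0][:n_fg]   # foreground candidates
bg_idx = np.where(best_iou < fg_thresh)[0][:n_bg]    # background candidates
keep = np.concatenate([fg_idx, bg_idx])          # S7362: second candidate frame set

second_boxes = proposals[keep]
offsets = gt_boxes[best_gt[keep]] - second_boxes # S7363: offset to the matched label frame
print(second_boxes.shape, offsets.shape)
```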
inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method, which outputs final interest features of consistent size through a pooling operation; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the predicted category and predicted frame coordinates within each candidate frame corresponding to the final interest features, and generating two error values: the first error is the error between the predicted category and the labeled category, and the second error is the error between the coordinates of the predicted frame and the labeled coordinates;
and S8, looping steps S1 to S7, summing all errors, performing a back propagation algorithm, and iteratively updating each weight parameter in the network.
Details of other module implementations
1) Inputting the second candidate box set into a pooling module, and outputting final interest features of uniform size using the ROI pooling method of Faster R-CNN;
2) using the RCNN module method of Faster R-CNN, inputting the final interest features into a cascaded convolutional network to output the coordinate prediction values of the second candidate frames, and inputting the final interest features into the cascaded convolutional network to output the category prediction values of the final interest features; the loss rcnn_bbox between the coordinate prediction values and the label values and the loss rcnn_cls between the category prediction values and the label values are calculated (see the sketch after this list).
3) The total loss is calculated as the sum of the above losses:

L = loss_cls + loss_loc + lrpn_cls + lrpn_bbox + rcnn_cls + rcnn_bbox

According to the total loss formula, the invention uses an end-to-end method and adjusts the weight matrices in parallel according to the total loss L using the supervised SGD training algorithm; the weight matrices include those of all supervised modules except the convolution depth activation module.
4) If in the testing stage, the coordinate prediction value in 2) is output as the detection result of the frame coordinate, and the category prediction value in 2) is output as the detection result of the frame category.
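The sketch below corresponds to items 1) and 2) of this list: ROI pooling of the second candidate frames followed by an RCNN-style head producing the rcnn_cls and rcnn_bbox losses. The pooled size, head layout, spatial scale and dummy labels are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

num_classes, pool = 21, 7
base_map = torch.randn(1, 128, 38, 50)                    # basic feature map
boxes = torch.tensor([[0, 10., 12., 80., 70.],            # (batch_idx, x1, y1, x2, y2)
                      [0, 100., 40., 180., 120.]])
roi_feat = roi_pool(base_map, boxes, output_size=(pool, pool), spatial_scale=1 / 16)

head = nn.Sequential(nn.Flatten(), nn.Linear(128 * pool * pool, 256), nn.ReLU())
cls_layer = nn.Linear(256, num_classes)                   # category prediction branch
bbox_layer = nn.Linear(256, 4)                            # coordinate prediction branch

x = head(roi_feat)
cls_pred, bbox_pred = cls_layer(x), bbox_layer(x)

labels = torch.tensor([3, 0])                             # dummy category labels
targets = torch.tensor([[12., 14., 78., 69.], [0., 0., 0., 0.]])
rcnn_cls = nn.functional.cross_entropy(cls_pred, labels)
rcnn_bbox = nn.functional.smooth_l1_loss(bbox_pred, targets)
print(float(rcnn_cls), float(rcnn_bbox))
```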
This scheme can be realized as an independent, complete technical solution in the form of a computer product: a medium storing program code serves as the basic hardware for realizing the scheme, a real-time camera is typically used as the external device receiving high-resolution images, a GTX 1080 Ti is used as the image computing device, and terminal platforms such as personal computers and tablets are used as the output devices for the prediction results.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (5)
1. A multi-target detection method is characterized by comprising the following steps:
s1, acquiring a real-time image from the camera, extracting a basic feature map from the real-time image, and inputting the real-time image into a context integration network to obtain a context feature map;
s2, inputting the real-time image into a convolution activation network, and capturing a fuzzy activation area in the real-time image; creating a corresponding positioning cache pool for each real-time image, storing coordinate information of all fuzzy activation areas of the real-time image, and taking the coordinate information of the fuzzy activation areas as a first batch of positioning information;
s3, setting the cycle number n to 1;
s4, taking the coordinate pair of the nth batch of positioning information as a center, acquiring a local feature matrix of a fixed area near the center position on the basic feature map, and filtering and pooling the local feature matrix to obtain a focusing feature;
s5, inputting the focusing feature and the context feature into a double-layer cyclic convolution emission module, which outputs two predicted values: the first predicted value is the confidence that the feature belongs to the corresponding category; the second predicted value is the (n+1)th batch of positioning information. This batch of positioning information is written into the positioning cache pool, which is maintained globally: after each pass of the loop body, the positioning information output by that pass is injected into the positioning cache pool;
s6, setting n = n + 1 and returning to step S4 until the preset number of cycles is reached; finally, all positioning information is output, together with two error values: the first error value is the error between the predicted category near each position and the labeled category, and the second error value is the error between the positioning information and the coordinates of the labeled target frame;
s7, inputting all positioning information in the positioning cache pool into the area suggestion network, which first outputs a fixed number of first suggestion candidate boxes and two error values: the first error value is the error between the coordinates of the first suggestion candidate boxes and the real coordinates of the target frame, and the second error value is the error between the predicted category of the first suggestion candidate boxes and the real category of the target frame;
inputting the first suggestion candidate boxes into a suggestion-target module, which screens and refines the first suggestion candidate box set and outputs the second suggestion candidate boxes, the corresponding category label of each second suggestion candidate box, and the offset between the coordinates of each second suggestion candidate box and the corresponding label coordinates;
inputting the second suggestion candidate boxes into the ROI pooling module of the Faster R-CNN method, which outputs final interest features of consistent size through a pooling operation; inputting the final interest features into the RCNN module of the Faster R-CNN method to obtain the predicted category and predicted frame coordinates within each candidate frame corresponding to the final interest features, and generating two error values: the first error is the error between the predicted category and the labeled category, and the second error is the error between the coordinates of the predicted frame and the labeled coordinates;
and S8, looping steps S1 to S7, summing all errors obtained from S1 to S7, performing a back propagation algorithm, and iteratively updating each item of weight parameter in the network.
2. The multi-target detection method of claim 1, wherein the context integration network is formed by stacking basic convolution operation units, and a single convolution operation unit is expressed as:
x_j^l = f( Σ_{m ∈ M_j} x_m^{l-1} * k_{mj}^l + b_j^l )
wherein l denotes the l-th convolution layer; j denotes the j-th feature map of the current convolution layer; x_m^{l-1} denotes the m-th feature map of the (l-1)-th convolution layer; k_{mj}^l denotes the m-th convolution kernel of the j-th feature map of the l-th convolution layer; M_j denotes the set of all convolution kernels corresponding to the j-th feature map; the symbol * denotes the convolution operation; b_j^l denotes the bias vector parameter of the j-th feature map of the l-th convolution layer; f(·) denotes the activation function.
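For illustration, a minimal PyTorch sketch of one such basic convolution operation unit follows: a 2-D convolution, which realises the sum over input feature maps plus the bias, followed by an activation f(·). The ReLU choice and the channel and kernel sizes are assumptions, not specified by the claim.

```python
import torch.nn as nn

class BasicConvUnit(nn.Module):
    """One basic convolution operation unit: x_j^l = f(conv(x^{l-1}) + b_j^l)."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Conv2d sums the convolutions over all input feature maps and adds the bias.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=True)
        self.act = nn.ReLU(inplace=True)  # f(.), the activation function

    def forward(self, x):
        return self.act(self.conv(x))
```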
3. The multi-target detection method according to claim 1, wherein the step S2 includes the following sub-steps:
S21, inputting the original image into stacked basic convolution operation units, wherein two basic convolution operation units and one basic pooling unit form one convolution block unit, five convolution block units with the same structure are cascaded, and the feature map of the original image is output after the cascade;
S22, inputting the feature map into a GAP (global average pooling) layer and outputting a one-dimensional vector, wherein each element of the one-dimensional vector is the average value of the feature matrix of one channel of the feature map; computing the weighted sum of all values in the one-dimensional vector and passing it through an activation function layer to obtain the class probabilities;
S23, carrying out weighted summation on the output features of the last convolution block and obtaining a class-based activation map, with the formula:
M_c(x, y) = Σ_k w_k^c · f_k(x, y)
wherein f_k(x, y) represents the activation value of the k-th unit of the last convolution block's output features at coordinates (x, y); w_k^c represents the weight of unit k for class c, i.e., the importance of unit k for class c;
S24, scaling the class-based activation map obtained in the above steps to the same size as the original image, comparing the correlation between each activation region and the class, and computing the coordinate point with the highest local correlation:
c_i = max( g(x_0, y_0), g(x_1, y_1), ..., g(x_N, y_N) )
wherein g(·) represents the pixel value at the given location, (x_i, y_i) is a coordinate point within the i-th local activation region, and c_i represents the correlation in the i-th local activation region;
the obtained coordinate points with the highest local correlation are output as the coordinate point set of the first batch of positioning information.
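A minimal sketch of the class-activation-map style localisation of S22-S24 follows, assuming the last convolution block's output `feat` (shape [K, H, W]) and a linear classifier weight matrix `fc_weight` (shape [num_classes, K]) are already available. The top-k selection is a crude stand-in for picking the highest-correlation coordinate in each local activation region, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def first_batch_locations(feat, fc_weight, class_idx, image_size, topk=5):
    # feat: [K, H, W] last-block feature maps f_k(x, y)
    w_c = fc_weight[class_idx]                      # weights w_k^c for class c
    cam = torch.einsum('k,khw->hw', w_c, feat)      # M_c(x,y) = sum_k w_k^c f_k(x,y)
    # S24: scale the class-based activation map to the original image size.
    cam = F.interpolate(cam[None, None], size=image_size,
                        mode='bilinear', align_corners=False)[0, 0]
    # Take the coordinates with the highest activation as the first batch
    # of positioning information.
    flat_idx = torch.topk(cam.flatten(), k=topk).indices
    ys = torch.div(flat_idx, cam.shape[1], rounding_mode='floor')
    xs = flat_idx % cam.shape[1]
    return list(zip(xs.tolist(), ys.tolist()))
```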
4. The multi-target detection method according to claim 1, wherein the double-layer cyclic convolution emission module takes the basic feature map, the category label values, the target frame label values, the first batch of positioning information and the context feature as input, continuously explores the optimal positioning through the double-layer cyclic emission network under the optimization of the back propagation algorithm, and finally outputs a fixed amount of positioning information; the specific steps are as follows:
S51, taking each image as a processing unit, for the t-th batch of positioning information L_t = ((x_0, y_0), (x_1, y_1), ..., (x_m, y_m)), extracting the high-dimensional vectors within a fixed 2×2 range at the corresponding positions of the basic feature map, and processing them through vector operations into a positioning feature tensor P_t of fixed dimension;
S52, inputting the positioning feature tensor into a convolution layer, a regularization (batch normalization) layer and an activation function layer, and outputting an activated positioning feature tensor, with the formula:
P_t_active = ReLU(BN(Conv2d(P_t)))
wherein ReLU(x) = x when x > 0, and ReLU(x) = 0 when x ≤ 0;
the positioning information L_t is likewise input into a convolution layer, a regularization layer and an activation function layer, and an activated positioning information tensor is output, with the formula:
L_t_active = ReLU(BN(Conv2d(L_t)))
the two activated tensors are multiplied to obtain the focusing feature tensor, with the formula:
g_t = P_t_active ⊗ L_t_active
wherein ⊗ denotes the tensor multiplication;
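A minimal sketch of the two activation branches and their combination in S52 follows, assuming both P_t and L_t are given as 4-D tensors of matching spatial size; the channel sizes, the 1×1 kernels and the element-wise interpretation of the tensor multiplication are assumptions.

```python
import torch
import torch.nn as nn

class FocusFeature(nn.Module):
    def __init__(self, feat_ch, loc_ch, out_ch):
        super().__init__()
        self.feat_branch = nn.Sequential(
            nn.Conv2d(feat_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU())
        self.loc_branch = nn.Sequential(
            nn.Conv2d(loc_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, P_t, L_t):
        P_act = self.feat_branch(P_t)   # P_t_active = ReLU(BN(Conv2d(P_t)))
        L_act = self.loc_branch(L_t)    # L_t_active = ReLU(BN(Conv2d(L_t)))
        return P_act * L_act            # element-wise product as the focusing feature
```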
S53, in the loop operation, one loop unit corresponds to one time step; the specific implementation is as follows:
S531, if it is the first time step of the loop, initializing the hidden state of the first-layer convolutional LSTM structure with a zero vector; otherwise, inputting the focusing feature tensor and the hidden state at the previous time step into the first-layer convolutional LSTM encoder e(·) and outputting the new hidden state of the encoder, with the formula:
h_t^e, c_t^e = e(g_t, h_{t-1}^e, c_{t-1}^e)
wherein h_t^e represents the new hidden state output by the encoder e at time t; c_t^e represents the new cell state output by the encoder e at time t, which is derived from the step defined in the existing LSTM network structure and stores the hidden information that remains valid in long-term memory; g_t represents the focusing feature tensor at time t;
the hidden state h_t^e of the convolutional LSTM encoder is input into a cascaded convolution network ec(·) and a linear classifier, and the classification probability of the focusing region is output:
V = W_2 · (W_1 · ec(h_t^e) + b_1) + b_2
prob_i = exp(V_i) / Σ_{j=1}^{C} exp(V_j)
wherein V represents the output of the classifier; W_1 denotes the first weight parameter and W_2 the second weight parameter; b_1 denotes the first bias parameter and b_2 the second bias parameter; prob_i is the probability of a certain class, V_i is the output of the i-th unit at the front stage of the classifier, and C represents the total number of categories;
this completes the definition of the output of the first layer;
S532, in the second-layer convolutional LSTM decoder d(·), if it is the initial time step of the loop, the decoder takes the context feature map as its initialization value; otherwise, the decoder takes h_t^e and the hidden state of this layer at the previous time step as input and outputs the new hidden state of the decoder, with the formula:
h_t^d, c_t^d = d(h_t^e, h_{t-1}^d, c_{t-1}^d)
the hidden state h_t^d of the convolutional LSTM decoder is input into a linear regressor el(·), which outputs the two-dimensional coordinates of the attention position at the next time step, and the coordinates are stored in the positioning cache pool, with the formula:
l_{t+1} = el(h_t^d)
this completes the definition of the output of the second layer;
S533, within the current time step, the double-layer network combines the hidden information of past time steps with the current information: the local classification error of the image is calculated with the cross-entropy method, and the positioning error for the next time step is calculated with the mean square error method, with the formulas:
loss_cls = -log(prob_gt)
loss_loc = (1/m) Σ_i (y_i - ŷ_i)^2
wherein gt represents the label category, y_i represents the annotated coordinates, and ŷ_i represents the corresponding predicted output value; when computing the loss functions, the loss of each image is summed over all time steps, and the respective averages are taken as the final losses;
S534, looping steps S531 to S533, and taking the final losses and the positioning information obtained at all time steps as the output of the double-layer cyclic convolution emission module.
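A minimal sketch of the two-layer recurrent emission loop of S53 follows, with nn.LSTMCell standing in for the convolutional LSTM cells of the claim (a deliberate simplification), the context feature assumed to be pooled to a vector of the hidden dimension, and the extraction of the focusing feature abstracted into a callable; all dimensions, head shapes and names are illustrative assumptions. The cross-entropy and mean-square-error losses of S533 would be computed from `class_logits` and `locations` outside this module.

```python
import torch
import torch.nn as nn

class EmissionLoop(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_classes, steps=4):
        super().__init__()
        self.steps = steps
        self.encoder = nn.LSTMCell(feat_dim, hidden_dim)      # first layer e(.)
        self.decoder = nn.LSTMCell(hidden_dim, hidden_dim)    # second layer d(.)
        self.classifier = nn.Linear(hidden_dim, num_classes)  # ec(.) + linear head
        self.regressor = nn.Linear(hidden_dim, 2)             # el(.), next (x, y)

    def forward(self, get_focus_feature, context_vec):
        # get_focus_feature(loc) returns the focusing feature g_t for a location.
        h_e = c_e = torch.zeros_like(context_vec)              # S531: zero init
        h_d, c_d = context_vec, torch.zeros_like(context_vec)  # S532: context init
        loc = self.regressor(h_d)                              # initial attention point
        class_logits, locations = [], []
        for _ in range(self.steps):
            g_t = get_focus_feature(loc)               # focusing feature at time t
            h_e, c_e = self.encoder(g_t, (h_e, c_e))   # first-layer (encoder) update
            class_logits.append(self.classifier(h_e))  # local class prediction
            h_d, c_d = self.decoder(h_e, (h_d, c_d))   # second-layer (decoder) update
            loc = self.regressor(h_d)                  # next-step location
            locations.append(loc)
        return class_logits, locations
```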
5. The multi-target detection method according to claim 1, wherein the region suggestion network takes the basic feature map, the target frame label values and the positioning information in the positioning cache pool as input, improves the RPN method according to the positioning information (abbreviated LRPN), and then outputs the coordinates of a fixed number of suggestion candidate frames and the intra-frame prediction results, together with two loss functions;
the specific implementation is as follows:
S71, inputting the basic feature map into the convolution network and the activation network, and outputting an activated feature map;
S72, introducing the anchor frame rule and setting A anchor frames for each spatial position on the activated feature map; performing a convolution operation with stride 1 and a 1×1 convolution kernel on the activated feature map and outputting a score tensor with 2×A channels, wherein the channels of the tensor represent the class prediction probability scores of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map; performing a convolution operation with stride 1 and a 1×1 convolution kernel on the LRPN activated feature map and outputting a coordinate offset tensor with 4×A channels, wherein the channels of the tensor represent the predicted coordinate offsets of the A fixed-size anchor frames corresponding to each spatial position on the LRPN activated feature map, used for solving the optimal predicted coordinates;
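A minimal sketch of the two 1×1 convolution heads of S72 follows; the input channel width and the number of anchors A are assumptions.

```python
import torch.nn as nn

class LRPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # 1x1 convolutions with stride 1 over the activated feature map.
        self.cls_head = nn.Conv2d(in_channels, 2 * num_anchors, 1, stride=1)
        self.reg_head = nn.Conv2d(in_channels, 4 * num_anchors, 1, stride=1)

    def forward(self, activated_feat):
        scores = self.cls_head(activated_feat)   # 2*A class-prediction scores
        offsets = self.reg_head(activated_feat)  # 4*A predicted coordinate offsets
        return scores, offsets
```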
S73, inputting the score tensor, the coordinate offset tensor and the positioning information into the LRPN suggestion module, specifically comprising the following steps:
S731, screening out the effective anchor frames in the tensor corresponding to the positioning information, and removing the effective anchor frames that exceed the image boundary;
S732, sorting the anchor frames and the corresponding score tensors by score and taking the top N, where N is a hyper-parameter;
S733, screening the score tensors with the non-maximum suppression method, and taking the top M of the remaining score tensors, sorted by score, as the first suggestion candidate frame set;
S734, setting the label of every anchor frame that does not meet the following conditions to -1:
a. an anchor frame corresponding to the positioning information;
b. anchor frames that do not exceed the image boundaries;
modifying each anchor frame based on the predicted coordinate offset, comparing the modified anchor frame coordinates with the labeled target frames in the image, selecting the F anchor frames whose largest overlap rate is greater than a positive threshold and setting their labels to 1; selecting the B anchor frames whose largest overlap rate is smaller than a negative threshold and setting their labels to 0; wherein the F and B values are set according to the hyper-parameters in the Faster RCNN method;
S735, removing the anchor frames labeled -1, and computing the loss functions over the remaining anchor frames to obtain the loss lrpn_cls between the anchor frame category predictions and the category labels and the loss lrpn_bbox between the anchor frame coordinate predictions and the coordinate labels, which are output together with the first suggestion candidate frames;
s736, screening and refining the first suggestion candidate box set, i.e., inputting the first suggestion candidate box set into a suggestion-target module, specifically operating as follows:
S7361, for each frame coordinate in the first suggestion candidate frame set, traversing the frame coordinates of all the labels and selecting the one with the largest overlap rate as the corresponding label frame; if the overlap rate between the label frame and the candidate frame is greater than a threshold value, the candidate frame is regarded as foreground, and if the overlap rate is smaller than the threshold value, the candidate frame is regarded as background;
s7362, setting a fixed number of foreground and background for each training period, sampling from the candidate frames to meet the fixed number requirement, and taking the sampled candidate frame set as a second candidate frame set;
s7363, calculating the offset between the coordinates of the second candidate frame and the coordinates of the corresponding label frame, and using the offset and the second candidate frame set as the output of the module.
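A minimal sketch of the suggestion-target assignment of S7361-S7363 follows, using torchvision's box_iou for the overlap rate; the fixed foreground/background sampling of S7362 is omitted, the raw coordinate difference stands in for the offset encoding, and the threshold value is an assumption.

```python
import torch
from torchvision.ops import box_iou

def proposal_targets(candidates, gt_boxes, gt_labels, fg_thresh=0.5):
    # candidates: [N, 4], gt_boxes: [M, 4] in (x1, y1, x2, y2) format
    ious = box_iou(candidates, gt_boxes)             # [N, M] overlap rates
    best_iou, best_gt = ious.max(dim=1)              # S7361: best label frame per candidate
    is_fg = best_iou >= fg_thresh                    # foreground vs. background
    labels = torch.where(is_fg, gt_labels[best_gt], torch.zeros_like(best_gt))
    # S7363: offset between candidate coordinates and matched label coordinates
    offsets = gt_boxes[best_gt] - candidates
    return candidates, labels, offsets
```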
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910881579.XA CN110610210B (en) | 2019-09-18 | 2019-09-18 | Multi-target detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110610210A (en) | 2019-12-24 |
CN110610210B (en) | 2022-03-25 |
Family
ID=68891598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910881579.XA Active CN110610210B (en) | 2019-09-18 | 2019-09-18 | Multi-target detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110610210B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583204B (en) * | 2020-04-27 | 2022-10-14 | 天津大学 | Organ positioning method of two-dimensional sequence magnetic resonance image based on network model |
CN111723852B (en) * | 2020-05-30 | 2022-07-22 | 杭州迪英加科技有限公司 | Robust training method for target detection network |
CN111986126B (en) * | 2020-07-17 | 2022-05-24 | 浙江工业大学 | Multi-target detection method based on improved VGG16 network |
CN113065650B (en) * | 2021-04-02 | 2023-11-17 | 中山大学 | Multichannel neural network instance separation method based on long-term memory learning |
CN113298094B (en) * | 2021-06-10 | 2022-11-04 | 安徽大学 | RGB-T significance target detection method based on modal association and double-perception decoder |
CN113822172B (en) * | 2021-08-30 | 2024-06-14 | 中国科学院上海微系统与信息技术研究所 | Video space-time behavior detection method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10719743B2 (en) * | 2018-01-19 | 2020-07-21 | Arcus Holding A/S | License plate reader using optical character recognition on plural detected regions |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250812A (en) * | 2016-07-15 | 2016-12-21 | 汤平 | A kind of model recognizing method based on quick R CNN deep neural network |
US10304208B1 (en) * | 2018-02-12 | 2019-05-28 | Avodah Labs, Inc. | Automated gesture identification using neural networks |
CN108717693A (en) * | 2018-04-24 | 2018-10-30 | 浙江工业大学 | A kind of optic disk localization method based on RPN |
CN108898145A (en) * | 2018-06-15 | 2018-11-27 | 西南交通大学 | A kind of image well-marked target detection method of combination deep learning |
CN109359684A (en) * | 2018-10-17 | 2019-02-19 | 苏州大学 | Fine granularity model recognizing method based on Weakly supervised positioning and subclass similarity measurement |
CN109523015A (en) * | 2018-11-09 | 2019-03-26 | 上海海事大学 | Image processing method in a kind of neural network |
CN109344815A (en) * | 2018-12-13 | 2019-02-15 | 深源恒际科技有限公司 | A kind of file and picture classification method |
CN109961034A (en) * | 2019-03-18 | 2019-07-02 | 西安电子科技大学 | Video object detection method based on convolution gating cycle neural unit |
CN110097136A (en) * | 2019-05-09 | 2019-08-06 | 杭州筑象数字科技有限公司 | Image classification method neural network based |
Non-Patent Citations (2)
Title |
---|
Faster R-CNN: towards real-time object detection with region proposal networks; Ren Shaoqing et al.; Proc of Advances in Neural Information Processing Systems; 2015-12-30; full text *
Survey of object detection based on convolutional neural networks; Li Xudong et al.; Application Research of Computers; 2017-01-13; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110610210B (en) | Multi-target detection method | |
CN111709311B (en) | Pedestrian re-identification method based on multi-scale convolution feature fusion | |
CN111898432B (en) | Pedestrian detection system and method based on improved YOLOv3 algorithm | |
Nandhini et al. | Object Detection Algorithm Based on Multi-Scaled Convolutional Neural Networks | |
CN108764019A (en) | A kind of Video Events detection method based on multi-source deep learning | |
CN115829991A (en) | Steel surface defect detection method based on improved YOLOv5s | |
Lin et al. | An antagonistic training algorithm for TFT-LCD module mura defect detection | |
CN112149665A (en) | High-performance multi-scale target detection method based on deep learning | |
Kiruba et al. | Hexagonal volume local binary pattern (H-VLBP) with deep stacked autoencoder for human action recognition | |
CN112149664A (en) | Target detection method for optimizing classification and positioning tasks | |
CN105976397A (en) | Target tracking method based on half nonnegative optimization integration learning | |
CN111738074B (en) | Pedestrian attribute identification method, system and device based on weak supervision learning | |
Wen et al. | A Lightweight ST-YOLO Based Model for Detection of Tea Bud in Unstructured Natural Environments. | |
Hu et al. | Automatic detection of pecan fruits based on Faster RCNN with FPN in orchard | |
Li et al. | Research on YOLOv3 pedestrian detection algorithm based on channel attention mechanism | |
Lu et al. | ODL Net: Object detection and location network for small pears around the thinning period | |
Chang et al. | Deep Learning Approaches for Dynamic Object Understanding and Defect Detection | |
Yuan et al. | Research approach of hand gesture recognition based on improved YOLOV3 network and Bayes classifier | |
Wu et al. | Real-time visual tracking via incremental covariance model update on Log-Euclidean Riemannian manifold | |
Xu et al. | A Lightweight Pig Face Recognition Method Based on Efficient Mobile Network and Horizontal Vertical Attention Mechanism | |
Guang et al. | Application of Neural Network-based Intelligent Refereeing Technology in Volleyball | |
Hendi et al. | Automated Video Events Detection and Classification using CNN-GRU Model | |
Ma et al. | Har enhanced weakly-supervised semantic segmentation coupled with adversarial learning | |
Rao et al. | Markov random field classification technique for plant leaf disease detection | |
Peng | Research on YOLOv4 Object Detection Based on K-means Algorithm and Fusion Attention Mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||