
CN112613540A - Target detection method, device and electronic system

Info

Publication number
CN112613540A
CN112613540A CN202011461531.2A
Authority
CN
China
Prior art keywords
frame
head
face
target
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011461531.2A
Other languages
Chinese (zh)
Inventor
商明阳
杜昂昂
王志成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202011461531.2A
Publication of CN112613540A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method, a target detection device and an electronic system, relating to the technical field of image processing. The method comprises: acquiring a feature map of an image to be detected; performing feature mapping on the feature map through a pre-trained preliminary binding detection model to obtain a face frame, a head frame and a human body frame; and binding the corresponding face frame, head frame and human body frame to obtain a face-head binding group. By the method and device, a relatively accurate face-head-person binding relationship can be obtained, and the practicability of the binding relationship is improved.

Description

Target detection method, device and electronic system
Technical Field
The invention relates to the technical field of image processing, in particular to a target detection method, a target detection device and an electronic system.
Background
In recent years, with the rapid development of neural-network-based artificial intelligence, pedestrian recognition technology has been widely applied in traffic, security and other fields. Existing pedestrian recognition technology generally detects and recognizes a pedestrian in the following way: a detection model detects the face frame and human body frame of the pedestrian to be recognized; the Intersection over Union (IoU) of the face frames and human body frames is then calculated one by one; a matching relationship between the face frame and the human body frame of the same pedestrian is established through thresholding and bipartite matching; and the pedestrian is recognized according to the matching relationship, distinguishing that pedestrian from other pedestrians. However, this pedestrian recognition method is only suitable for simple scenes. When a face or a human body is occluded, a matching error between the face frame and the human body frame is likely to occur, that is, the face frame of pedestrian A is matched with the human body frame of pedestrian B, so the accuracy of pedestrian recognition is low.
Disclosure of Invention
In view of this, the present invention provides a target detection method, device and electronic system to improve the accuracy of binding a face frame and a body frame.
In a first aspect, an embodiment of the present invention provides a target detection method, which is applied to an electronic device, and includes: acquiring a feature map of an image to be detected; performing human body, head and face feature mapping on the feature map through a pre-trained preliminary binding detection model to obtain classification information, where the classification information comprises head branch information, and the head branch information comprises a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame and a first human body frame; taking the face frames with a classification probability higher than a first threshold as a candidate face frame group, and performing deduplication processing on the candidate face frame group to obtain a target face frame group; performing deduplication processing on the head frames with a classification probability higher than a second threshold to obtain a target head frame group; and binding the target face frame and target head frame that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frame corresponding to the target head frame, to obtain a face-head binding group.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the classification information further includes human body branch information, and the human body branch information includes a second human body frame and a classification probability corresponding to the second human body frame; the method further comprises: performing IoU calculation on a first human body frame in the head branch information and a second human body frame in the human body branch information; deleting the second human body frame if the IoU of the first human body frame and the second human body frame is greater than or equal to a third threshold; and if the IoU of the second human body frame and each first human body frame is smaller than the third threshold, configuring default head frame information and default face frame information for the second human body frame to form a face-head binding group.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the step of performing deduplication processing on the header frames with classification probabilities higher than a second threshold to obtain a target header frame group includes: respectively adding preset values to the classification probabilities of the head frames corresponding to the target face frames in the target face frame group; wherein the preset value is greater than 0; taking the head frames with the classification probability higher than a second threshold value as candidate head frame groups; based on the principle of preferentially reserving the head frames with high classification probability values, carrying out duplicate removal processing on the candidate head frame groups to obtain target head frame groups; after the step of obtaining a face-to-head binding group, the method further comprises: and subtracting the preset value from the classification probability of the face-head binding group added with the preset value.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the step of performing deduplication processing on the candidate head frame group based on the principle of preferentially retaining head frames with high classification probability values to obtain the target head frame group includes: calculating the IoU of a first candidate head frame and a second candidate head frame in the candidate head frame group; and deleting the second candidate head frame if the IoU of the first candidate head frame and the second candidate head frame is greater than a fourth threshold and the classification probability of the first candidate head frame is greater than the classification probability of the second candidate head frame.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the step of binding the target face frame and the target head frame that have a corresponding relationship in the target face frame group and the target head frame group, together with the corresponding first human body frame, to obtain a face-head binding group includes: forming a face-head binding pair from the target face frame and the target head frame that have a corresponding relationship in the target face frame group and the target head frame group; configuring a face frame with default information for each target head frame in the target head frame group to which no target face frame is bound, to obtain the face-head binding pair corresponding to that target head frame; and adding the first human body frame corresponding to each target head frame in the target head frame group to the face-head binding pair corresponding to that target head frame to obtain a face-head binding group.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the training process of the preliminary binding detection model includes: acquiring an image sample set, where the annotation data of the image samples in the image sample set comprises a face ground-truth frame, a head ground-truth frame, a human body ground-truth frame and the binding relationship between the ground-truth frames corresponding to the same person; based on the annotation data, assigning sample labels to the anchor point prediction frames corresponding to the pixel points in a sample feature map of the image sample, where the sample labels comprise head positive samples, head negative samples, face positive samples, face negative samples, human body positive samples and human body negative samples; and training a preliminary binding detection initial model based on the sample labels assigned to the anchor point prediction frames corresponding to the image samples in the image sample set and the binding relationship, until training is finished, to obtain the preliminary binding detection model.
With reference to the fifth possible implementation manner of the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where, based on the annotation data, the step of assigning sample labels to the anchor point prediction frames corresponding to the pixel points in a sample feature map of the image sample includes: for the anchor point prediction frames corresponding to each pixel point in each sample feature map of the image sample, calculating a first IoU of the anchor point prediction frame and the face ground-truth frame in the annotation data and a second IoU of the anchor point prediction frame and the head ground-truth frame in the annotation data; assigning face positive samples and face negative samples to the anchor point prediction frames according to the first IoU, assigning head positive samples and head negative samples to the anchor point prediction frames according to the second IoU, and optimizing the assigned sample labels in the following manner: determining a first type of anchor point prediction frame and a second type of anchor point prediction frame from the anchor point prediction frames, where the first type is an anchor point prediction frame that is not assigned a head positive sample but is assigned a face positive sample, and the second type is an anchor point prediction frame that is assigned a head positive sample; assigning a head positive sample to the first type of anchor point prediction frame, and assigning a human body positive sample to both the first type and the second type of anchor point prediction frame; checking whether the head frame of the second type of anchor point prediction frame has a corresponding face frame, and if so, assigning a face positive sample to the second type of anchor point prediction frame, otherwise assigning a face negative sample to it; and after the optimization of the first type and second type of anchor point prediction frames is completed, assigning a face negative sample and a human body negative sample to a third type of anchor point prediction frame that is assigned a head negative sample.
In a second aspect, an embodiment of the present invention further provides a target detection apparatus, which is applied to an electronic device, and includes: a feature map acquisition module, configured to acquire a feature map of an image to be detected; a preliminary binding module, configured to perform human body, head and face feature mapping on the feature map through a pre-trained preliminary binding detection model to obtain classification information, where the classification information comprises head branch information, and the head branch information comprises a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame and a first human body frame; a first post-processing module, configured to take the face frames with a classification probability higher than a first threshold as a candidate face frame group and perform deduplication processing on the candidate face frame group to obtain a target face frame group; a second post-processing module, configured to perform deduplication processing on the head frames with a classification probability higher than a second threshold to obtain a target head frame group; and a binding relationship determining module, configured to bind the target face frame and target head frame that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frame corresponding to the target head frame, to obtain a face-head binding group.
In a third aspect, an embodiment of the present invention further provides an electronic system, including: image acquisition equipment, processing apparatus and storage device. The image acquisition equipment is used for acquiring an image to be identified; the storage means has stored thereon a computer program which, when executed by the processing apparatus, performs the object detection method as defined in any one of the preceding embodiments.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processing device to perform the steps of the object detection method according to any one of the foregoing embodiments.
The embodiment of the invention has the following beneficial effects:
According to the target detection method, device and electronic system provided by the embodiments of the invention, human body, head and face feature mapping is performed on the feature map of an image to be detected through a pre-trained preliminary binding detection model to obtain head branch information, which comprises face frames, the classification probabilities of the face frames, head frames, the classification probabilities of the head frames and first human body frames. The face frames with a classification probability higher than a first threshold are taken as a candidate face frame group and deduplicated to obtain a target face frame group; the head frames with a classification probability higher than a second threshold are deduplicated to obtain a target head frame group; and the target face frames and target head frames that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frames corresponding to the target head frames, are bound to obtain face-head-person binding groups. Acquiring the corresponding human body, head and face frames in this way and then determining the face-head-person binding relationship makes the binding relationship more accurate, and improves the reliability and practicability of the model.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of an electronic system according to an embodiment of the invention;
fig. 2 is a schematic flowchart of a target detection method according to a second embodiment of the present invention;
fig. 3 is a schematic flowchart of a target detection method according to a third embodiment of the present invention;
fig. 4 is a schematic flowchart of a target detection method according to a fourth embodiment of the present invention;
fig. 5 is a schematic flowchart of a target detection method according to a fifth embodiment of the present invention;
fig. 6 is a schematic flowchart of a training method for a preliminary binding detection model according to a sixth embodiment of the present invention;
fig. 7 is a schematic overall architecture diagram of a neural network model according to a sixth embodiment of the present invention;
fig. 8 is a schematic structural diagram of a target detection apparatus according to a seventh embodiment of the present invention;
fig. 9 is a schematic structural diagram of another object detection apparatus according to a seventh embodiment of the present invention;
fig. 10 is a schematic structural diagram of an object detection apparatus according to an eighth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Considering that, in actual scenes, the face often cannot be detected or the detected face is incomplete due to face angle, illumination and the like, the existing pedestrian recognition method suffers from high recognition difficulty and poor recognition accuracy in such scenes. Based on this, embodiments of the present invention provide a target detection method, a target detection device and an electronic system. The technology introduces the head frame and uses the human body frame corresponding to the head frame for subsequent processing, which prevents human body frames from being incorrectly suppressed and improves the accuracy of binding the bounding boxes of the corresponding face, head and human body; the technology can be applied to scenarios such as pedestrian recognition and pedestrian tracking.
To facilitate understanding of the present embodiment, a detailed description will be given to a target detection method disclosed in the present embodiment.
Example one
First, referring to fig. 1, a schematic diagram of an electronic system 100 is shown. The electronic system can be used for realizing the target detection method and device of the embodiment of the invention.
As shown in FIG. 1, an electronic system 100 includes one or more processing devices 102, one or more memory devices 104, an input device 106, an output device 108, and one or more image capture devices 110, which are interconnected via a bus system 112 and/or other type of connection mechanism (not shown). It should be noted that the components and structure of the electronic system 100 shown in fig. 1 are exemplary only, and not limiting, and that the electronic system may have other components and structures as desired.
The processing device 102 may be a server, a smart terminal, or a device containing a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, may process data for other components in the electronic system 100, and may control other components in the electronic system 100 to perform target detection functions.
Storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer-readable storage medium and executed by processing device 102 to implement the client functionality (implemented by the processing device) of the embodiments of the invention described below and/or other desired functionality. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
Image capture device 110 may acquire an image to be identified and store the captured image in storage 104 for use by other components.
For example, the devices used to implement an object detection method, apparatus and electronic system according to an embodiment of the present invention may be integrated or distributed, such as integrating the processing device 102, the storage device 104, the input device 106 and the output device 108, and arranging the image capturing device 110 at a specific position where an image can be captured. When the above-described devices in the electronic system are integrally provided, the electronic system may be implemented as an intelligent terminal such as a camera, a smart phone, a tablet computer, a vehicle-mounted terminal, and the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic system described above may refer to the corresponding process in the foregoing method embodiments, and is not described herein again.
Example two
Referring to fig. 2, a flowchart of a target detection method is shown. The method can be applied to the above electronic system and mainly includes the following steps S202 to S210:
Step S202, acquiring a feature map of the image to be detected.
Specifically, the image may be an image extracted from a video captured by the image acquisition device, or an image obtained from a network or a third party. A target detection network generally comprises a backbone network and a detector. The backbone network is a base model that performs feature extraction on an image sample (mainly including AlexNet, VGG, ResNet, DenseNet, GoogLeNet, MobileNet and the like, selectable according to actual needs). Feature extraction for the specified parts of a pedestrian or human body can be performed on the image to be detected through the backbone network, thereby obtaining one or more feature maps of the image to be detected.
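The following is a minimal sketch of this feature-extraction step, not part of the patent. It assumes PyTorch/torchvision and a ResNet-50 backbone; the patent allows any of the backbones listed above, and the input size and layer cut-off are illustrative.

```python
# Hedged sketch: extract a feature map with a truncated ResNet-50 backbone.
# Assumptions: PyTorch + torchvision are available; any listed backbone works.
import torch
import torchvision

resnet = torchvision.models.resnet50(weights=None)             # untrained here
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop pool/fc

image = torch.randn(1, 3, 512, 512)   # stand-in for the image to be detected
with torch.no_grad():
    feature_map = backbone(image)     # shape (1, 2048, 16, 16) for this input
print(feature_map.shape)
```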
Step S204, performing human body, head and face feature mapping on the feature map through a pre-trained preliminary binding detection model to obtain classification information; the classification information includes head branch information, and the head branch information includes a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame, and a first human body frame.
The preliminary binding detection model is a pre-trained model; its input is the feature map of the image to be detected, and its output is face frames, head frames and human body frames with binding relationships. The training process of the preliminary binding detection model is described in example six below.
For example, the preliminary binding detection model may detect the target objects to be recognized (faces, human bodies and heads) in the feature map in a sliding-window manner, where the center position of each sliding window corresponds to a pixel point on the feature map, that is, an anchor point (anchor). A target frame with a preset aspect ratio and area is generated for each anchor position through the window sliding over the feature map, that is, the anchor box corresponding to each anchor position.
The anchor boxes corresponding to the anchor positions can be input into the feature mapping model, which performs human body, head and face feature mapping on the anchor boxes and classifies the features in each anchor box. In the feature classification process, image regions are selected separately at different positions within an anchor box, yielding a plurality of feature frames; for each feature frame, the probability that it resembles the global features or local features of a real person is calculated, and this probability is the classification probability. According to the classification probability, the class of the feature frame can be determined, that is, whether it is a face frame, a head frame or a human body frame, thereby obtaining the face frame, head frame and human body frame corresponding to the anchor box, together with the associated classification probabilities.
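As a concrete illustration of the anchor mechanism described above, here is a hedged sketch of anchor-box generation with one box per feature-map pixel; the stride, size and aspect ratio are hypothetical values, not taken from the patent.

```python
# Sketch: one anchor box per feature-map pixel, with a preset size and
# aspect ratio (values illustrative). Boxes are (x1, y1, x2, y2) in
# input-image coordinates.
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride=16, size=64, ratio=1.0):
    w = size * np.sqrt(ratio)
    h = size / np.sqrt(ratio)
    ys, xs = np.meshgrid(np.arange(fmap_h), np.arange(fmap_w), indexing="ij")
    cx = (xs.ravel() + 0.5) * stride   # anchor centers mapped back to image
    cy = (ys.ravel() + 0.5) * stride
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

anchors = generate_anchors(16, 16)     # 256 anchors for a 16x16 feature map
print(anchors.shape)                   # (256, 4)
```

In a multi-anchor configuration, the same function would be called once per (size, ratio) pair and the results concatenated.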
The head frame is larger than the face frame and corresponds to the outline of the head; the human body frame is larger than the head frame and corresponds to the outline of the body.
Each pixel point on the feature map may correspond to one or more anchor boxes, and by applying the above processing to each anchor box, the face frame, head frame and human body frame corresponding to each anchor, together with the associated classification probabilities, can be obtained. In order to distinguish it from the human body frame in the subsequent human body branch information, the embodiment of the present invention refers to the human body frame in the head branch information as the first human body frame.
Specifically, in the head branch information, the face frame and the head frame each have their own classification probability, while the first human body frame and the head frame share the same classification probability (i.e., the classification probability of the head frame is directly used as the classification probability of the first human body frame). Therefore, classification probability information of the first human body frame is not given separately in the head branch information.
The face, head and person feature mappings in the above head branch information may all use the same anchor configuration, i.e., the size, scale and aspect ratio of the anchor boxes may all be the same.
Step S206, taking the face frames with a classification probability higher than the first threshold as the candidate face frame group, and performing deduplication processing on the candidate face frame group to obtain the target face frame group.
Specifically, by setting the first threshold, some face frames with lower classification probability can be filtered out, reducing the amount of computation. The deduplication processing of the candidate face frame group may adopt non-maximum suppression (NMS) post-processing; by removing duplicate face frames, it is ensured that the same pedestrian to be recognized does not have duplicate face frames.
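The thresholding plus NMS post-processing of steps S206 and S208 can be sketched as follows. This is a generic greedy NMS, assumed rather than quoted from the patent; the iou helper is reused by the later sketches.

```python
# Sketch of IoU and greedy non-maximum suppression (NMS) over (x1, y1, x2, y2)
# boxes. The threshold values used by callers are illustrative.
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```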
Step S208, performing deduplication processing on the head frames with a classification probability higher than the second threshold to obtain the target head frame group.
This embodiment takes into account the objective fact that where there is a face there is a head, so a target face frame necessarily corresponds to a head frame. Because a target face frame is a face frame retained after thresholding and deduplication, its quality is relatively good, and the quality of the head frame corresponding to it can be expected to be relatively good as well. Moreover, the target face frame and its corresponding head frame are obtained by feature mapping on the same feature map, so the binding relationship between them is relatively accurate. Therefore, this embodiment retains the target face frames and head frames that have a corresponding relationship.
Retaining the target face frame and head frame having the corresponding relationship may specifically be implemented by marking the head frame corresponding to the target face frame with a designated label and not deleting head frames carrying the designated label in subsequent processing; the designated label may be a character label containing a designated string, a label with a designated color-marked frame, or the like.
Alternatively, the target face frame and head frame having the corresponding relationship may be retained by increasing the classification probability of the head frame corresponding to such a target face frame, so that head frames with higher classification probability are preferentially retained in subsequent processing. The specific value added to the classification probability may be greater than or equal to 0, for example any value between 0 and 2, preferably a value greater than 1 and less than 2.
Retaining the target face frames and head frames having the corresponding relationship may also be implemented in other manners, which is not limited in the present invention.
Because the target face frame is selected from the feature map and necessarily corresponds to a certain head frame, all the head frames corresponding to the target face frames can be retained. This avoids the situation where a face frame exists but its head frame is missing, and guarantees the accuracy of the subsequent face-head-person binding.
Step S210, binding the target face frame and target head frame that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frame corresponding to the target head frame, to obtain a face-head binding group.
Since the first human body frame and the head frame share the same classification probability in step S204, after the target head frames are obtained in step S208, all the first human body frames corresponding to the target head frames can be directly retained according to that classification probability. Because a head frame selected from the feature map necessarily corresponds to some human body frame, and the first human body frame corresponding to each target head frame is fully retained, the situation where a head frame exists but its human body frame is missing can be avoided. Then, the target face frame and target head frame that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frame corresponding to the target head frame, can be stored in the same list, so that the list contains the face frame, the unique head frame corresponding to the face frame and the unique human body frame corresponding to the head frame. The face-head-person binding relationship is thus established, and the face-head binding group is obtained.
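A hedged sketch of step S210 follows. It assumes the head branch outputs parallel arrays indexed by anchor, so the face, head and first human body frames at index k already correspond to one another; face_keep and head_keep stand for the index sets retained by steps S206 and S208.

```python
# Sketch of face-head(-body) binding under the stated assumptions.
import numpy as np

def bind_groups(face_boxes, head_boxes, body_boxes, face_keep, head_keep):
    groups = []
    for k in head_keep:
        if k in face_keep:
            face = face_boxes[k]      # target face frame bound to this head
        else:
            face = np.full(4, -1.0)   # default face frame placeholder
        # The first human body frame shares the head frame's index (and its
        # classification probability), so it is retained along with the head.
        groups.append((face, head_boxes[k], body_boxes[k]))
    return groups
```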
According to the above target detection method, human body, head and face feature mapping is performed on the feature map of an image to be detected through a pre-trained preliminary binding detection model to obtain head branch information, which comprises face frames, the classification probabilities of the face frames, head frames, the classification probabilities of the head frames and first human body frames. The face frames with a classification probability higher than a first threshold are taken as a candidate face frame group and deduplicated to obtain a target face frame group; the head frames with a classification probability higher than a second threshold are deduplicated to obtain a target head frame group; and the target face frames and target head frames that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frames corresponding to the target head frames, are bound to obtain face-head-person binding groups. Acquiring the corresponding human body, head and face frames in this way and then determining the face-head-person binding relationship makes the binding relationship more accurate, and improves the reliability and practicability of the model.
Example three
On the basis of the second embodiment, an embodiment of the present invention further provides a target detection method. In order to further improve the accuracy of subsequent target pedestrian recognition, the method of this embodiment extends the classification information beyond the head branch information: the classification information comprises head branch information and human body branch information. The content of the head branch information is the same as described above, namely the face frame, the classification probability of the face frame, the head frame, the classification probability of the head frame and the first human body frame; the human body branch information includes a second human body frame and a classification probability corresponding to the second human body frame, where the first human body frame in the head branch information may be different from or the same as the second human body frame in the human body branch information. Referring to fig. 3, a schematic flowchart of the target detection method is shown, which mainly includes the following steps:
Step S302, acquiring a feature map of the image to be detected.
In this embodiment, the image to be detected may be an image captured by an image acquisition device. The image acquisition device may be arranged in the waiting hall of a passenger station (such as a subway or high-speed rail station) to capture face images or human body images according to detection requirements; it may also be arranged at a traffic intersection or on both sides of a road to capture images of pedestrians according to detection requirements. The image to be detected may also be obtained from a third-party device (such as a cloud server).
The image to be detected is input into the backbone network model, and one or more feature maps obtained through the multi-layer convolution processing of the backbone network can be obtained.
Step S304, performing human body, head and face feature mapping on the feature map through the pre-trained preliminary binding detection model to obtain classification information; the classification information includes head branch information and human body branch information; the head branch information comprises a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame and a first human body frame; the human body branch information includes a second human body frame and a classification probability corresponding to the second human body frame.
Specifically, the process of acquiring the head branch information is the same as in the above embodiment and is not described here again. For the human body branch, the feature map can be input into the feature mapping model, which separately performs image framing of human-body global features on the feature map; only the feature frames with a higher probability of resembling the real human-body global features (i.e., classification probability) are retained, and these feature frames are the second human body frames. The classification probability of each second human body frame is obtained along with the second human body frame itself.
Step S306, taking the face frames with a classification probability higher than the first threshold as the candidate face frame group, and performing deduplication processing on the candidate face frame group to obtain the target face frame group.
Step S308, performing deduplication processing on the head frames with a classification probability higher than the second threshold to obtain the target head frame group.
Step S310, binding the target face frame and target head frame that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frame corresponding to the target head frame, to obtain a face-head binding group.
Step S312, performing IoU calculation between the first human body frames in the head branch information and the second human body frames in the human body branch information.
Step S314, judging whether the IoU of the first human body frame and the second human body frame is smaller than a third threshold; if the IoU of the second human body frame and every first human body frame is smaller than the third threshold, executing the following step S316; if not, i.e., if the IoU of the first human body frame and the second human body frame is greater than or equal to the third threshold, executing the following step S318.
Step S316, configuring default head frame information and default face frame information for the second human body frame to form a face-head binding group.
Step S318, deleting the second human body frame.
Considering that the head and face may be occluded, some human body frames have no corresponding head or face. This embodiment may therefore perform a matching operation between the first human body frames in the head branch information and the second human body frames in the human body branch information, where whether two frames match is measured by their IoU. If the IoU of a first human body frame and a second human body frame exceeds the given threshold, the first human body frame is retained and the second human body frame is deleted, ensuring binding accuracy. If the IoU of a second human body frame with every first human body frame does not exceed the given threshold, the second human body frame is ultimately retained, which also indicates that it has no corresponding head frame and face frame; in this case a face-head binding group with default head frame and face frame information is set, e.g., its face frame and head frame are set to designated placeholders such as [-1, -1, -1, -1]. This establishes the final <face frame, head frame, human body frame> groups with binding relationships.
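A sketch of this reconciliation (steps S312 to S318), reusing the iou helper above; the third threshold of 0.5 is illustrative.

```python
# Sketch: merge the body-branch boxes into the existing face-head groups.
import numpy as np

def merge_body_branch(groups, first_body_boxes, second_body_boxes, t3=0.5):
    placeholder = np.full(4, -1.0)
    for b2 in second_body_boxes:
        matched = (first_body_boxes.shape[0] > 0 and
                   np.any(iou(b2, first_body_boxes) >= t3))
        if matched:
            continue  # step S318: the duplicate second body frame is deleted
        # Step S316: no head-branch body box matches, so keep it with
        # default head frame and face frame information.
        groups.append((placeholder, placeholder, b2))
    return groups
```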
In the above target detection method, the head frames corresponding to the face frames and the human body frames corresponding to the head frames are retained during the human body, head and face feature mapping on the feature map, so that head frames with faces or human body frames with heads are prevented from being lost. Introducing the human body branch information supplements the head branch information and provides a data basis for subsequently establishing the face-head-person binding relationship, which improves the accuracy of face, head and person detection and has practical value.
In addition, since two human body frames are output during feature mapping and they are sometimes duplicates, the same pedestrian may correspond to two human body frames during pedestrian recognition, for example, one being a first human body frame from the head branch and the other a second human body frame from the human body branch. By calculating the IoU of the two human body frames and, according to the relationship between the IoU and the third threshold, selecting only one of them as the human body frame used to subsequently establish the face-head-person binding relationship, duplicate human body frames can be further removed. This ensures that the same pedestrian to be recognized corresponds to only one human body frame, further improving the accuracy and practicability of the detection result.
Example four
On the basis of the second embodiment, an embodiment of the present invention further provides a target detection method that retains the head frames corresponding to the target face frames by modifying their classification probabilities. Specifically, step S308 (i.e., performing deduplication processing on the head frames with a classification probability higher than the second threshold to obtain the target head frame group) is optimized to include: (1) adding a preset value to the classification probability of the head frame corresponding to each target face frame in the target face frame group, where the preset value is greater than 0; (2) taking the head frames with a classification probability higher than the second threshold as the candidate head frame group; (3) based on the principle of preferentially retaining head frames with high classification probability values, performing deduplication processing on the candidate head frame group to obtain the target head frame group. Through this processing, the head frames corresponding to the target face frames are prevented from being deleted in subsequent thresholding, deduplication and other processing, a data basis is provided for subsequently establishing the face-head-person binding relationship, and the accuracy of target pedestrian recognition is further ensured.
Referring to fig. 4, a schematic flow chart of a target detection method is shown, which mainly includes the following steps:
and step S402, acquiring a characteristic diagram of the image to be detected.
Step S404, mapping the characteristics of the human body, the human head and the human face on the characteristic diagram through a pre-trained primary binding detection model to obtain classification information; the classification information includes head branch information including a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame, and a first human body frame.
Step S406, using the face frame with the classification probability higher than the first threshold as a candidate face frame group, and performing de-duplication processing on the candidate face frame group to obtain a target face frame group.
Step S408, respectively adding preset values to the classification probabilities of the head frames corresponding to the target face frames in the target face frame group; wherein the preset value is greater than 0.
For example, 2 is added to the classification probability of the head frame corresponding to each target face frame in the target face frame group, so that the modified classification probabilities of the part of head frames are all larger than 1, and the part of head frames cannot be deleted in the subsequent processing processes of card threshold, duplicate removal and the like.
And step S410, taking the head box with the classification probability higher than the second threshold value as a candidate head box group.
Step S412, based on the principle of preferentially reserving the header frame with high classification probability value, the candidate header frame group is subjected to deduplication processing to obtain the target header frame group.
Specifically, IoU calculating a first candidate header and a second candidate header in the set of candidate headers; if IoU of the first candidate header box and the second candidate header box is greater than the fourth threshold and the classification probability of the first candidate header box is greater than the classification probability of the second candidate header box, the second candidate header box is deleted.
And step S414, binding the target face frame and the target head frame which have corresponding relation in the target face frame group and the target head frame group, and the first human body frame corresponding to the target head frame to obtain a face-head binding group.
Step S416, subtracting the preset value from the classification probabilities in the face-head binding groups to which the preset value was added.
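The boost-and-restore trick of steps S408 to S416 can be sketched as below, reusing the nms helper above; the preset value of 2 follows the example in this embodiment, while the second threshold is illustrative, and face_keep is assumed to be an array of head-frame indices that own a target face frame.

```python
# Sketch: protect heads that own a target face frame during NMS, then
# restore their probabilities to the normal [0, 1] range afterwards.
import numpy as np

PRESET = 2.0

def dedup_heads_with_boost(head_boxes, head_scores, face_keep, t_head=0.3):
    boosted = head_scores.copy()
    boosted[face_keep] += PRESET              # step S408
    cand = np.where(boosted > t_head)[0]      # step S410: thresholding
    keep = [int(cand[i]) for i in nms(head_boxes[cand], boosted[cand])]  # S412
    restored = boosted.copy()
    restored[face_keep] -= PRESET             # step S416, after binding
    return keep, restored
```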
In the above target detection method, during the human body, head and face feature mapping on the feature map, a preset value greater than 0 is added to the classification probability of each head frame corresponding to a face frame, so that the modified classification probabilities of these head frames are greater than 1 and they cannot be deleted in subsequent thresholding, deduplication and other processing. The head frames corresponding to the face frames and the human body frames corresponding to the head frames are thus retained, preventing head frames with faces from being lost, and the face-head binding relationship established on this basis is more accurate and reasonable.
In addition, since the classification probabilities of the head frames are modified in the process of obtaining the target head frames, the classification probabilities of the head frames greater than 1 in the face-head binding groups are restored to their original values after the face-head binding relationship is established. In this way, the classification probabilities output by the preliminary binding detection model (including those of the target face frames and of the target head frames) remain at a normal level (between 0 and 1), ensuring the stability of the model's output.
Example five
On the basis of the third embodiment, an embodiment of the present invention further provides a target detection method. In order to further improve the accuracy of pedestrian recognition, this method optimizes step S210 (i.e., binding the target face frame and target head frame that have a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frame corresponding to the target head frame, to obtain a face-head binding group). Referring to the flowchart of the target detection method shown in fig. 5, the method mainly includes the following steps:
and step S502, acquiring a characteristic diagram of the image to be detected.
Step S504, carrying out human body, human head and human face feature mapping processing on the feature map through a pre-trained preliminary binding detection model to obtain classification information; the classification information includes header branch information and body branch information; the head branch information comprises a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame and a first human body frame; the human body branch information includes a second human body frame and a classification probability corresponding to the second human body frame.
Step S506, taking the face frames with a classification probability higher than the first threshold as the candidate face frame group, and performing deduplication processing on the candidate face frame group to obtain the target face frame group.
Step S508, performing deduplication processing on the head frames with a classification probability higher than the second threshold to obtain the target head frame group.
Step S510, forming a face-head binding pair from the target face frame and the target head frame that have a corresponding relationship in the target face frame group and the target head frame group, and configuring a face frame with default information for each target head frame in the target head frame group to which no target face frame is bound, to obtain the face-head binding pair corresponding to that target head frame.
For example, suppose the target face frame group is: face frame 1, face frame 2, face frame 4 and face frame 5; and the target head frame group is: head frame 1, head frame 2, head frame 3, head frame 4 and head frame 5. Face frame 1 and head frame 1 are obtained from the same feature mapping process, i.e., face frame 1 corresponds to head frame 1; likewise, face frame 2 corresponds to head frame 2, face frame 4 to head frame 4, and face frame 5 to head frame 5. On this basis, the resulting face-head binding pairs can be expressed as: <face frame 1, head frame 1>, <face frame 2, head frame 2>, <face frame 4, head frame 4>, <face frame 5, head frame 5>.
Head frame 3 has no corresponding target face frame, so it is a target head frame to which no target face frame is bound. To avoid losing head frame 3, it is also bound to a face frame, only this face frame carries default information; for example, the position of the face frame bound to head frame 3 is a preset placeholder value such as [-1, -1, -1, -1], and the corresponding face-head binding pair can be expressed as <[-1, -1, -1, -1], head frame 3>.
Step S512, adding the first human body frame corresponding to each target head frame in the target head frame group to the face-head binding pair corresponding to the target head frame to obtain a face-head binding group.
Considering that where there is a face there is a head, this embodiment adds the human body frames on the basis of the face-head binding pairs of the target head frames. Continuing the previous example, we obtain <face frame 1, head frame 1, body frame 1>, <face frame 2, head frame 2, body frame 2>, <[-1, -1, -1, -1], head frame 3, body frame 3>, <face frame 4, head frame 4, body frame 4>, <face frame 5, head frame 5, body frame 5>.
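The worked example above can be reproduced with a few lines of hypothetical data (the numbered frames stand in for real box coordinates):

```python
# Toy reconstruction of the example: head frame 3 has no bound target face
# frame, so it receives the default [-1, -1, -1, -1] face placeholder.
face_of_head = {1: "face frame 1", 2: "face frame 2",
                4: "face frame 4", 5: "face frame 5"}
groups = []
for h in (1, 2, 3, 4, 5):
    face = face_of_head.get(h, [-1, -1, -1, -1])              # step S510
    groups.append((face, f"head frame {h}", f"body frame {h}"))  # step S512
print(groups)
```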
Step S514, performing IoU calculation between the first human body frames in the head branch information and the second human body frames in the human body branch information.
Step S516, judging whether the IoU of the first human body frame and the second human body frame is smaller than the third threshold; if the IoU of the second human body frame and every first human body frame is smaller than the third threshold, executing the following step S518; if not, i.e., if the IoU of the first human body frame and the second human body frame is greater than or equal to the third threshold, executing the following step S520.
Step S518, configuring default head frame information and default face frame information for the second human body frame to form a face-head binding group.
Step S520, deleting the second human body frame.
In the above target detection method, the head frames corresponding to the face frames and the human body frames corresponding to the head frames are retained during the human body, head and face feature mapping, which prevents head frames with faces from being lost. In establishing the face-head-person binding relationship, the face-head binding relationship is established first, and the final face-head-person binding relationship is then established on its basis.
Example six
On the basis of the above embodiment, an embodiment of the present invention further provides a training method for a preliminary binding detection model, which is shown in fig. 6 and includes the following steps:
step S602, acquiring an image sample set; the labeling data of the image samples in the image sample set comprise a face real value frame, a head real value frame, a human body real value frame and a binding relation between the real value frames corresponding to the same person.
In the training stage, the image sample needs to be labeled, and the labeled data in this embodiment includes: the labeling process only needs to label the head real value frame (if the head frame exists) while normally labeling the face real value frame and the human real value frame, and then the binding relationship is stored in a certain form, namely the head real value frame, the head real value frame and the human real value frame of a person are stored in a list, or the same identifiers are configured for the face real value frame, the head real value frame and the human real value frame of the same person.
Step S604, based on the labeled data, allocating sample labels to the anchor point prediction frames corresponding to pixel points in the sample feature maps of the image samples; the sample labels comprise head positive samples, head negative samples, face positive samples, face negative samples, human body positive samples and human body negative samples.
Specifically, for the anchor point prediction frame corresponding to each pixel point in each sample feature map of an image sample, a first IoU between the anchor point prediction frame and the face true value frame in the annotation data and a second IoU between the anchor point prediction frame and the head true value frame in the annotation data are calculated. Face positive and negative samples are assigned to the anchor point prediction frame according to the first IoU, head positive and negative samples are assigned according to the second IoU, and the assigned sample labels are then optimized in the following manner:
(1) A first type of anchor point prediction frame and a second type of anchor point prediction frame are determined from the anchor point prediction frames; the first type is an anchor point prediction frame that has not been assigned a head positive sample but has been assigned a face positive sample, and the second type is an anchor point prediction frame that has been assigned a head positive sample.
(2) A head positive sample is assigned to the first type of anchor point prediction frame, and a human body positive sample is assigned to both the first and second types of anchor point prediction frames.
(3) Whether the head frame of the second type of anchor point prediction frame has a corresponding face frame is checked; if so, a face positive sample is assigned to it, and if not, a face negative sample is assigned.
(4) After optimization of the first and second types of anchor point prediction frames is completed, a face negative sample and a human body negative sample are assigned to the third type of anchor point prediction frame, i.e., those assigned a head negative sample.
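The four optimization rules above can be expressed compactly as a Python sketch. The dictionary layout and the `head_has_face` flag are assumptions for exposition (1 = positive sample, 0 = negative sample, None = neither):

```python
def optimize_labels(anchors):
    """Apply optimization rules (1)-(4) to preliminarily assigned labels.
    Each anchor is a dict with 'head', 'face' and 'body' labels and a
    'head_has_face' flag saying whether the anchor's head prediction frame
    has a corresponding face frame (illustrative layout)."""
    for a in anchors:
        if a["head"] != 1 and a["face"] == 1:
            # first type: face positive sample but no head positive sample
            a["head"] = 1            # rule (2): assign head positive sample
            a["body"] = 1            # rule (2): assign body positive sample
        elif a["head"] == 1:
            # second type: head positive sample already assigned
            a["body"] = 1            # rule (2): assign body positive sample
            a["face"] = 1 if a["head_has_face"] else 0   # rule (3)
        elif a["head"] == 0:
            # third type: head negative sample, rule (4)
            a["face"] = 0
            a["body"] = 0
    return anchors
```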
Step S606, training a preliminary binding detection initial model based on the sample labels assigned to the anchor point prediction frames corresponding to the image samples in the image sample set and the binding relationships, until training is finished, to obtain the preliminary binding detection model.
As a possible implementation, this embodiment provides a neural network model for implementing the method of the foregoing embodiments; see the overall architecture diagram shown in fig. 7. The model includes a backbone network, a feature mapping model and a binding model, and the binding model can be further subdivided into a preliminary binding model and a binding optimization model, where the binding optimization model further refines the face-head binding relationship produced by the preliminary binding model. The feature mapping model and the preliminary binding model may serve as the preliminary binding detection initial model (and, after training, the preliminary binding detection model). To be more intuitive, the head branch information and human body branch information obtained by the feature mapping model, together with their contents, are illustrated in fig. 7.
Based on the neural network model shown in fig. 7, the specific training process of the preliminary binding detection model is as follows:
(1) The image samples marked with true value frame labels are input into the backbone network to extract human face, human head and human body features, obtaining the feature maps corresponding to each image sample.
(2) Frame selection is performed at different positions of each feature map by configuring anchor point frames of fixed sizes, obtaining a set of anchor point frames at different positions on each feature map.
(3) The anchor point frames (i.e., the anchor point prediction frames) are input into the feature mapping model for human body, human head and human face feature mapping processing, obtaining human body branch information and head branch information, where:
the head branch information comprises a face prediction frame, the classification probability of the face prediction frame, a head prediction frame, the classification probability of the head prediction frame, and a first human body prediction frame corresponding to the head prediction frame;
the human body branch information comprises a second human body prediction frame and the classification probability of the second human body prediction frame.
(4) Positive and negative samples are divided. The IoU between each anchor point frame and each true value frame label is computed, and an IoU matrix is constructed in which each row represents an anchor point frame and each column represents a true value frame. Specifically, for each anchor point frame:
First, the IoU between the anchor point frame and the head true value frame is computed; anchor point frames whose IoU is greater than a first specified threshold are taken as head positive samples, and anchor point frames whose IoU is less than a second specified threshold are taken as head negative samples.
Then, the IoU between the anchor point frame and the face true value frame is computed; anchor point frames whose IoU is greater than a third specified threshold are taken as face positive samples, and anchor point frames whose IoU is less than a fourth specified threshold are taken as face negative samples.
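The two-step preliminary assignment can be sketched as follows, reusing the `iou` helper from the sketch in the earlier embodiment; the parameter names and the max-over-true-value-frames formulation are assumptions:

```python
def preliminary_labels(anchor_boxes, head_gt, face_gt, t1, t2, t3, t4):
    """For each anchor point frame, assign preliminary head and face labels
    from its best IoU against the head and face true value frames
    (1 = positive sample, 0 = negative sample, None = neither)."""
    labels = []
    for a in anchor_boxes:
        head_iou = max((iou(a, g) for g in head_gt), default=0.0)
        face_iou = max((iou(a, g) for g in face_gt), default=0.0)
        labels.append({
            "head": 1 if head_iou > t1 else (0 if head_iou < t2 else None),
            "face": 1 if face_iou > t3 else (0 if face_iou < t4 else None),
        })
    return labels
```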
The above process is the preliminary assignment of the head and face labels. To make label assignment more accurate, this embodiment further optimizes the division of positive and negative samples, specifically as follows:
if an anchor point frame is divided into a head negative sample but a face positive sample, the head label of the anchor point frame is modified to a positive sample, i.e., the head label is set to 1 (representing a positive sample);
if an anchor point frame is divided into a head positive sample, its head label is marked as 1, and its face label is marked according to whether a face frame exists within the head prediction frame: if no face frame exists within the head prediction frame of the anchor point frame, the face label is marked as 0 (representing a negative sample); if a face frame exists, the face label is marked as 1; in either case, the human body label of the anchor point frame is marked as 1;
if an anchor point frame is divided into a head negative sample and a face negative sample, its face label, head label and human body label are all marked as 0.
In addition to the head, face and human body positive and negative samples, the label of an anchor point frame may also comprise the coordinate offsets of the face prediction frame, head prediction frame and human body prediction frame corresponding to the anchor point frame, for use in the model training process.
(5) The labeled anchor point frames are input into the preliminary binding detection initial model and model training is performed until the model converges; training then ends and the preliminary binding detection model is obtained.
EXAMPLE seven
For the target detection method provided in the second embodiment, an embodiment of the present invention provides a target detection apparatus, referring to a schematic structural diagram of the target detection apparatus shown in fig. 8, where the apparatus includes the following modules:
and the characteristic map acquisition module 82 is used for acquiring a characteristic map of the image to be detected.
A preliminary binding module 84, configured to perform human body, human head, and human face feature mapping processing on the feature map through a pre-trained preliminary binding detection model to obtain classification information; the classification information includes head branch information including a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame, and a first human body frame.
The first post-processing module 86 is configured to take the face frames with classification probability higher than a first threshold as a candidate face frame group, and perform de-duplication processing on the candidate face frame group to obtain a target face frame group.
The second post-processing module 88 is configured to perform de-duplication processing on the head frames with classification probability higher than a second threshold to obtain a target head frame group.
The binding relationship determining module 90 is configured to bind the target face frame and target head frame having a corresponding relationship in the target face frame group and the target head frame group, together with the first human body frame corresponding to the target head frame, to obtain the face-head-body binding groups.
The target detection apparatus provided by this embodiment performs human body, human head and human face feature mapping processing on the feature map of the image to be detected through a pre-trained preliminary binding detection model to obtain head branch information, where the head branch information includes a face frame, the classification probability of the face frame, a head frame, the classification probability of the head frame, and a first human body frame. Face frames with classification probability higher than a first threshold are taken as a candidate face frame group and de-duplicated to obtain a target face frame group; head frames with classification probability higher than a second threshold are de-duplicated to obtain a target head frame group; and the target face frame and target head frame having a corresponding relationship in the two groups, together with the first human body frame corresponding to the target head frame, are bound to obtain a face-head-body binding group. Acquiring the corresponding human body, human head and human face frames in this way and determining the face-head binding relationship makes the binding relationship more accurate and improves the reliability and practicability of the model.
The preliminary binding module 84 is further configured to perform human body, human head and human face feature mapping processing on the feature map through the preliminary binding detection model to obtain classification information, where the classification information further comprises human body branch information, and the human body branch information comprises a second human body frame and the classification probability corresponding to the second human body frame.
Correspondingly, the binding relationship determining module 90 is further configured to: perform IoU calculation on the first human body frame in the head branch information and the second human body frame in the human body branch information; delete the second human body frame if the IoU of the first human body frame and the second human body frame is greater than or equal to a third threshold; and configure default head frame information and default face frame information for the second human body frame, forming a face-head-body binding group, if the IoU between the second human body frame and each first human body frame is less than the third threshold.
The second post-processing module 88 is further configured to: add a preset value to the classification probability of each head frame corresponding to a target face frame in the target face frame group, where the preset value is greater than 0; take the head frames with classification probability higher than the second threshold as the candidate head frame group; and perform de-duplication processing on the candidate head frame group, based on the principle of preferentially retaining head frames with high classification probability values, to obtain the target head frame group. Correspondingly, the binding relationship determining module 90 is further configured to subtract the preset value from the classification probability of any target head frame in the binding group to which the preset value was added.
The second post-processing module 88 is further configured to: compute the IoU between a first candidate head frame and a second candidate head frame in the candidate head frame group; and delete the second candidate head frame if the IoU of the first and second candidate head frames is greater than a fourth threshold and the classification probability of the first candidate head frame is greater than that of the second candidate head frame.
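A minimal sketch of this de-duplication, again reusing the `iou` helper from the earlier embodiment; the greedy, NMS-style suppression and all names are assumptions rather than the exact implementation:

```python
def dedup_head_frames(heads, probs, bound_to_face,
                      second_threshold, fourth_threshold, preset_value):
    """Temporarily add preset_value (> 0) to the classification probability
    of head frames bound to a target face frame, keep candidates above
    second_threshold, and greedily suppress overlapping frames
    (IoU > fourth_threshold), preferring higher boosted probabilities.
    Returns the kept head frames with their original probabilities."""
    boosted = [p + (preset_value if b else 0.0)
               for p, b in zip(probs, bound_to_face)]
    candidates = [i for i, p in enumerate(boosted) if p > second_threshold]
    candidates.sort(key=lambda i: boosted[i], reverse=True)
    kept = []
    for i in candidates:
        if all(iou(heads[i], heads[j]) <= fourth_threshold for j in kept):
            kept.append(i)
    # subtract the preset value again, i.e. report original probabilities
    return [heads[i] for i in kept], [probs[i] for i in kept]
```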
The binding relationship determining module 90 is further configured to: form a face-head binding pair from the target face frame and target head frame having a corresponding relationship in the target face frame group and the target head frame group; configure a face frame with default information for any target head frame in the target head frame group to which no target face frame is bound, obtaining the face-head binding pair corresponding to that target head frame; and add the first human body frame corresponding to each target head frame in the target head frame group to the face-head binding pair corresponding to that target head frame, obtaining the face-head-body binding groups.
On the basis of fig. 8 above, an embodiment of the present invention provides another target detection apparatus; referring to fig. 9, the apparatus further includes a model training module 94 for: obtaining an image sample set, where the labeling data of the image samples comprises a face true value frame, a head true value frame, a human body true value frame and the binding relationship between the true value frames corresponding to the same person; allocating, based on the labeling data, sample labels to the anchor point prediction frames corresponding to pixel points in the sample feature maps of the image samples, where the sample labels comprise head positive samples, head negative samples, face positive samples, face negative samples, human body positive samples and human body negative samples; and training a preliminary binding detection initial model based on the sample labels and binding relationships assigned to the anchor point prediction frames corresponding to the image samples in the image sample set, until training is finished, to obtain the preliminary binding detection model.
The model training module 94 is further configured to calculate, for anchor point prediction frames corresponding to respective pixel points in each sample feature map of the image sample, a first IoU of the anchor point prediction frame and a face true value frame in the annotation data, and a second IoU of the anchor point prediction frame and a head true value frame in the annotation data; allocating positive face samples and negative face samples to the anchor point prediction blocks according to the first IoU, allocating positive head samples and negative head samples to the anchor point prediction blocks according to the second IoU, and optimizing the allocated sample labels in the following ways:
determining a first type of anchor point prediction frame and a second type of anchor point prediction frame from each anchor point prediction frame; the first type of anchor point prediction frame is an anchor point prediction frame which is not distributed with a head positive sample and is distributed with a face positive sample; the second type of anchor point prediction frame is an anchor point prediction frame which is distributed with head positive samples;
assigning a head positive sample to the first type of anchor point prediction frame, and assigning a human body positive sample to both the first type and the second type of anchor point prediction frames;
checking whether a head frame of the second type anchor point prediction frame has a corresponding face frame, if so, distributing a face positive sample for the second type anchor point prediction frame, and if not, distributing a face negative sample for the second type anchor point prediction frame;
and after the optimization of the first class of anchor point prediction frame and the second class of anchor point prediction frame is completed, distributing a face negative sample and a human body negative sample for the third class of anchor point prediction frame distributed with the head negative sample.
The apparatus provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the apparatus embodiments are not mentioned, reference may be made to the corresponding contents in the method embodiments. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
EXAMPLE eight
Referring to fig. 10, an embodiment of the present invention further provides an object detection apparatus 200, including: the system comprises a processor 10, a memory 11, a bus 12 and a communication interface 13, wherein the processor 10, the communication interface 13 and the memory 11 are connected through the bus 12; the processor 10 is arranged to execute executable modules, such as computer programs, stored in the memory 11.
The memory 11 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between a network element of the system and at least one other network element is realized through at least one communication interface 13 (which may be wired or wireless), using the Internet, a wide area network, a local area network, a metropolitan area network, or the like.
The bus 12 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 10, but this does not indicate only one bus or one type of bus.
The memory 11 is configured to store a program, and the processor 10 executes the program after receiving an execution instruction. The method executed by the apparatus defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 10 or implemented by the processor 10.
The processor 10 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 10 or by instructions in the form of software. The processor 10 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or executed thereby. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 11, and the processor 10 reads the information in the memory 11 and completes the steps of the above method in combination with its hardware.
Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product of the readable storage medium provided in the embodiment of the present invention includes a computer readable storage medium storing a program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the foregoing method embodiment, which is not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An object detection method, characterized in that the method is applied to an electronic device; the method comprises the following steps:
acquiring a characteristic diagram of an image to be detected;
carrying out human body, human head and human face feature mapping processing on the feature map through a pre-trained preliminary binding detection model to obtain classification information; the classification information comprises head branch information, and the head branch information comprises a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame and a first human body frame;
taking the face frames with the classification probability higher than a first threshold value as candidate face frame groups, and performing de-duplication processing on the candidate face frame groups to obtain target face frame groups;
carrying out duplication removal processing on the head frames with the classification probability higher than a second threshold value to obtain a target head frame group;
and binding the target face frame and the target head frame which have a corresponding relation in the target face frame group and the target head frame group, and the first human body frame corresponding to the target head frame, to obtain a face-head-body binding group.
2. The method of claim 1, wherein the classification information further comprises human body branch information, the human body branch information comprising a second human body frame and a classification probability corresponding to the second human body frame; the method further comprises the following steps:
IoU calculation is carried out on a first human body frame in the head branch information and a second human body frame in the human body branch information;
deleting the second human body frame if the IoU of the first human body frame and the second human body frame is greater than or equal to a third threshold;
and if the IoU of the second human body frame and each first human body frame is smaller than the third threshold, configuring default head frame information and default face frame information for the second human body frame to form a face-head-body binding group.
3. The method of claim 1, wherein the step of performing de-duplication processing on the head frames with classification probability higher than the second threshold to obtain the target head frame group comprises:
respectively adding preset values to the classification probabilities of the head frames corresponding to the target face frames in the target face frame group; wherein the preset value is greater than 0;
taking the head frames with classification probability higher than the second threshold as a candidate head frame group;
and performing de-duplication processing on the candidate head frame group, based on the principle of preferentially retaining head frames with high classification probability values, to obtain the target head frame group;
after the step of obtaining the face-head-body binding group, the method further comprises: subtracting the preset value from the classification probability of the target head frame, in the binding group, to which the preset value was added.
4. The method of claim 3, wherein the step of performing de-duplication processing on the candidate head frame group, based on the principle of preferentially retaining head frames with high classification probability values, to obtain the target head frame group comprises:
computing the IoU between a first candidate head frame and a second candidate head frame in the candidate head frame group;
and deleting the second candidate head frame if the IoU of the first candidate head frame and the second candidate head frame is greater than a fourth threshold and the classification probability of the first candidate head frame is greater than the classification probability of the second candidate head frame.
5. The method according to claim 1, wherein the step of binding the target face frame and the target head frame having a corresponding relationship in the target face frame group and the target head frame group, and the first human body frame corresponding to the target head frame, to obtain the face-head-body binding group comprises:
forming a face-head binding pair by the target face frame and the target head frame which have corresponding relation in the target face frame group and the target head frame group;
configuring a face frame with default information for a target head frame of the target head frame group to which a target face frame is not bound, and obtaining a face-head binding pair corresponding to the target head frame;
and adding the first human body frame corresponding to each target head frame in the target head frame group to the face-head binding pair corresponding to that target head frame to obtain the face-head-body binding group.
6. The method of claim 1, wherein the training process of the preliminary binding detection model comprises:
acquiring an image sample set; the labeling data of the image samples in the image sample set comprise a face real value frame, a head real value frame, a human body real value frame and a binding relation between the real value frames corresponding to the same person;
based on the labeling data, allocating a sample label to an anchor point prediction frame corresponding to a pixel point in a sample feature map of the image sample; the sample label comprises a head positive sample, a head negative sample, a face positive sample, a face negative sample, a human body positive sample and a human body negative sample;
and training a preliminary binding detection initial model based on the sample labels distributed by the anchor point prediction frames corresponding to the image samples in the image sample set and the binding relationship until the training is finished to obtain the preliminary binding detection model.
7. The method of claim 6, wherein the step of assigning a sample label to an anchor point prediction box corresponding to a pixel point in a sample feature map of the image sample based on the annotation data comprises:
for anchor point prediction frames corresponding to each pixel point in each sample feature map of the image sample, computing a first IoU of the anchor point prediction frames and a face true value frame in the annotation data and a second IoU of the anchor point prediction frames and a head true value frame in the annotation data;
allocating a positive face sample and a negative face sample to the anchor point prediction block according to the first IoU, allocating a positive head sample and a negative head sample to the anchor point prediction block according to the second IoU, and optimizing the allocated sample labels in the following ways:
determining a first type of anchor point prediction frame and a second type of anchor point prediction frame from each anchor point prediction frame; the first type of anchor point prediction frame is an anchor point prediction frame which is not allocated with a head positive sample and is allocated with a face positive sample; the second type of anchor point prediction frame is an anchor point prediction frame which is distributed with a head positive sample;
assigning a head positive sample to the first type of anchor point prediction frame, and assigning a human body positive sample to the first type and the second type of anchor point prediction frames;
checking whether a head frame of the second type anchor point prediction frame has a corresponding face frame, if so, distributing a face positive sample for the second type anchor point prediction frame, and if not, distributing a face negative sample for the second type anchor point prediction frame;
and after the optimization of the first type of anchor point prediction frame and the second type of anchor point prediction frame is completed, distributing a face negative sample and a human body negative sample for a third type of anchor point prediction frame distributed with a head negative sample.
8. An object detection apparatus, applied to an electronic device, includes:
the characteristic diagram acquisition module is used for acquiring a characteristic diagram of an image to be detected;
the preliminary binding module is used for carrying out human body, human head and human face feature mapping processing on the feature map through a pre-trained preliminary binding detection model to obtain classification information; the classification information comprises head branch information, and the head branch information comprises a face frame, a classification probability of the face frame, a head frame, a classification probability of the head frame and a first human body frame;
the first post-processing module is used for taking the face frames with the classification probability higher than a first threshold value as candidate face frame groups, and performing de-duplication processing on the candidate face frame groups to obtain target face frame groups;
the second post-processing module is used for carrying out duplication elimination processing on the head frames with the classification probability higher than a second threshold value to obtain a target head frame group;
and the binding relationship determining module is used for binding the target face frame and the target head frame which have a corresponding relation in the target face frame group and the target head frame group, and the first human body frame corresponding to the target head frame, to obtain the face-head-body binding group.
9. An electronic system, characterized in that the electronic system comprises: the device comprises an image acquisition device, a processing device and a storage device;
the image acquisition equipment is used for acquiring an image to be identified;
the storage means having stored thereon a computer program which, when executed by the processing apparatus, performs the method of any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processing device, is adapted to carry out the steps of the method according to any one of claims 1 to 7.
CN202011461531.2A 2020-12-07 2020-12-07 Target detection method, device and electronic system Pending CN112613540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011461531.2A CN112613540A (en) 2020-12-07 2020-12-07 Target detection method, device and electronic system

Publications (1)

Publication Number Publication Date
CN112613540A true CN112613540A (en) 2021-04-06

Family

ID=75233728

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866746A (en) * 2019-11-20 2020-03-06 厦门瑞为信息技术有限公司 Order matching method based on face recognition and computer equipment
CN111144215A (en) * 2019-11-27 2020-05-12 北京迈格威科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111145215A (en) * 2019-12-25 2020-05-12 北京迈格威科技有限公司 Target tracking method and device
CN112001230A (en) * 2020-07-09 2020-11-27 浙江大华技术股份有限公司 Sleeping behavior monitoring method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination