CN111582207B - Image processing method, device, electronic equipment and storage medium - Google Patents
Image processing method, device, electronic equipment and storage medium
- Publication number
- CN111582207B (application CN202010403620.5A)
- Authority
- CN
- China
- Prior art keywords
- target object
- target
- image
- depth
- reference node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
- Signal Processing For Digital Recording And Reproducing (AREA)
Abstract
The present disclosure provides an image processing method, an apparatus, an electronic device, and a storage medium, wherein the method includes: identifying a target region of a target object in a first image; determining, based on the target region corresponding to the target object, first two-dimensional position information in the first image of a plurality of key points characterizing the pose of the target object, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system; and determining three-dimensional position information of the plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depths, and the absolute depth of the target object. In this way, the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system can be obtained more accurately based on the first two-dimensional position information of the target object, the relative depths with respect to the reference node, and the absolute depth of the reference node.
Description
Technical Field
The disclosure relates to the technical field of image processing, and in particular relates to an image processing method, an image processing device, electronic equipment and a storage medium.
Background
The three-dimensional human body posture detector is widely applied to the fields of security protection, games, entertainment and the like. Current three-dimensional human body posture detection methods generally identify first two-dimensional position information of human body key points in an image, and then convert the first two-dimensional position information into three-dimensional position information according to a predetermined positional relationship between the human body key points.
The human body poses obtained by current three-dimensional human body posture detection methods therefore have large errors.
Disclosure of Invention
The embodiment of the disclosure at least provides an image processing method, an image processing device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including: identifying a target region of a target object in the first image; determining, based on the target region corresponding to the target object, first two-dimensional position information in the first image of a plurality of key points characterizing the pose of the target object, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system; and determining three-dimensional position information of the plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depths, and the absolute depth of the target object.
In this way, the embodiments of the present disclosure can obtain the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system more accurately. This three-dimensional position information represents the three-dimensional pose of the target object, and the higher the precision of the three-dimensional position information, the higher the precision of the obtained three-dimensional pose of the target object.
In a possible implementation manner, the identifying the target area of the target object in the first image includes: extracting features of the first image to obtain a feature map of the first image; and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes which are generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
Therefore, the target area of the target object is determined in two steps, and the position of each target object in the first image can be accurately detected from the first image, so that the integrity of human body information and the detection precision in the subsequent key point detection process are improved.
In a possible implementation manner, the determining, based on the target bounding box, a target area corresponding to the target object includes: determining a feature subgraph corresponding to each target bounding box based on a plurality of target bounding boxes and the feature graphs; and carrying out bounding box regression processing based on the feature subgraphs respectively corresponding to the target bounding boxes to obtain a target area corresponding to the target object.
In this way, the feature subgraphs corresponding to the target bounding boxes are subjected to bounding box regression processing, so that the positions of the target objects in the first image can be accurately detected from the first image.
In a possible implementation manner, determining an absolute depth of a reference node of the target object in a camera coordinate system based on a target area corresponding to the target object includes: determining a target feature map corresponding to the target object based on a target area corresponding to the target object and the first image; performing depth recognition processing based on a target feature map corresponding to the target object to obtain normalized absolute depth of a reference node of the target object; and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In this way, it can be avoided as much as possible that, owing to different camera intrinsic parameters, predicting the absolute depth of the reference node directly from the target feature map yields different absolute depths for first images acquired by different cameras at the same viewing angle and the same position.
In a possible implementation manner, performing depth recognition processing based on a target feature map corresponding to the target object to obtain a normalized absolute depth of a reference node of the target object includes: acquiring an initial depth image based on the first image, wherein the pixel value of any first pixel point in the initial depth image represents the initial depth value, in the camera coordinate system, of the second pixel point at the corresponding position in the first image; determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image; and determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
In this way, the normalized absolute depth of the reference node obtained by this process can be made more accurate.
In a possible implementation manner, the determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object includes: performing at least one-level first convolution processing on a target feature map corresponding to the target object to obtain a feature vector of the target object; splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one-stage second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.
In a possible implementation manner, the image processing method is applied to a pre-trained neural network; the neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are used to obtain the target region of the target object, the first two-dimensional position information and the relative depths of the target object, and the absolute depth, respectively.
In this way, the three branch networks, namely the target detection network, the key point detection network and the depth prediction network, form an end-to-end target object pose detection framework; the first image is processed based on this framework to obtain the three-dimensional position information, in the camera coordinate system, of the plurality of key points of each target object in the first image, with a higher processing speed and higher recognition accuracy.
In a second aspect, an embodiment of the present disclosure further provides an image processing apparatus, including: an identification module, configured to identify a target area of a target object in the first image; a first detection module, configured to determine, based on the target area corresponding to the target object, first two-dimensional position information in the first image of a plurality of key points characterizing the pose of the target object, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system; and a second detection module, configured to determine three-dimensional position information of the plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depths, and the absolute depth of the target object.
In a possible implementation manner, the identifying module is configured, when identifying a target area of a target object in the first image, to: extracting features of the first image to obtain a feature map of the first image; and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes which are generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
In a possible implementation manner, the identification module is configured to, when determining, based on the target bounding box, a target area corresponding to the target object: determining a feature subgraph corresponding to each target bounding box based on a plurality of target bounding boxes and the feature graphs; and carrying out bounding box regression processing based on the feature subgraphs respectively corresponding to the target bounding boxes to obtain a target area corresponding to the target object.
In a possible implementation manner, the first detection module is configured to, when determining, based on a target area corresponding to the target object, an absolute depth of a reference node of the target object in a camera coordinate system: determining a target feature map corresponding to the target object based on a target area corresponding to the target object and the first image; performing depth recognition processing based on a target feature map corresponding to the target object to obtain normalized absolute depth of a reference node of the target object; and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In a possible implementation manner, the first detection module is configured to, when performing depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object: acquire an initial depth image based on the first image, wherein the pixel value of any first pixel point in the initial depth image represents the initial depth value, in the camera coordinate system, of the second pixel point at the corresponding position in the first image; determine second two-dimensional position information of a reference node corresponding to the target object in the first image based on the target feature map corresponding to the target object, and determine an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image; and determine the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
In a possible implementation manner, the first detection module is configured to, when determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object: performing at least one-level first convolution processing on a target feature map corresponding to the target object to obtain a feature vector of the target object; splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one-stage second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.
In a possible implementation manner, the image processing device is provided with a pre-trained neural network; the neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are used to obtain the target region of the target object, the first two-dimensional position information and the relative depths of the target object, and the absolute depth, respectively.
In a third aspect, embodiments of the present disclosure further provide a computer device, comprising: a processor and a memory connected to each other, the memory storing machine-readable instructions executable by the processor; when the computer device runs, the machine-readable instructions are executed by the processor to implement the steps of the image processing method of the first aspect or of any possible implementation manner of the first aspect.
In a fourth aspect, the disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the image processing method of the first aspect, or any of the possible implementation manners of the first aspect.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below, which are incorporated in and constitute a part of the specification, these drawings showing embodiments consistent with the present disclosure and together with the description serve to illustrate the technical solutions of the present disclosure. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope, for the person of ordinary skill in the art may admit to other equally relevant drawings without inventive effort.
FIG. 1 illustrates a flow chart of an image processing method provided by an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a particular method of identifying a target region of a target object in a first image provided by an embodiment of the present disclosure;
FIG. 3 illustrates a specific example of determining a target region corresponding to a target object based on a target bounding box provided by an embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a particular method of determining an absolute depth of a reference node of a target object in a camera coordinate system provided by an embodiment of the present disclosure;
FIG. 5 illustrates a flow chart of another specific method provided by embodiments of the present disclosure for deriving a normalized absolute depth of a reference node;
FIG. 6 illustrates a specific example of a target object gesture detection framework provided by embodiments of the present disclosure;
FIG. 7 illustrates a specific example of another target object pose detection framework provided by embodiments of the present disclosure;
FIG. 8 shows a schematic diagram of an image processing apparatus provided by an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
Current three-dimensional human body posture detection methods generally identify first two-dimensional position information of human body key points in an image to be recognized through a neural network, and then convert the first two-dimensional position information of each human body key point into three-dimensional position information according to predetermined positional relationships between the human body key points (such as connection relationships between different key points, distance ranges between adjacent key points, and the like). However, the human body is complex and changeable, and the positional relationships between the key points of different human bodies differ, so the three-dimensional human body posture obtained in this way has a large error.
In addition, the precision of current three-dimensional human body posture detection methods relies on accurate estimation of the human body key points; however, because of occlusion by clothes, limbs and the like, the human body key points often cannot be accurately identified from the image, so the error of the three-dimensional human body posture obtained in this way can be further enlarged.
In view of the above problems, the present disclosure provides an image processing method, an image processing apparatus, an electronic device, and a storage medium.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
Based on the above study, the present disclosure provides an image processing method and apparatus, by identifying a target region of a target object in a first image, and determining, based on the target region, first two-dimensional position information of a plurality of key points characterizing a pose of the target object in the first image, a relative depth of each key point with respect to a reference node of the target object, and an absolute depth of the reference node of the target object in a camera coordinate system, respectively, thereby more accurately obtaining three-dimensional position information of the plurality of key points of the target object in the camera coordinate system, respectively, based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
For the sake of understanding the present embodiment, an image processing method disclosed in an embodiment of the present disclosure is first described in detail. The execution subject of the image processing method provided in the embodiments of the present disclosure is generally a computer device having a certain computing capability, and the computer device includes, for example, a terminal device, a server, or another processing device. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc. In some possible implementations, the image processing method may be implemented by way of a processor invoking computer-readable instructions stored in a memory.
The image processing method provided in the embodiment of the present disclosure is described below by taking an execution subject as a terminal device as an example.
Referring to fig. 1, a flowchart of an image processing method according to an embodiment of the disclosure is shown, where the method includes steps S101 to S103, where:
s101: identifying a target region of a target object in the first image;
s102: determining first two-dimensional position information of a plurality of key points representing the gesture of the target object in the first image, relative depth of each key point relative to a reference node of the target object and absolute depth of the reference node of the target object in a camera coordinate system based on a target area corresponding to the target object;
s103: three-dimensional position information of a plurality of key points of the target object in the camera coordinate system is determined based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
The following describes the above-mentioned S101 to S103 in detail.
I: in S101, at least one target object is included in the first image. The target object includes, for example, an object whose posture is to be determined, such as a person, an animal, a robot, a vehicle, or the like.
In a possible embodiment, when more than one target object is included in the first image, the categories of different target objects may be the same or different; for example, the plurality of target objects are all people; or the plurality of target objects are vehicles. For another example, the target object in the first image includes: humans and animals; or the target object in the first image comprises a person and a vehicle, and the target object class is determined according to the actual application scene requirement.
The target area of the target object refers to an area including the target object in the first image.
Exemplary, referring to fig. 2, an embodiment of the present disclosure provides a specific method for identifying a target area of a target object in a first image, including:
s201: and extracting the characteristics of the first image to obtain a characteristic diagram of the first image.
Here, the feature extraction may be performed on the first image using, for example, a neural network to obtain a feature map of the first image.
S202: and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes which are generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
In implementations, a plurality of target bounding boxes may be obtained, for example, using a bounding box prediction algorithm such as RoIAlign or ROI-Pooling. Taking RoIAlign as an example, RoIAlign may traverse a plurality of candidate bounding boxes generated in advance and determine, for each candidate bounding box, a region-of-interest (region of interest, ROI) value for the sub-image corresponding to that candidate bounding box; the higher the ROI value, the greater the probability that the sub-image corresponding to the candidate bounding box belongs to a certain target object in the first image. After the ROI value corresponding to each candidate bounding box is determined, a plurality of target bounding boxes are determined from the candidate bounding boxes in descending order of their ROI values.
The target bounding box is, for example, rectangular; the information of the target bounding box includes, for example: coordinates of any vertex in the target bounding box in the first image, as well as height and width values of the target bounding box. Alternatively, the information of the target bounding box includes, for example: coordinates of any vertex in the target bounding box in the feature map of the first image, and height and width values of the target bounding box.
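As a minimal illustration of the selection step described above, candidate bounding boxes can be ranked by their ROI values and the highest-scoring ones kept as target bounding boxes. The sketch below is illustrative only; the function and variable names are not taken from the patent.

```python
import numpy as np

def select_target_bounding_boxes(candidate_boxes, roi_values, k=32):
    """Keep the k candidate boxes with the highest ROI values as target bounding boxes.

    candidate_boxes: (M, 4) array, each row (x, y, w, h) in image coordinates.
    roi_values: (M,) array; a higher value means the corresponding sub-image is more
                likely to belong to some target object in the first image.
    """
    order = np.argsort(-roi_values)   # indices sorted by descending ROI value
    keep = order[:k]                  # the top-k candidates become target bounding boxes
    return candidate_boxes[keep], roi_values[keep]

# Usage with dummy data
boxes = np.random.rand(500, 4) * 224.0
values = np.random.rand(500)
target_boxes, target_values = select_target_bounding_boxes(boxes, values, k=32)
```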
After the plurality of target bounding boxes are obtained, the target areas respectively corresponding to the target objects in the first image are determined based on the target bounding boxes.
Referring to fig. 3, an embodiment of the present disclosure provides a specific example of determining a target area corresponding to a target object based on a target bounding box, including:
s301: and determining a feature subgraph corresponding to each target boundary box based on a plurality of target boundary boxes and the feature graphs.
In a specific implementation, in the case where the information of a target bounding box includes the coordinates of any vertex of the target bounding box in the first image and the height value and width value of the target bounding box, the feature points in the feature map and the pixel points in the first image have a certain position mapping relationship; according to the information of each target bounding box and the mapping relationship between the feature map and the first image, the feature subgraphs respectively corresponding to the target bounding boxes are determined from the feature map of the first image.
In the case where the information of the target bounding box includes coordinates of any vertex of the target bounding box in the feature map of the first image, and the height value and the width value of the target bounding box, feature subgraphs corresponding to the respective target bounding boxes may be determined from the feature map of the first image directly based on the target bounding box.
S302: and carrying out bounding box regression processing based on the feature subgraphs respectively corresponding to the target bounding boxes to obtain a target area corresponding to the target object.
Here, for example, a bounding box regression algorithm may be used to perform a bounding box regression process on the target bounding box based on the feature subgraphs corresponding to each target bounding box, so as to obtain a plurality of bounding boxes including the complete target object. Each of the plurality of bounding boxes corresponds to a target object, and the region determined based on the bounding box corresponding to the target object is the target region corresponding to the target object.
At this time, the number of the obtained target areas is consistent with the number of target objects in the first image, and each target object corresponds to one target area; if different target objects occlude each other, the target areas corresponding to these mutually occluding target objects overlap to a certain degree.
In another embodiment of the present disclosure, other target detection algorithms may also be employed to determine the target area of the target object in the first image. For example, a semantic segmentation algorithm is adopted to determine the semantic segmentation result of each pixel point in the first image, and the positions of the pixel points belonging to different target objects in the first image are then determined according to the semantic segmentation result; a minimum bounding box is then solved for the pixel points belonging to the same target object, and the area corresponding to the minimum bounding box is determined as the target area of the target object, as sketched below.
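A minimal sketch of this alternative, assuming a binary mask per target object obtained from the semantic segmentation result; the helper name is illustrative only.

```python
import numpy as np

def min_bounding_box(mask):
    """Return (x, y, w, h) of the minimum axis-aligned box enclosing a binary mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                            # no pixel belongs to this target object
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    return (int(x_min), int(y_min), int(x_max - x_min + 1), int(y_max - y_min + 1))

# Usage with a toy mask
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True
print(min_bounding_box(mask))                  # (3, 2, 4, 3)
```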
II: in S102, the image coordinate system is a two-dimensional coordinate system established in two directions of a length and a width of the first image; the camera coordinate system refers to a three-dimensional coordinate system established in the direction in which the optical axis of the camera is located and in two directions in a plane parallel to the optical axis and in which the optical center of the camera is located.
The key points of the target object are position points that are located on the target object and that, after being connected in a certain order, can represent the pose of the target object; for example, when the target object is a human body, the key points include the position points where the joints of the human body are located. A position point is expressed as a two-dimensional coordinate value in the image coordinate system, and as a three-dimensional coordinate value in the camera coordinate system.
In a specific implementation, for example, a keypoint detection network may be used to perform a keypoint detection process based on a target feature map of the target object, so as to obtain two-dimensional position information of multiple keypoints of the target object in the first image, and a relative depth of each keypoint with respect to a reference node of the target object. Here, the method for obtaining the target feature map may refer to the following description of S401, which is not repeated here.
The reference node is, for example, a position point at which a certain portion is predetermined on the target object. For example, the reference node may be predetermined according to actual needs; for example, when the target object is a human body, a position point where the pelvis of the human body is located may be determined as a reference node, or any key point on the human body may be determined as a reference node, or a position point where the center of the chest and abdomen of the human body is located may be determined as a reference node; specifically, the setting can be performed as needed.
Referring to fig. 4, an embodiment of the disclosure provides a specific method for determining an absolute depth of a reference node of a target object in a camera coordinate system based on a target area corresponding to the target object, including:
s401: and determining a target feature map corresponding to the target object based on the target area corresponding to the target object and the first image.
Here, the target feature map may be determined, for example, from the feature map of the first image obtained by performing feature extraction on the first image, based on the target region.
Here, the feature points in the feature map extracted for the first image and the pixel points in the first image have a certain position mapping relationship; after the target areas of the target objects are obtained, the positions of the target objects in the feature images of the first image can be determined according to the position mapping relation, and then the target feature images of the target objects are intercepted from the feature images of the first image.
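A minimal sketch of this cropping step, assuming the feature map is a stride-s downsampled version of the first image; the function name and stride value are illustrative assumptions.

```python
import numpy as np

def crop_target_feature_map(feature_map, target_area, stride=8):
    """Cut the target object's feature sub-map out of the feature map of the first image.

    feature_map: (C, H, W) features of the first image.
    target_area: (x, y, w, h) in first-image pixel coordinates.
    stride: downsampling factor between the first image and the feature map.
    """
    x, y, w, h = target_area
    x0, y0 = int(x // stride), int(y // stride)
    x1, y1 = int(np.ceil((x + w) / stride)), int(np.ceil((y + h) / stride))
    return feature_map[:, y0:y1, x0:x1]
```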
S402: and executing depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object.
Here, in one possible implementation, for example, a pre-trained depth prediction network may be used to perform a depth detection process on the target feature map, resulting in a normalized absolute depth of the reference node of the target object.
In another embodiment of the present disclosure, referring to fig. 5, another specific method for obtaining a normalized absolute depth of a reference node is provided, including:
s501: acquiring an initial depth image based on the first image; the pixel value of any first pixel point in the initial depth image represents the initial depth value of a second pixel point corresponding to the first pixel point position in the first image in the camera coordinate system.
Here, an initial depth value of each pixel point (second pixel point) in the first image is determined using a depth prediction network; these initial depth values form the initial depth image of the first image. The pixel value of any pixel point (first pixel point) in the initial depth image is the initial depth value of the pixel point (second pixel point) at the corresponding position in the first image.
S502: and determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on the target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image.
Here, the target feature map corresponding to the target object may be, for example, a target feature map determined for each target object from the feature maps of the first image based on the target region corresponding to each target object.
After the target feature graphs corresponding to the target objects are obtained, a pre-trained reference node detection network can be utilized, and second two-dimensional position information of the reference nodes of the target objects in the first image can be determined based on the target feature graphs. And then determining a pixel point corresponding to the reference node from the initial depth image by using the second two-dimensional position information, and determining the pixel value of the pixel point determined from the initial depth image as the initial depth value of the reference node.
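The lookup described above can be sketched as follows, assuming the initial depth image has the same resolution as the first image; the function and variable names are illustrative only.

```python
import numpy as np

def reference_node_initial_depth(initial_depth_image, ref_node_xy):
    """Read the initial depth value at the reference node's 2D position.

    initial_depth_image: (H, W) array of per-pixel initial depth values.
    ref_node_xy: (x, y) second two-dimensional position of the reference node
                 in first-image pixel coordinates.
    """
    x, y = ref_node_xy
    h, w = initial_depth_image.shape
    xi = int(round(min(max(x, 0), w - 1)))     # clamp to image bounds
    yi = int(round(min(max(y, 0), h - 1)))
    return float(initial_depth_image[yi, xi])
```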
S503: and determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
For example, at least one first convolution process may be performed on the target feature map corresponding to the target object to obtain a feature vector of the target object; splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one-stage second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.
Here, for example, a neural network for adjusting the initial depth value may be employed, the neural network including a plurality of convolution layers; some of the convolution layers are used to perform at least one level of first convolution processing on the target feature map, and the other convolution layers are used to perform at least one level of second convolution processing on the spliced vector so as to obtain the correction value; the initial depth value is then adjusted according to the correction value to obtain the normalized absolute depth of the reference node of the target object.
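A minimal PyTorch-style sketch of this correction step follows; the channel sizes, layer structure and names are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class DepthCorrectionHead(nn.Module):
    """Refine the reference node's initial depth value using the target feature map."""

    def __init__(self, in_channels=256, feat_dim=128):
        super().__init__()
        # "at least one level of first convolution": target feature map -> feature vector
        self.first_conv = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),           # pool to a 1x1 map, i.e. a feature vector
        )
        # "at least one level of second convolution" applied to the spliced vector
        self.second_conv = nn.Sequential(
            nn.Conv1d(feat_dim + 1, 64, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(64, 1, kernel_size=1),   # outputs the correction value
        )

    def forward(self, target_feature_map, initial_depth):
        # target_feature_map: (B, C, H, W); initial_depth: (B, 1)
        feat = self.first_conv(target_feature_map).flatten(1)             # (B, feat_dim)
        spliced = torch.cat([feat, initial_depth], dim=1)                 # splice vector + depth
        correction = self.second_conv(spliced.unsqueeze(-1)).squeeze(-1)  # (B, 1)
        return initial_depth + correction      # normalized absolute depth of the reference node

# Usage with dummy tensors
head = DepthCorrectionHead()
z_norm = head(torch.randn(2, 256, 16, 16), torch.tensor([[3.1], [2.4]]))
```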
In view of S402 above, the specific method for determining the absolute depth of the reference node of the target object in the camera coordinate system according to the embodiment of the present disclosure further includes:
s403: and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In a specific implementation, different first images may be photographed by different cameras in the process of performing image processing on them, and the corresponding camera intrinsic parameters may differ between cameras; here, the camera intrinsic parameters include, for example: the focal length of the camera on the x-axis, the focal length of the camera on the y-axis, and the coordinates of the optical center of the camera on the x-axis and y-axis of the camera coordinate system.
When the camera intrinsic parameters differ, even first images acquired at the same viewing angle and the same position differ; if the absolute depth of the reference node were predicted directly based on the target feature map, the absolute depths obtained from first images acquired by different cameras at the same viewing angle and the same position would therefore differ.
To avoid the above, embodiments of the present disclosure directly predict the normalized depth of the reference node, which is obtained without considering the camera internal parameters; the absolute depth of the reference node is then recovered from the camera internal parameters and the normalized absolute depth.
Illustratively, the normalized absolute depth and the absolute depth of the reference node of any target object satisfy the following formula (1):

Z = Z_norm · sqrt( (f_x · f_y · S_box) / S_area )    (1)

where Z_norm denotes the normalized absolute depth of the reference node; Z denotes the absolute depth of the reference node; S_area denotes the area of the target area; S_box denotes the area of the target bounding box; and f_x and f_y denote the focal lengths of the camera. The camera coordinate system is a three-dimensional coordinate system with three coordinate axes x, y and z; the origin of the camera coordinate system is the optical center of the camera; the optical axis of the camera is the z-axis of the camera coordinate system; the plane that contains the optical center and is perpendicular to the z-axis is the plane in which the x-axis and y-axis lie; f_x is the focal length of the camera on the x-axis; f_y is the focal length of the camera on the y-axis.
It should be noted here that, as known in S202 described above, there are a plurality of target bounding boxes determined by RoIAlign; and the areas of the target bounding boxes are all equal.
Since the focal length of the camera has been determined at the time the camera acquired the first image and the target region and the target bounding box have been determined at the time the target region was determined, the absolute depth of the reference node of the target object is obtained according to the above formula (1) after the normalized absolute depth of the reference node is obtained.
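Assuming the form of formula (1) as reconstructed above, the recovery of the absolute depth from the normalized absolute depth can be sketched as follows; all names and numbers are illustrative.

```python
import math

def recover_absolute_depth(z_norm, fx, fy, box_area, target_area):
    """Recover the absolute depth of the reference node from its normalized absolute depth.

    z_norm: normalized absolute depth predicted by the depth prediction network.
    fx, fy: camera focal lengths on the x-axis and y-axis (in pixels).
    box_area: area of the (equal-sized) target bounding box, in pixels^2.
    target_area: area of the target area containing the complete target object, in pixels^2.
    """
    return z_norm * math.sqrt(fx * fy * box_area / target_area)

# Example: fx = fy = 1500 px, a 256x256 target bounding box, a 180x420 target area
z_abs = recover_absolute_depth(1.1, 1500.0, 1500.0, 256 * 256, 180 * 420)
```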
III: in S103 described above, it is assumed that each target object includes J keypoints, and that there are N target objects in the first image; wherein, the three-dimensional gesture of the N target objects is expressed as:。
wherein, the three-dimensional gesture of the mth target objectCan be expressed as:
。
wherein,,coordinate values representing the x-axis direction of the jth key point of the mth target object in a camera coordinate system;coordinate values representing the j-th key point of the m-th target object in the y-axis direction in a camera coordinate system; />And the coordinate value of the jth key point of the mth target object in the z-axis direction in the camera coordinate system is represented.
The target areas of the N target objects are expressed as: . Wherein the target area of the mth target object +.>Expressed as:
the method comprises the steps of carrying out a first treatment on the surface of the Here, the->And->Coordinate values representing vertices where the upper left corner of the target area is located; />And->The width value and the height value of the target area are respectively represented.
The three-dimensional pose of the N target objects with respect to the reference node is expressed as:the method comprises the steps of carrying out a first treatment on the surface of the Wherein the three-dimensional posture of the mth target object with respect to the reference node +.>Expressed as:
the method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>Coordinate values of an x-axis of a jth key point representing an mth target object in an image coordinate system; />Coordinate values of a y-axis of a jth key point representing an mth target object in an image coordinate system; i.e. ->And the two-dimensional coordinate value of the jth key point of the mth target object in the image coordinate system is represented.
Representing the relative depth of the jth node of the mth target object with respect to the reference node of the mth target object.
Obtaining a three-dimensional pose of an mth target object by back projection using an internal reference matrix K of the camera, wherein three-dimensional coordinate information of a jth node of the mth target object satisfies the following formula (2)
(2)
Wherein,,representing the absolute depth value of the reference node of the mth target object in the camera coordinate system. Here, it should be noted that this +. >Obtained based on the corresponding example of the above formula (1).
The internal reference matrix K is, for example:;
wherein:focal length of the camera on the x-axis in the camera coordinate system; />Is the focal length of the camera in the camera coordinate system on the y-axis; />Coordinate values on an x-axis in a camera coordinate system for an optical center of the camera; />Representing the coordinate values of the optical center of the camera in the y-axis in the camera coordinate system.
Through the above process, the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system can be obtained; for the m-th target object, the three-dimensional position information respectively corresponding to its J key points represents the three-dimensional pose of the m-th target object.
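A minimal numpy sketch of the back projection in formula (2); the function and variable names are illustrative and not taken from the patent.

```python
import numpy as np

def back_project_keypoints(kp_2d, rel_depth, ref_abs_depth, K):
    """Back-project key points into the camera coordinate system.

    kp_2d: (J, 2) first two-dimensional positions (x, y) of the key points in the image.
    rel_depth: (J,) relative depths of the key points w.r.t. the reference node.
    ref_abs_depth: absolute depth of the reference node in the camera coordinate system.
    K: (3, 3) camera intrinsic matrix.
    """
    z = ref_abs_depth + rel_depth                                    # absolute depth per key point
    homo = np.concatenate([kp_2d, np.ones((kp_2d.shape[0], 1))], axis=1)  # (J, 3) homogeneous coords
    rays = (np.linalg.inv(K) @ homo.T).T                             # normalized camera rays
    return rays * z[:, None]                                         # (J, 3) positions (X, Y, Z)

# Example with a toy intrinsic matrix
K = np.array([[1500.0, 0.0, 512.0],
              [0.0, 1500.0, 384.0],
              [0.0, 0.0, 1.0]])
kp_2d = np.array([[500.0, 300.0], [520.0, 360.0]])
kp_3d = back_project_keypoints(kp_2d, np.array([0.0, 0.12]), 3.5, K)
```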
According to the embodiments of the present disclosure, the target area of the target object in the first image is identified, and based on the target area, the first two-dimensional position information in the first image of a plurality of key points characterizing the pose of the target object, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system are determined, so that the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system is obtained more accurately based on the first two-dimensional position information, the relative depths, and the absolute depth of the target object.
In another embodiment of the present disclosure, another image processing method is provided, where the image processing method is applied to a pre-trained neural network.
The neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are used to obtain the target region of the target object, the first two-dimensional position information and the relative depths of the target object, and the absolute depth, respectively.
The specific operation of the three branch networks may be shown in the above embodiments, and will not be described herein.
According to the embodiments of the present disclosure, an end-to-end target object pose detection framework is formed by the three branch networks, namely the target detection network, the key point detection network and the depth prediction network; the first image is processed based on this framework to obtain the three-dimensional position information, in the camera coordinate system, of the plurality of key points of each target object in the first image, with a higher processing speed and higher recognition accuracy.
Referring to fig. 6, the embodiment of the present disclosure further provides a specific example of a target object pose detection framework, including:
three network branches, namely a target detection network, a key point detection network and a depth prediction network;
the target detection network performs feature extraction on the first image to obtain a feature map of the first image; then, a plurality of target bounding boxes are determined by RoIAlign from a plurality of candidate bounding boxes generated in advance, according to the feature map; bounding box regression processing is performed on the plurality of target bounding boxes to obtain the target area corresponding to each target object; and the target feature map corresponding to each target area is transmitted to the key point detection network and the depth prediction network.
And the key point detection network is used for determining, based on the target feature map, first two-dimensional position information in the first image of a plurality of key points characterizing the pose of the target object, and the relative depth of each key point with respect to a reference node of the target object. The first two-dimensional position information and the relative depth of each key point in each target feature map together constitute the three-dimensional pose of the target object in that target feature map. The three-dimensional pose at this time is a three-dimensional pose with the target object itself as a reference.
And the depth prediction network is used for determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target feature map.
Finally, the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system is determined according to the first two-dimensional position information, the relative depths, and the absolute depth of the reference node of the target object. For each target object, the three-dimensional position information of the plurality of key points on the target object in the camera coordinate system constitutes the three-dimensional pose of the target object in the camera coordinate system. The three-dimensional pose at this time is a three-dimensional pose with the camera as a reference.
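As a high-level illustration of how the three branch networks in this framework fit together, the following sketch wires hypothetical detector, key point and depth modules around the back-projection helper from the earlier sketch; none of these names come from the patent.

```python
def detect_3d_poses(first_image, K, target_detector, keypoint_net, depth_net):
    """Hypothetical end-to-end flow of the pose detection framework described above."""
    poses = []
    # Target detection branch: target areas and their target feature maps
    target_areas, target_feature_maps = target_detector(first_image)
    for area, feature_map in zip(target_areas, target_feature_maps):
        # Key point detection branch: 2D key point positions and relative depths
        kp_2d, rel_depth = keypoint_net(feature_map)
        # Depth prediction branch: absolute depth of the reference node
        ref_abs_depth = depth_net(feature_map, first_image, K)
        # Combine via back projection (see back_project_keypoints above)
        poses.append(back_project_keypoints(kp_2d, rel_depth, ref_abs_depth, K))
    return poses
```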
Referring to fig. 7, the embodiment of the present disclosure further provides another specific example of a target object pose detection framework, including:
a target detection network, a key point detection network, and a depth prediction network;
the target detection network performs feature extraction on the first image to obtain a feature map of the first image; then, a plurality of target bounding boxes are determined by RoIAlign from a plurality of candidate bounding boxes generated in advance, according to the feature map; bounding box regression processing is performed on the plurality of target bounding boxes to obtain the target area corresponding to each target object; and the target feature map corresponding to each target area is transmitted to the key point detection network and the depth prediction network.
And the key point detection network is used for determining, based on the target feature map, first two-dimensional position information in the first image of a plurality of key points characterizing the pose of the target object, and the relative depth of each key point with respect to a reference node of the target object. The first two-dimensional position information and the relative depth of each key point in each target feature map together constitute the three-dimensional pose of the target object in that target feature map. The three-dimensional pose at this time is a three-dimensional pose with the target object itself as a reference.
The depth prediction network is used for acquiring an initial depth image based on the first image; determining second two-dimensional position information of a reference node corresponding to a target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image; performing at least one-level first convolution processing on a target feature map corresponding to a target object to obtain a feature vector of the target object; splicing the feature vector and the initial depth value of the reference node to form a spliced vector, and performing at least one-stage second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and adding the correction value to the initial depth value of the reference node to obtain the normalized absolute depth value of the reference node.
Then, the absolute depth value of the reference node is recovered through formula (1), and finally the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system is determined according to the first two-dimensional position information, the relative depths, and the absolute depth of the reference node of the target object. For each target object, the three-dimensional position information of the plurality of key points on the target object in the camera coordinate system constitutes the three-dimensional pose of the target object in the camera coordinate system. The three-dimensional pose at this time is a three-dimensional pose with the camera as a reference.
By means of either of the above two target object pose detection frameworks, the three-dimensional position information, in the camera coordinate system, of the plurality of key points of each target object in the first image can be obtained, with a higher processing speed and higher recognition accuracy.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Based on the same inventive concept, the embodiments of the present disclosure further provide an image processing apparatus corresponding to the image processing method, and since the principle of the apparatus in the embodiments of the present disclosure for solving the problem is similar to that of the image processing method described in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 8, a schematic diagram of an image processing apparatus according to an embodiment of the disclosure is provided, where the apparatus includes: an identification module 81, a first detection module 82 and a second detection module 83; wherein:
an identification module 81, configured to identify a target area of a target object in the first image;
a first detection module 82, configured to determine, based on a target region corresponding to the target object, first two-dimensional position information of a plurality of keypoints characterizing the pose of the target object in the first image, a relative depth of each of the keypoints with respect to a reference node of the target object, and an absolute depth of the reference node of the target object in a camera coordinate system, respectively;
and a second detection module 83, configured to determine three-dimensional position information of a plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
In a possible implementation manner, the identification module 81 is configured, when identifying a target area of a target object in the first image, to:
extracting features of the first image to obtain a feature map of the first image;
and determining, based on the feature map, a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance, and determining a target area corresponding to the target object based on the target bounding boxes.
In a possible implementation manner, the identification module 81 is configured, when determining a target area corresponding to the target object based on the target bounding boxes, to:
determining a feature sub-map corresponding to each target bounding box based on the plurality of target bounding boxes and the feature map;
and performing bounding box regression processing based on the feature sub-maps respectively corresponding to the target bounding boxes, to obtain the target area corresponding to the target object.
In a possible implementation manner, the first detection module 82 is configured, when determining an absolute depth of a reference node of the target object in a camera coordinate system based on a target area corresponding to the target object, to:
determining a target feature map corresponding to the target object based on a target area corresponding to the target object and the first image;
performing depth recognition processing based on a target feature map corresponding to the target object to obtain normalized absolute depth of a reference node of the target object;
and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In a possible implementation manner, the first detection module 82 is configured, when performing depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object, to:
acquiring an initial depth image based on the first image, wherein the pixel value of any first pixel point in the initial depth image represents an initial depth value, in the camera coordinate system, of the second pixel point at the corresponding position in the first image;
determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image;
and determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
In a possible implementation manner, the first detection module 82 is configured, when determining the normalized absolute depth of the reference node of the target object based on an initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object, to:
performing at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object;
splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one stage of second convolution processing on the spliced vector to obtain a correction value of the initial depth value;
and obtaining the normalized absolute depth based on the correction value of the initial depth value and the initial depth value.
In a possible implementation manner, the image processing apparatus is provided with a pre-trained neural network, and the neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are used to obtain the target region of the target object, the first two-dimensional position information and the relative depth of the target object, and the absolute depth, respectively.
According to the embodiments of the disclosure, the target area of the target object in the first image is identified, and, based on the target area, the first two-dimensional position information of a plurality of key points representing the pose of the target object in the first image, the relative depth of each key point relative to the reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system are determined, so that the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system is obtained more accurately based on the first two-dimensional position information, the relative depth and the absolute depth of the target object.
In addition, the three branch networks, namely the target detection network, the key point detection network and the depth prediction network, form an end-to-end target object pose detection framework. The first image is processed based on this framework to obtain the three-dimensional position information of the plurality of key points of each target object in the first image in the camera coordinate system, with a higher processing speed and higher recognition accuracy.
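As an illustration only, the three hypothetical branch sketches shown earlier in the description could be composed into such an end-to-end framework roughly as follows. This is a sketch under stated assumptions, not the patent's implementation: key point coordinates are assumed to already be mapped back to first-image pixels, key point 0 is assumed to be the reference node, and `denormalize` stands in for the patent's formula (1).

```python
# Hypothetical end-to-end composition of the three illustrative branch sketches.
import torch.nn as nn

class PoseDetectionFramework(nn.Module):
    def __init__(self, detection, keypoints, depth, denormalize, intrinsics):
        super().__init__()
        self.detection, self.keypoints, self.depth = detection, keypoints, depth
        self.denormalize = denormalize                     # stands in for formula (1)
        self.fx, self.fy, self.cx, self.cy = intrinsics    # internal reference matrix entries

    def forward(self, first_image, candidate_boxes):
        target_areas, target_feats = self.detection(first_image, candidate_boxes)
        kp_xy, rel_depth = self.keypoints(target_feats)            # [N, K, 2], [N, K]
        ref_xy = kp_xy[:, 0, :]                                    # assume key point 0 is the reference node
        norm_depth = self.depth(first_image, target_feats, ref_xy)
        abs_depth = self.denormalize(norm_depth)                   # absolute depth of each reference node, [N, 1]
        z = abs_depth + rel_depth                                  # absolute depth of every key point, [N, K]
        x = (kp_xy[..., 0] - self.cx) * z / self.fx                # pinhole back-projection
        y = (kp_xy[..., 1] - self.cy) * z / self.fy
        return x, y, z                                             # camera-coordinate pose of each target object
```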
The processing flow of each module in the apparatus and the interaction flow between the modules may refer to the related descriptions in the above method embodiments, and are not described in detail here.
The embodiment of the present disclosure further provides a computer device 10, as shown in fig. 9, which is a schematic structural diagram of the computer device 10 provided in the embodiment of the present disclosure, including:
a processor 11 and a memory 12; the memory 12 stores machine readable instructions executable by the processor 11 which, when the computer device is running, are executed by the processor to perform the steps of:
identifying a target region of a target object in the first image;
determining first two-dimensional position information of a plurality of key points representing the pose of the target object in the first image, a relative depth of each key point relative to a reference node of the target object, and an absolute depth of the reference node of the target object in a camera coordinate system based on a target area corresponding to the target object;
three-dimensional position information of a plurality of key points of the target object in the camera coordinate system is determined based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
The specific execution process of the above instruction may refer to the steps of the image processing method described in the embodiments of the present disclosure, which is not described herein.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the image processing method described in the method embodiments described above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
The computer program product of the image processing method provided in the embodiments of the present disclosure includes a computer readable storage medium storing a program code, where instructions included in the program code may be used to execute steps of the image processing method described in the above method embodiments, and specifically, reference may be made to the above method embodiments, which are not described herein.
The disclosed embodiments also provide a computer program which, when executed by a processor, implements any of the methods of the previous embodiments. The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the system and apparatus described above may refer to the corresponding procedures in the foregoing method embodiments, and are not described here again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, intended to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, any person skilled in the art may, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features thereof; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (9)
1. An image processing method, comprising:
identifying a target region of a target object in a first image;
determining first two-dimensional position information of a plurality of key points representing the pose of the target object in the first image, relative depth of each key point relative to a reference node of the target object and absolute depth of the reference node of the target object in a camera coordinate system based on a target area corresponding to the target object;
determining a three-dimensional pose of the target object relative to the reference node based on the first two-dimensional position information and the relative depth of the target object, and determining three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the three-dimensional pose, the absolute depth and an internal reference matrix of a camera for back projection;
the identifying a target region of a target object in the first image includes:
extracting features of the first image to obtain a feature map of the first image;
and determining, based on the feature map, a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance, and determining a target area corresponding to the target object based on the target bounding boxes.
2. The image processing method according to claim 1, wherein the determining, based on the target bounding box, a target area corresponding to the target object includes:
determining a feature sub-map corresponding to each target bounding box based on the plurality of target bounding boxes and the feature map;
and performing bounding box regression processing based on the feature sub-maps respectively corresponding to the target bounding boxes, to obtain the target area corresponding to the target object.
3. The image processing method according to claim 1 or 2, wherein determining an absolute depth of a reference node of the target object in a camera coordinate system based on a target region to which the target object corresponds, comprises:
determining a target feature map corresponding to the target object based on a target area corresponding to the target object and the first image;
performing depth recognition processing based on a target feature map corresponding to the target object to obtain normalized absolute depth of a reference node of the target object;
and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
4. The image processing method according to claim 3, wherein the performing depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object includes:
acquiring an initial depth image based on the first image, wherein the pixel value of any first pixel point in the initial depth image represents an initial depth value, in the camera coordinate system, of the second pixel point at the corresponding position in the first image;
determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image;
and determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
5. The image processing method according to claim 4, wherein the determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object includes:
performing at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object;
splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one stage of second convolution processing on the spliced vector to obtain a correction value of the initial depth value;
and obtaining the normalized absolute depth based on the correction value of the initial depth value and the initial depth value.
6. The image processing method according to claim 1 or 2, wherein the image processing method is applied to a pre-trained neural network, the neural network comprising three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are used to obtain the target region of the target object, the first two-dimensional position information and the relative depth of the target object, and the absolute depth, respectively.
7. An image processing apparatus, comprising:
the identification module is used for identifying a target area of a target object in the first image;
the first detection module is used for determining first two-dimensional position information of a plurality of key points representing the pose of the target object in the first image, relative depth of each key point relative to a reference node of the target object and absolute depth of the reference node of the target object in a camera coordinate system based on a target area corresponding to the target object;
the second detection module is used for determining the three-dimensional pose of the target object relative to the reference node based on the first two-dimensional position information and the relative depth of the target object, and determining three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the three-dimensional pose, the absolute depth and an internal reference matrix of a camera for back projection;
wherein the identification module, when identifying the target area of the target object in the first image, is used for:
extracting features of the first image to obtain a feature map of the first image;
and determining, based on the feature map, a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance, and determining a target area corresponding to the target object based on the target bounding boxes.
8. A computer device, comprising: a processor and a memory connected to each other, said memory storing machine-readable instructions executable by said processor, said machine-readable instructions being executed by said processor when the computer device is running, to implement the steps of the image processing method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the image processing method according to any of claims 1 to 6.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010403620.5A CN111582207B (en) | 2020-05-13 | 2020-05-13 | Image processing method, device, electronic equipment and storage medium |
PCT/CN2021/084625 WO2021227694A1 (en) | 2020-05-13 | 2021-03-31 | Image processing method and apparatus, electronic device, and storage medium |
TW110115664A TWI777538B (en) | 2020-05-13 | 2021-04-29 | Image processing method, electronic device and computer-readable storage media |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010403620.5A CN111582207B (en) | 2020-05-13 | 2020-05-13 | Image processing method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111582207A CN111582207A (en) | 2020-08-25 |
CN111582207B true CN111582207B (en) | 2023-08-15 |
Family
ID=72110786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010403620.5A Active CN111582207B (en) | 2020-05-13 | 2020-05-13 | Image processing method, device, electronic equipment and storage medium |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN111582207B (en) |
TW (1) | TWI777538B (en) |
WO (1) | WO2021227694A1 (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582207B (en) * | 2020-05-13 | 2023-08-15 | 北京市商汤科技开发有限公司 | Image processing method, device, electronic equipment and storage medium |
GB202009515D0 (en) | 2020-06-22 | 2020-08-05 | Ariel Ai Ltd | 3D object model reconstruction from 2D images |
GB2598452B (en) * | 2020-06-22 | 2024-01-10 | Snap Inc | 3D object model reconstruction from 2D images |
EP3965071A3 (en) * | 2020-09-08 | 2022-06-01 | Samsung Electronics Co., Ltd. | Method and apparatus for pose identification |
CN112163480B (en) * | 2020-09-16 | 2022-09-13 | 北京邮电大学 | Behavior identification method and device |
CN112528831B (en) * | 2020-12-07 | 2023-11-24 | 深圳市优必选科技股份有限公司 | Multi-target attitude estimation method, multi-target attitude estimation device and terminal equipment |
CN114764860A (en) * | 2021-01-14 | 2022-07-19 | 北京图森智途科技有限公司 | Feature extraction method and device, computer equipment and storage medium |
CN112907517B (en) * | 2021-01-28 | 2024-07-19 | 上海商汤善萃医疗科技有限公司 | Image processing method, device, computer equipment and storage medium |
CN113344998B (en) * | 2021-06-25 | 2022-04-29 | 北京市商汤科技开发有限公司 | Depth detection method and device, computer equipment and storage medium |
CN113470112A (en) * | 2021-06-30 | 2021-10-01 | Oppo广东移动通信有限公司 | Image processing method, image processing device, storage medium and terminal |
CN113743234A (en) * | 2021-08-11 | 2021-12-03 | 浙江大华技术股份有限公司 | Target action determining method, target action counting method and electronic device |
CN113610966A (en) * | 2021-08-13 | 2021-11-05 | 北京市商汤科技开发有限公司 | Three-dimensional attitude adjustment method and device, electronic equipment and storage medium |
CN113610967B (en) * | 2021-08-13 | 2024-03-26 | 北京市商汤科技开发有限公司 | Three-dimensional point detection method, three-dimensional point detection device, electronic equipment and storage medium |
CN114140407B (en) * | 2021-11-23 | 2024-11-08 | 武汉理工大学 | Method and system for identifying and reconstructing magnetic marks of steel die forging |
CN114354618A (en) * | 2021-12-16 | 2022-04-15 | 浙江大华技术股份有限公司 | Method and device for detecting welding seam |
CN114782547B (en) * | 2022-04-13 | 2024-08-20 | 北京爱笔科技有限公司 | Three-dimensional coordinate determination method and device |
CN115063789B (en) * | 2022-05-24 | 2023-08-04 | 中国科学院自动化研究所 | 3D target detection method and device based on key point matching |
WO2023236008A1 (en) * | 2022-06-06 | 2023-12-14 | Intel Corporation | Methods and apparatus for small object detection in images and videos |
CN114972958B (en) * | 2022-07-27 | 2022-10-04 | 北京百度网讯科技有限公司 | Key point detection method, neural network training method, device and equipment |
CN115018918B (en) * | 2022-08-04 | 2022-11-04 | 南昌虚拟现实研究院股份有限公司 | Three-dimensional coordinate determination method and device, electronic equipment and storage medium |
CN116386016B (en) * | 2023-05-22 | 2023-10-10 | 杭州睿影科技有限公司 | Foreign matter treatment method and device, electronic equipment and storage medium |
CN117315402A (en) * | 2023-11-02 | 2023-12-29 | 北京百度网讯科技有限公司 | Training method of three-dimensional object detection model and three-dimensional object detection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107871134A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | A kind of method for detecting human face and device |
CN108460338A (en) * | 2018-02-02 | 2018-08-28 | 北京市商汤科技开发有限公司 | Estimation method of human posture and device, electronic equipment, storage medium, program |
CN110378308A (en) * | 2019-07-25 | 2019-10-25 | 电子科技大学 | The improved harbour SAR image offshore Ship Detection based on Faster R-CNN |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101681538B1 (en) * | 2010-10-20 | 2016-12-01 | 삼성전자주식회사 | Image processing apparatus and method |
CN106874826A (en) * | 2015-12-11 | 2017-06-20 | 腾讯科技(深圳)有限公司 | Face key point-tracking method and device |
CN109472753B (en) * | 2018-10-30 | 2021-09-07 | 北京市商汤科技开发有限公司 | Image processing method and device, computer equipment and computer storage medium |
CN111582207B (en) * | 2020-05-13 | 2023-08-15 | 北京市商汤科技开发有限公司 | Image processing method, device, electronic equipment and storage medium |
2020
- 2020-05-13 CN CN202010403620.5A patent/CN111582207B/en active Active
2021
- 2021-03-31 WO PCT/CN2021/084625 patent/WO2021227694A1/en active Application Filing
- 2021-04-29 TW TW110115664A patent/TWI777538B/en active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107871134A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | A kind of method for detecting human face and device |
CN108460338A (en) * | 2018-02-02 | 2018-08-28 | 北京市商汤科技开发有限公司 | Estimation method of human posture and device, electronic equipment, storage medium, program |
CN110378308A (en) * | 2019-07-25 | 2019-10-25 | 电子科技大学 | The improved harbour SAR image offshore Ship Detection based on Faster R-CNN |
Also Published As
Publication number | Publication date |
---|---|
TW202143100A (en) | 2021-11-16 |
TWI777538B (en) | 2022-09-11 |
WO2021227694A1 (en) | 2021-11-18 |
CN111582207A (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111582207B (en) | Image processing method, device, electronic equipment and storage medium | |
CN107808407B (en) | Binocular camera-based unmanned aerial vehicle vision SLAM method, unmanned aerial vehicle and storage medium | |
US10334168B2 (en) | Threshold determination in a RANSAC algorithm | |
CN110648397B (en) | Scene map generation method and device, storage medium and electronic equipment | |
JP4985516B2 (en) | Information processing apparatus, information processing method, and computer program | |
US8903161B2 (en) | Apparatus for estimating robot position and method thereof | |
JP2012022411A (en) | Information processing apparatus and control method thereof, and program | |
KR20160003776A (en) | Posture estimation method and robot | |
CN110782483A (en) | Multi-view multi-target tracking method and system based on distributed camera network | |
CN111582204A (en) | Attitude detection method and apparatus, computer device and storage medium | |
JP5833507B2 (en) | Image processing device | |
CN108428224B (en) | Animal body surface temperature detection method and device based on convolutional neural network | |
Liang et al. | Image-based positioning of mobile devices in indoor environments | |
CN113052907B (en) | Positioning method of mobile robot in dynamic environment | |
JP5494427B2 (en) | Image processing program and image processing apparatus | |
JP2017117386A (en) | Self-motion estimation system, control method and program of self-motion estimation system | |
CN110111364B (en) | Motion detection method and device, electronic equipment and storage medium | |
Liang et al. | Reduced-complexity data acquisition system for image-based localization in indoor environments | |
US11145072B2 (en) | Methods, devices and computer program products for 3D mapping and pose estimation of 3D images | |
CN111383264A (en) | Positioning method, positioning device, terminal and computer storage medium | |
KR20220093492A (en) | System and method for establishing structural exterior map using image stitching | |
Luong et al. | Consistent ICP for the registration of sparse and inhomogeneous point clouds | |
CN113610969A (en) | Three-dimensional human body model generation method and device, electronic equipment and storage medium | |
KR20090115738A (en) | Information extracting method, registering device, collating device and program | |
CN117274392A (en) | Camera internal parameter calibration method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40026189; Country of ref document: HK |
GR01 | Patent grant | ||