CN110443228B - Pedestrian matching method and device, electronic equipment and storage medium - Google Patents
Pedestrian matching method and device, electronic equipment and storage medium
- Publication number
- CN110443228B (grant) · CN201910771286A (application CN201910771286.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- position information
- pedestrian
- mapping
- acquisition device
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 69
- 238000013507 mapping Methods 0.000 claims abstract description 101
- 238000001514 detection method Methods 0.000 claims abstract description 28
- 230000009466 transformation Effects 0.000 claims description 58
- 238000013527 convolutional neural network Methods 0.000 claims description 8
- 238000003062 neural network model Methods 0.000 claims description 7
- 238000004422 calculation algorithm Methods 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 6
- 238000012545 processing Methods 0.000 claims description 6
- 238000010586 diagram Methods 0.000 description 14
- 230000008569 process Effects 0.000 description 7
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000004590 computer program Methods 0.000 description 4
- 230000006870 function Effects 0.000 description 4
- 239000013598 vector Substances 0.000 description 4
- 230000009471 action Effects 0.000 description 3
- 238000011176 pooling Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 238000011426 transformation method Methods 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 241001465754 Metazoa Species 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000007500 overflow downdraw method Methods 0.000 description 1
- 238000009877 rendering Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
- Traffic Control Systems (AREA)
Abstract
The embodiment of the application provides a pedestrian matching method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time; respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image; acquiring depth information of the first pedestrian, and constructing a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information; mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information; and carrying out pedestrian matching according to the first mapping position information and the second position information. The device is configured to execute the method. The embodiment of the application improves the accuracy of pedestrian matching.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a pedestrian matching method, apparatus, electronic device, and storage medium.
Background
The rise of artificial intelligence has drawn attention from all walks of life. In the field of business intelligence, a single camera often cannot capture sufficient information about a person, which makes it difficult to recognize people in the pictures it takes. Relying on multiple cameras makes it possible to capture pictures from multiple angles, but establishing a one-to-one correspondence between the person in the picture taken by each camera and the persons in the pictures taken by the other cameras becomes a difficulty.
The current approach is to extract features from the images acquired by each camera separately and then fuse and compare those features, thereby achieving pedestrian matching. However, this approach can produce inaccurate matches because of differences in camera angle or errors in feature recognition.
Disclosure of Invention
In view of the above, an object of the embodiments of the present application is to provide a method, an apparatus, an electronic device and a storage medium for pedestrian matching, so as to solve the technical problem of low accuracy of pedestrian matching.
In a first aspect, an embodiment of the present application provides a pedestrian matching method, including:
acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time;
respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image;
acquiring depth information of the first pedestrian, and constructing a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information;
mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information;
and carrying out pedestrian matching according to the first mapping position information and the second position information.
In the embodiment of the application, a three-dimensional space is constructed for the first pedestrian in the first image, so that the first three-dimensional position information of the first pedestrian in the three-dimensional space is obtained; the first three-dimensional position information is then mapped onto the plane corresponding to the second image to obtain the mapping position information, and pedestrian matching is carried out according to the mapping position information and the second position information, which improves the accuracy of pedestrian matching.
Further, performing pedestrian matching according to the first mapping position information and the second mapping position information, including:
obtaining a calibrated affine transformation coefficient, and carrying out affine transformation on the first mapping position information according to the affine transformation coefficient to obtain transformed position information;
and carrying out pedestrian matching on the transformed position information and the second position information based on the shortest distance principle.
According to the embodiment of the application, the position information of the first pedestrian in the second image is determined by an affine transformation method, which removes the influence of offsets in the positions and acquisition angles of the image acquisition devices and of field-of-view distortion, and further improves the accuracy of pedestrian matching.
Further, the performing pedestrian matching on the transformed position information and the second position information based on the shortest distance principle includes:
and calculating the distance between the transformed position information and each second position information, and pairing the second pedestrian corresponding to the second position information with the shortest distance with the first pedestrian corresponding to the transformed position information.
The embodiment of the application pairs the first pedestrian with the second pedestrian whose position information is closest to the transformed position information, so that a more accurate matching result can be obtained.
Further, the first image acquisition device is a monocular image acquisition device or a depth image acquisition device, and when the first image acquisition device is the monocular image acquisition device, the acquiring of the depth information of the first pedestrian includes:
processing the first image by using a preset depth three-dimensional neural network model to obtain a corresponding third image;
and obtaining the depth information according to the parallax of the first image and the third image.
According to the embodiment of the application, the third image is obtained through the depth three-dimensional neural network model and the depth information is obtained from the parallax, so that depth information can be obtained with a monocular camera rather than a more expensive depth camera, which saves resources.
Further, the mapping the first three-dimensional position information on a plane corresponding to the second image includes:
and mapping the first three-dimensional position information on a plane corresponding to the second image according to the relative position relationship between the first image acquisition device and the second image acquisition device. The first three-dimensional position information can be accurately mapped to the plane corresponding to the second image through the relative position relationship between the first image acquisition device and the second image acquisition device.
Further, the pedestrian detection method includes:
acquiring a candidate frame of an image to be detected by adopting a candidate region generation algorithm;
inputting the image to be detected with the candidate frame into a convolutional neural network for feature extraction to obtain a plurality of features of the image corresponding to the candidate frame;
classifying the plurality of features to obtain object classes, and performing regression on the plurality of features to obtain candidate frame offsets;
and when the object type is a pedestrian, acquiring the position information of the pedestrian from the image to be detected according to the offset of the candidate frame.
According to the pedestrian detection method, the pedestrian can be rapidly detected, and therefore the pedestrian matching efficiency is improved.
Further, the affine transformation coefficient is obtained by:
acquiring a plurality of groups of image pairs, wherein each group of image pairs comprises a fourth image and a fifth image; the fourth image is acquired by the first image acquisition device, and the fifth image is acquired by the second image acquisition device at the same time;
taking a fixture in the fourth image as a first calibration object and a fixture in the fifth image as a second calibration object, acquiring second three-dimensional position information of the first calibration object, and mapping the second three-dimensional position information of the first calibration object onto a plane corresponding to the fifth image to acquire second mapping position information;
an affine transformation equation is constructed according to the position relation of the first image acquisition device and the second image acquisition device;
and obtaining an affine transformation coefficient corresponding to the affine transformation equation by adopting a least square method according to the second mapping position information of each fourth image and the position information of the second calibration object of the corresponding fifth image.
The embodiment of the application obtains the affine transformation coefficient in the above manner so as to perform subsequent affine transformation, obtain more accurate transformed position information and improve the accuracy of pedestrian matching.
In a second aspect, an embodiment of the present application provides a pedestrian matching apparatus, including:
the image acquisition module is used for acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time;
the pedestrian detection module is used for respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image;
the three-dimensional space establishing module is used for acquiring the depth information of the first pedestrian, establishing a three-dimensional space according to the first position information and the depth information, and acquiring first three-dimensional position information;
the mapping module is used for mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information;
and the pedestrian matching module is used for matching pedestrians according to the first mapping position information and the second position information.
Further, the pedestrian matching module is specifically configured to:
obtaining a calibrated affine transformation coefficient, and carrying out affine transformation on the first mapping position information according to the affine transformation coefficient to obtain transformed position information;
and carrying out pedestrian matching on the transformed position information and the second position information based on the shortest distance principle.
Further, the pedestrian matching module is specifically configured to:
and calculating the distance between the transformed position information and each second position information, and pairing the second pedestrian corresponding to the second position information with the shortest distance with the first pedestrian corresponding to the transformed position information.
Further, the first image acquisition device is a monocular image acquisition device or a depth image acquisition device, and when the first image acquisition device is the monocular image acquisition device, the three-dimensional space establishing module is specifically configured to:
processing the first image by using a preset depth three-dimensional neural network model to obtain a corresponding third image;
and obtaining the depth information according to the parallax of the first image and the third image.
Further, the mapping module is specifically configured to:
and mapping the first three-dimensional position information on a plane corresponding to the second image according to the relative position relationship between the first image acquisition device and the second image acquisition device.
Further, the pedestrian detection module is specifically configured to:
acquiring a candidate frame of an image to be detected by adopting a candidate region generation algorithm;
inputting the image to be detected with the candidate frame into a convolutional neural network for feature extraction to obtain a plurality of features of the image corresponding to the candidate frame;
classifying the plurality of features to obtain object classes, and performing regression on the plurality of features to obtain candidate frame offsets;
and when the object type is a pedestrian, acquiring the position information of the pedestrian from the image to be detected according to the offset of the candidate frame.
Further, the affine transformation coefficient is obtained by:
acquiring a plurality of groups of image pairs, wherein each group of image pairs comprises a fourth image and a fifth image; the fourth image is acquired by the first image acquisition device, and the fifth image is acquired by the second image acquisition device at the same time;
taking a fixture in the fourth image as a first calibration object and a fixture in the fifth image as a second calibration object, acquiring second three-dimensional position information of the first calibration object, and mapping the second three-dimensional position information of the first calibration object onto a plane corresponding to the fifth image to acquire second mapping position information;
an affine transformation equation is constructed according to the position relation of the first image acquisition device and the second image acquisition device;
and obtaining an affine transformation coefficient corresponding to the affine transformation equation by adopting a least square method according to the second mapping position information of each fourth image and the position information of the second calibration object of the corresponding fifth image.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor being capable of performing the method steps of the first aspect when invoked by the program instructions.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium, including:
the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform the method steps of the first aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flow chart of a pedestrian matching method according to an embodiment of the present disclosure;
FIG. 2 is a schematic view of a scene and an image provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a first image provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a second image provided by an embodiment of the present application;
fig. 5 is a flowchart of an affine transformation coefficient calibration method provided in the embodiment of the present application;
fig. 6 is a schematic structural diagram of a deep three-dimensional neural network module according to an embodiment of the present disclosure;
FIG. 7 is a flow chart illustrating a method for pedestrian detection according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a pedestrian matching device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Before the present application, in order to acquire more information about a certain scene, a plurality of image acquisition devices could be used to image the scene; the devices may be placed in different directions around the scene, but their cameras all face the scene and can acquire images of it. For example, suppose there are two image acquisition devices and a pedestrian standing in the scene: one device, placed in front of the pedestrian, captures the pedestrian's front, while the other, placed above, captures the top of the pedestrian's head. If pedestrian matching is carried out by feature fusion, then, because most pedestrians are moving, the features extracted from the images acquired by the two devices differ greatly, so the probability of error in the matching process is high. According to the pedestrian matching method of the present application, the first three-dimensional position information is obtained by constructing a three-dimensional space for the pedestrian in the first image, the first three-dimensional position information is then mapped to the plane corresponding to the second image to obtain the mapping position information, and pedestrian matching is carried out according to the mapping position information and the second position information, so that the matching can be performed more accurately.
Fig. 1 is a schematic flow chart of a pedestrian matching method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
step 101: the method comprises the steps of acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time.
The first image acquisition device and the second image acquisition device capture the same scene, and may do so either by recording video or by taking photographs. When both devices record video of the scene at the same time, the first image acquisition device can take a certain frame as the first image, the second image acquisition device can take the frame corresponding to the first image as the second image, and the matching device then acquires the first image and the second image. Alternatively, the two devices can each send their captured video directly to the matching device, and the matching device selects the first image and the second image from the videos. In either case, the first image and the second image are images of the same scene acquired at the same time. It should be noted that, in the embodiment of the present application, "the same time" means that the acquisition times of the first image and the second image are the same or very close; the two acquisition times are not required to be absolutely identical, and a small time difference, for example 0.1 second, 1 second or 2 seconds, is allowed and may be adjusted according to the actual situation.
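As a minimal illustration of selecting time-aligned frames from two recorded videos, the following sketch uses OpenCV; the video paths, the chosen timestamp and the seek-by-timestamp approach are illustrative assumptions rather than details fixed by this embodiment.

```python
import cv2

def frames_at_time(first_video_path, second_video_path, t_seconds):
    """Return the frame of each video closest to time t_seconds (first image, second image)."""
    pair = []
    for path in (first_video_path, second_video_path):
        cap = cv2.VideoCapture(path)
        cap.set(cv2.CAP_PROP_POS_MSEC, t_seconds * 1000.0)  # seek to the requested timestamp
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError(f"could not read a frame at t={t_seconds}s from {path}")
        pair.append(frame)
    return pair  # [first_image, second_image]

# Hypothetical usage with assumed file names:
# first_image, second_image = frames_at_time("camera_front.mp4", "camera_top.mp4", t_seconds=12.0)
```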
Furthermore, "the same scene" means that the first image and the second image both contain a target scene of interest. The target scene may be the scene at a target geographical location, such as a certain shopping mall, or a scene that contains a target pedestrian. Therefore, in the embodiment of the present application, "the same scene" does not require the first image and the second image to be identical, and a certain positional offset is allowed between them, as shown in fig. 2. Fig. 2 includes the first image and the second image, where the first image is filled with parallel lines inclined to the left, the second image is filled with parallel lines inclined to the right, and the overlapping part of the two images is filled with squares. As can be seen from fig. 2, the first image and the second image do not capture exactly the same scene, but because the deviation is small they can be treated approximately as the same scene.
Therefore, before pedestrian matching is carried out, the deviation between the scenes captured by the first image acquisition device and the second image acquisition device can be calibrated in advance, and the deviation correction is applied in the matching device.
Step 102: and respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image.
Illustratively, after acquiring the first image and the second image, the matching device performs pedestrian detection on both images, marking a first pedestrian in the first image and a second pedestrian in the second image. There may be one or more first pedestrians in the first image, and likewise one or more second pedestrians in the second image. Pedestrian detection uses computer vision technology to judge whether a pedestrian exists in an image or video sequence and to give an accurate position. Various pedestrian detection methods may be used, for example: methods based on global features, methods based on human body parts, and methods based on stereo vision. The method based on global features is currently the mainstream pedestrian detection method; it describes pedestrians with various static image features such as edge features, shape features, statistical features or transformation features. The basic idea of the method based on human body parts is to divide the human body into several components, detect each component in the image separately, integrate the detection results according to certain constraint relations, and finally judge whether a pedestrian exists.
It should be noted that the present application is mainly used for detecting pedestrians in images, because a pedestrian moves within a scene and may be in a different position from one second to the next. Of course, the embodiment of the present application may also be used for matching other objects, for example an animal in the scene or a certain furnishing.
Step 103: and acquiring the depth information of the first pedestrian, and constructing a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information.
For example, the matching device may obtain the depth information of the first pedestrian from the first image. Image depth refers to the distance between an object in the image and the image acquisition device, i.e. the depth at which the object sits in the captured picture. Image depth estimation may use a monocular depth estimation method or a binocular depth estimation method, where monocular estimation is based on one lens and binocular estimation is based on two lenses. Obtaining depth information from an image captured by a monocular camera is more complicated than with a binocular camera, but it is still possible. For a binocular camera, the principle of obtaining depth information is as follows: two cameras image the same scene, and because there is a certain distance between them, the images formed by the two lenses differ slightly, i.e. there is parallax. Because of this disparity information, the general depth of the scene can be estimated from it. Obtaining depth information from images captured by a monocular camera is explained in detail in the following embodiments. It should be noted that the first image acquisition device and the second image acquisition device may both be monocular cameras, may both be binocular cameras, may be one monocular camera and one binocular camera, or may be other cameras, such as RGB-D cameras.
Because the two-dimensional first position information of the first pedestrian on the plane corresponding to the first image is already available, once the depth information of the first pedestrian in the first image is obtained, a three-dimensional space can be constructed for the first image from the first position information and the depth information, thereby obtaining the first three-dimensional position information of the first pedestrian.
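A small sketch of these two steps, assuming a pinhole stereo model in which depth = focal length × baseline / disparity; the focal length, baseline and pixel coordinates below are made-up illustration values, not parameters fixed by this embodiment.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Pinhole stereo relation: depth = f * B / d (undefined for zero or negative disparity)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

def build_3d_position(first_position, depth):
    """Append the depth to the 2D first position (x1, y1) to get (x1, y1, d1)."""
    x1, y1 = first_position
    return (x1, y1, depth)

# Example with assumed parameters: 700 px focal length, 12 cm baseline, 35 px disparity.
d1 = depth_from_disparity(focal_length_px=700.0, baseline_m=0.12, disparity_px=35.0)
first_3d_position = build_3d_position((420.0, 310.0), d1)   # (x1, y1, d1)
```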
Step 104: and mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information.
Step 105: and carrying out pedestrian matching according to the first mapping position information and the second position information.
Illustratively, after the matching device obtains the first mapping position information, pedestrian matching is performed according to the first mapping position information and the second position information, to determine which second pedestrian in the second image is the same person as the first pedestrian corresponding to the first mapping position information.
It should be noted that when there are multiple first pedestrians and multiple second pedestrians, pedestrian matching may be performed one pair at a time, as sketched below. For example: first, the distances between all first pedestrians in the first image and all second pedestrians in the second image are calculated; the first pedestrian and second pedestrian with the minimum distance are paired and removed from further consideration; the minimum distance among the remaining pedestrians is then selected for the next pairing, and so on until all pairings are completed. When the numbers of pedestrians in the two images (i.e. the first image and the second image) are not equal, the matching proceeds according to the same steps until the pedestrians in one of the images are exhausted.
After pairing is completed, the paired first pedestrian and second pedestrian can be labelled so that they are not matched again in subsequent matching.
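The greedy pairing described above could look roughly as follows; the data structures and the Euclidean distance are illustrative choices, not requirements of this embodiment.

```python
import math

def greedy_match(transformed_positions, second_positions):
    """Repeatedly pair the globally closest (first pedestrian, second pedestrian),
    remove both from the candidate pools, and stop when one pool is empty."""
    pairs = []
    unmatched_first = dict(enumerate(transformed_positions))
    unmatched_second = dict(enumerate(second_positions))
    while unmatched_first and unmatched_second:
        i, j = min(
            ((i, j) for i in unmatched_first for j in unmatched_second),
            key=lambda ij: math.dist(unmatched_first[ij[0]], unmatched_second[ij[1]]),
        )
        pairs.append((i, j))        # label this pair as matched
        del unmatched_first[i]      # excluded from later rounds
        del unmatched_second[j]
    return pairs
```

Removing each matched pair from the pools corresponds to the labelling step above, so already-paired pedestrians are not matched again.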
It should also be noted that the embodiments of the present application are also applicable to pedestrian matching of a plurality of image capturing devices, for example, there is also a third image captured by a third image capturing device, after matching the first image with the second image, matching the second image with the third image, and matching the first image with the third image. Alternatively, the second image and the third image may be matched with the first image, respectively, with reference to the first image.
In the embodiment of the application, a three-dimensional space is constructed for the first pedestrian in the first image, so that the first three-dimensional position information of the first pedestrian in the three-dimensional space is obtained; the first three-dimensional position information is then mapped onto the plane corresponding to the second image to obtain the mapping position information, and pedestrian matching is carried out according to the mapping position information and the second position information, which improves the accuracy of pedestrian matching.
On the basis of the above embodiment, performing pedestrian matching according to the first mapping position information and the second position information includes:
obtaining a calibrated affine transformation coefficient, and carrying out affine transformation on the first mapping position information according to the affine transformation coefficient to obtain transformed position information;
and carrying out pedestrian matching on the transformed position information and the second position information based on the shortest distance principle.
Illustratively, after the matching device obtains the first mapping position information, a pre-calibrated affine transformation coefficient is obtained. It should be noted that the affine transformation coefficient is related to parameters such as the positions and capture angles of the first image acquisition device and the second image acquisition device; if any of these parameters changes, the affine transformation coefficient needs to be re-determined. Because of the many interfering and uncertain factors, the first mapping position information obtained by directly mapping the first three-dimensional position information onto the plane corresponding to the second image may not be accurate enough, so the first mapping position information can be affine-transformed with the pre-calibrated affine transformation coefficient to obtain the transformed position information.
Affine transformation, also called affine mapping, means that in geometry one vector space is subjected to a linear transformation followed by a translation, transforming it into another vector space. Briefly, a geometric figure on one plane is transformed into another figure by scaling, rotation and translation. The first mapping position information is affine-transformed to obtain the transformed position on the observation plane of the second image acquisition device.
Assuming that the first image acquisition device is placed in front of the pedestrian and the second image acquisition device is placed above the pedestrian, fig. 3 is a schematic diagram of a first image provided by the embodiment of the present application, and fig. 4 is a schematic diagram of a second image provided by the embodiment of the present application. Both the first position information of the first pedestrian and the second position information of the second pedestrian can be represented by coordinates, for example the first position information is (x1, y1) and the second position information is (x2, y2). After a three-dimensional space is constructed for the first pedestrian in the first image, the first three-dimensional position information is obtained as (x1, y1, d1), where d1 is the depth information of the first pedestrian. From the relative position relationship between the first image acquisition device and the second image acquisition device, x1 in the first three-dimensional position information corresponds to x2 in the second image, and d1 corresponds to y2. After the first three-dimensional position information (x1, y1, d1) is mapped onto the plane corresponding to the second image, the obtained first mapping position information is (x1, d1). Affine transformation is then applied to the first mapping position information with the affine transformation coefficients, written as f(x1, d1) = (x1', y1'), which yields the transformed position information (x1', y1').
After the transformed position information is obtained, the distance between the transformed position information and each second pedestrian in the second image is calculated, and the second pedestrian whose second position information is closest to the transformed position information (x1', y1') is paired with the first pedestrian.
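A sketch of this transformation and nearest-neighbour pairing, assuming the six affine coefficients have already been calibrated; NumPy is used here only for convenience and the function names are illustrative.

```python
import numpy as np

def affine_transform(first_mapping_position, coefficients):
    """Apply f(x1, d1) = (x1', y1') with coefficients (a1, b1, c1, a2, b2, c2)."""
    a1, b1, c1, a2, b2, c2 = coefficients
    x1, d1 = first_mapping_position
    return (a1 * x1 + b1 * d1 + c1,
            a2 * x1 + b2 * d1 + c2)

def nearest_second_pedestrian(transformed_position, second_positions):
    """Index of the second pedestrian whose second position information is closest."""
    diffs = np.asarray(second_positions, dtype=float) - np.asarray(transformed_position, dtype=float)
    return int(np.argmin(np.linalg.norm(diffs, axis=1)))
```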
According to the embodiment of the application, the position information of the first pedestrian in the second image is determined by an affine transformation method, which removes the influence of offsets in the positions and acquisition angles of the image acquisition devices and of field-of-view distortion, and further improves the accuracy of pedestrian matching.
Further, fig. 5 is a flowchart of an affine transformation coefficient calibration method provided in the embodiment of the present application, as shown in fig. 5:
step 401: acquiring a plurality of groups of image pairs, wherein each group of image pairs comprises a fourth image and a fifth image; the fourth image is acquired by the first image acquisition device, and the fifth image is acquired by the second image acquisition device at the same time.
Step 402: take the stationary object in the fourth image as a first calibration object and the stationary object in the fifth image as a second calibration object, acquire second three-dimensional position information of the first calibration object, and map the second three-dimensional position information of the first calibration object onto the plane corresponding to the fifth image to obtain second mapping position information. It should be noted that the stationary object may be a building in the scene, or a person standing still in the scene who is captured simultaneously by the first image acquisition device and the second image acquisition device. The method for obtaining the second mapping position information is the same as the method for obtaining the first mapping position information and is not described here again.
Step 403: and establishing an affine transformation equation according to the position relation of the first image acquisition device and the second image acquisition device.
Exemplarily, suppose the second mapping position information is (x4, y4) and the position information of the second calibration object is (x5, y5). The affine transformation equation constructed may then be:
x5 = a1·x4 + b1·y4 + c1
y5 = a2·x4 + b2·y4 + c2
where a1, b1, c1, a2, b2 and c2 are the affine transformation coefficients to be solved.
Step 404: and obtaining an affine transformation coefficient corresponding to the affine transformation equation by adopting a least square method according to the second mapping position information of each fourth image and the position information of the second calibration object of the corresponding fifth image.
For example, multiple sets of data may be obtained from the multiple fourth images and the multiple fifth images, the multiple sets of data are substituted into the affine transformation equation, and the least square method is adopted to determine the affine transformation coefficients corresponding to the affine transformation equation.
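As a rough sketch of this step, assuming NumPy and the affine form written in step 403; the variable names are illustrative only.

```python
import numpy as np

def fit_affine_coefficients(mapped_points, calibration_points):
    """Least-squares estimate of (a1, b1, c1, a2, b2, c2).

    mapped_points:      N x 2 array of second mapping positions (x4, y4)
    calibration_points: N x 2 array of second calibration object positions (x5, y5)
    """
    mapped_points = np.asarray(mapped_points, dtype=float)
    calibration_points = np.asarray(calibration_points, dtype=float)
    # Design matrix [x4, y4, 1]; the two rows of the affine map are fit independently.
    design = np.column_stack([mapped_points, np.ones(len(mapped_points))])
    (a1, b1, c1), *_ = np.linalg.lstsq(design, calibration_points[:, 0], rcond=None)
    (a2, b2, c2), *_ = np.linalg.lstsq(design, calibration_points[:, 1], rcond=None)
    return a1, b1, c1, a2, b2, c2
```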
The embodiment of the application obtains the affine transformation coefficient in the above manner so as to perform subsequent affine transformation, obtain more accurate transformed position information and improve the accuracy of pedestrian matching.
On the basis of the above embodiment, the first image acquisition device is a monocular image acquisition device or a depth image acquisition device, and when the first image acquisition device is the monocular image acquisition device, acquiring the depth information of the first pedestrian includes:
processing the first image by using a preset depth three-dimensional neural network model to obtain a corresponding third image;
and obtaining the depth information according to the parallax of the first image and the third image.
Illustratively, fig. 6 is a block diagram of a deep three-dimensional neural network provided by an embodiment of the present application. As shown in fig. 6, after the first image is input into the deep three-dimensional neural network, there is a branch after each pooling layer; at each branch the feature map is upsampled by a "deconvolution" layer (a learned upsampling filter), which reverses the forward and backward computations of a convolutional layer. The feature maps upsampled at each level are added together to obtain a feature representation with the same size as the input image. This feature representation is convolved once more, and a softmax transform is then applied across the channels at each spatial position; the output of the softmax layer can be understood as a disparity map in the form of a probability distribution. Finally, the disparity map and the first image are input to a selection layer to obtain the third image.
The deep three-dimensional neural network provided by the embodiment of the application combines multi-level information and is trained end to end from the first image to the third image; it predicts a disparity probability distribution and feeds it to the selection layer, which is a model that performs depth-image rendering in a differentiable way, so that the rendering can be carried out inside the network.
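A minimal sketch of such a selection layer, assuming a PyTorch implementation and a fixed set of candidate disparities; the backbone that produces the per-pixel disparity scores is omitted, and none of the sizes here are prescribed by this embodiment.

```python
import torch
import torch.nn.functional as F

def selection_layer(first_image, disparity_logits, candidate_disparities):
    """Differentiable rendering of the third image from the first image.

    first_image:           (B, 3, H, W) input view
    disparity_logits:      (B, D, H, W) per-pixel scores, one channel per candidate disparity
    candidate_disparities: list of D integer horizontal pixel shifts
    """
    probs = F.softmax(disparity_logits, dim=1)        # probability distribution over disparities
    rendered = torch.zeros_like(first_image)
    for k, shift in enumerate(candidate_disparities):
        shifted = torch.roll(first_image, shifts=shift, dims=3)   # shift the view horizontally
        rendered = rendered + probs[:, k:k + 1] * shifted          # weight by its probability
    return rendered   # the "third image"
```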
After the third image is obtained, the first image and the third image can be treated as the images obtained by the two cameras of a binocular camera, and the depth information of the first pedestrian is obtained from the parallax between the first image and the third image. The process of obtaining parallax from the first image and the third image is similar to that of a binocular camera; see the embodiment described above.
According to the embodiment of the application, the third image is obtained through the depth three-dimensional neural network model and the depth information is obtained from the parallax, so that depth information can be obtained with a monocular camera rather than a more expensive depth camera, which saves resources.
On the basis of the foregoing embodiment, fig. 7 is a flowchart illustrating a method for detecting a pedestrian according to an embodiment of the present application, and as shown in fig. 7, the method for detecting a pedestrian includes:
step 601: and acquiring a candidate frame of the image to be detected by adopting a candidate region generation algorithm.
Illustratively, a candidate region generation algorithm (e.g., selective search) over-segments the original image using superpixels, dividing the image into many small regions. The small regions are then merged according to four kinds of information, namely color similarity, texture similarity, the total area after merging, and the proportion that the merged area occupies in its candidate region. The two regions with the highest merging score are merged first, and merging is repeated until all regions are merged into the complete image. All regions that ever appeared during this process are kept as output candidate regions, and these candidate regions are framed to obtain the corresponding candidate frames.
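A sketch of generating candidate frames with selective search, assuming the OpenCV contrib build (`opencv-contrib-python`) is available; the cap on the number of boxes is an arbitrary illustration value.

```python
import cv2

def candidate_frames(image_bgr, max_boxes=200):
    """Selective-search region proposals as (x, y, w, h) candidate frames."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()   # fast color/texture/size/fill merging strategy
    return ss.process()[:max_boxes]
```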
Step 602: and inputting the image to be detected with the candidate frame into a convolutional neural network for feature extraction to obtain a plurality of features of the image corresponding to the candidate frame.
Exemplarily, the image to be detected with its candidate frames is input into a convolutional neural network; the candidate regions generated by selective search are mapped to the corresponding positions in the convolutional feature map, the corresponding feature blocks are cut out of the feature map, and these feature blocks are sent to a region of interest (RoI) pooling layer. The RoI pooling layer divides candidate regions of different sizes into equal 7 × 7 grids and takes the maximum value in each grid cell as the feature representation, finally obtaining fixed-length feature vectors of 7 × 7 × 256 dimensions. After the regional features are extracted, the convolutional neural network feeds the fixed-length feature vectors into two parallel fully connected layers.
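A small sketch of the RoI pooling step using torchvision's `roi_pool`; the feature-map size, the 1/16 spatial scale and the example box are assumptions for illustration.

```python
import torch
from torchvision.ops import roi_pool

# feature_map: convolutional features of the image to be detected (batch of 1, 256 channels).
feature_map = torch.randn(1, 256, 38, 50)
# boxes: rows of (batch_index, x1, y1, x2, y2) in input-image coordinates, one per candidate frame.
boxes = torch.tensor([[0.0, 10.0, 20.0, 90.0, 200.0]])

pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
region_features = pooled.flatten(start_dim=1)   # one 7*7*256 = 12544-dimensional vector per box
```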
Step 603: and classifying the plurality of features to obtain object classes, and performing regression on the plurality of features to obtain candidate frame offsets.
Illustratively, two fully connected layers are respectively used for determining the category of the candidate region and completing regression of the candidate box boundary, so that a plurality of features can be classified to obtain the object category. Furthermore, regression can be performed on the plurality of features to obtain candidate frame offsets and obtain final candidate frames.
Step 604: and when the object type is a pedestrian, acquiring the position information of the pedestrian from the image to be detected according to the offset of the candidate frame.
For example, the object class is one of the classes the convolutional neural network learned during training; when the object class is a pedestrian, the position information of the pedestrian is obtained from the corresponding candidate frame, where the coordinates of the upper-left corner of the candidate frame may be taken as the position information of the pedestrian. The method described above can be used for pedestrian detection on both the first image and the second image.
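One common R-CNN-style way of decoding the candidate frame offsets and reading off the pedestrian position is sketched below; the exact offset encoding is not fixed by this embodiment, so treat this parameterization as an assumed convention.

```python
import math

def apply_frame_offsets(candidate_frame, offsets):
    """Decode (dx, dy, dw, dh) offsets onto an (x, y, w, h) candidate frame and
    return the refined frame together with its upper-left corner as the pedestrian position."""
    x, y, w, h = candidate_frame
    dx, dy, dw, dh = offsets
    cx, cy = x + 0.5 * w, y + 0.5 * h                    # centre of the candidate frame
    new_cx, new_cy = cx + dx * w, cy + dy * h            # shift the centre
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)    # rescale width and height
    new_x, new_y = new_cx - 0.5 * new_w, new_cy - 0.5 * new_h
    return (new_x, new_y, new_w, new_h), (new_x, new_y)  # refined frame, pedestrian position
```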
According to the pedestrian detection method, the pedestrian can be rapidly detected, and therefore the pedestrian matching efficiency is improved.
Fig. 8 is a schematic structural diagram of a pedestrian matching device according to an embodiment of the present application, and as shown in fig. 8, the matching device includes: the system comprises an image acquisition module 701, a pedestrian detection module 702, a three-dimensional space establishment module 703, a mapping module 704 and a pedestrian matching module 705, wherein:
the image acquisition module 701 is configured to acquire a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, where the first image and the second image are images of the same scene acquired at the same time; the pedestrian detection module 702 is configured to perform pedestrian detection on the first image and the second image respectively to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image; the three-dimensional space establishing module 703 is configured to acquire depth information of the first pedestrian, and establish a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information; the mapping module 704 is configured to map the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information; the pedestrian matching module 705 is configured to perform pedestrian matching according to the first mapping position information and the second mapping position information.
On the basis of the foregoing embodiment, the pedestrian matching module 705 is specifically configured to:
obtaining a calibrated affine transformation coefficient, and carrying out affine transformation on the first mapping position information according to the affine transformation coefficient to obtain transformed position information;
and carrying out pedestrian matching on the transformed position information and the second position information based on the shortest distance principle.
On the basis of the foregoing embodiment, the pedestrian matching module 705 is specifically configured to:
and calculating the distance between the transformed position information and each second position information, and pairing the second pedestrian corresponding to the second position information with the shortest distance with the first pedestrian corresponding to the transformed position information.
On the basis of the above embodiment, the first image acquisition device is a monocular image acquisition device or a depth image acquisition device, and when the first image acquisition device is the monocular image acquisition device, the three-dimensional space establishing module 703 is specifically configured to:
processing the first image by using a preset depth three-dimensional neural network model to obtain a corresponding third image;
and obtaining the depth information according to the parallax of the first image and the third image.
On the basis of the foregoing embodiment, the mapping module 704 is specifically configured to:
and mapping the first three-dimensional position information on a plane corresponding to the second image according to the relative position relationship between the first image acquisition device and the second image acquisition device.
On the basis of the foregoing embodiment, the pedestrian detection module 702 is specifically configured to:
acquiring a candidate frame of an image to be detected by adopting a candidate region generation algorithm;
inputting the image to be detected with the candidate frame into a convolutional neural network for feature extraction to obtain a plurality of features of the image corresponding to the candidate frame;
classifying the plurality of features to obtain object classes, and performing regression on the plurality of features to obtain candidate frame offsets;
and when the object type is a pedestrian, acquiring the position information of the pedestrian from the image to be detected according to the offset of the candidate frame.
On the basis of the above embodiment, the affine transformation coefficients are obtained as follows:
acquiring a plurality of groups of image pairs, wherein each group of image pairs comprises a fourth image and a fifth image; the fourth image is acquired by the first image acquisition device, and the fifth image is acquired by the second image acquisition device at the same time;
taking a fixture in the fourth image as a first calibration object and a fixture in the fifth image as a second calibration object, acquiring second three-dimensional position information of the first calibration object, and mapping the second three-dimensional position information of the first calibration object onto a plane corresponding to the fifth image to acquire second mapping position information;
an affine transformation equation is constructed according to the position relation of the first image acquisition device and the second image acquisition device;
and obtaining an affine transformation coefficient corresponding to the affine transformation equation by adopting a least square method according to the second mapping position information of each fourth image and the position information of the second calibration object of the corresponding fifth image.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
Fig. 9 is a schematic structural diagram of an entity of an electronic device provided in an embodiment of the present application, and as shown in fig. 9, the electronic device includes: a processor (processor)801, a memory (memory)802, and a bus 803; wherein,
the processor 801 and the memory 802 communicate with each other via the bus 803;
the processor 801 is configured to call program instructions in the memory 802 to perform the methods provided by the above-described method embodiments, including for example: acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time; respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image; acquiring depth information of the first pedestrian, and constructing a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information; mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information; and carrying out pedestrian matching according to the first mapping position information and the second position information.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time; respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image; acquiring depth information of the first pedestrian, and constructing a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information; mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information; and carrying out pedestrian matching according to the first mapping position information and the second position information.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time; respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image; acquiring depth information of the first pedestrian, and constructing a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information; mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information; and carrying out pedestrian matching according to the first mapping position information and the second position information.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that contribute beyond the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only a preferred embodiment of the present application and is not intended to limit it; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within its scope of protection. It should be noted that like reference numbers and letters refer to like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed herein shall also be covered by the scope of the present application.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
Claims (9)
1. A pedestrian matching method, characterized by comprising:
acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time;
respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image;
acquiring depth information of the first pedestrian, and constructing a three-dimensional space according to the first position information and the depth information to acquire first three-dimensional position information;
mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information;
carrying out pedestrian matching according to the first mapping position information and the second position information;
wherein the pedestrian matching according to the first mapping position information and the second position information comprises:
obtaining a calibrated affine transformation coefficient, and carrying out affine transformation on the first mapping position information according to the affine transformation coefficient to obtain transformed position information;
and carrying out pedestrian matching on the transformed position information and the second position information based on the shortest distance principle.
2. The method according to claim 1, wherein the step of performing pedestrian matching on the transformed position information and the second position information based on the shortest distance principle comprises:
and calculating the distance between the transformed position information and each second position information, and pairing the second pedestrian corresponding to the second position information with the shortest distance with the first pedestrian corresponding to the transformed position information.
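A vectorised sketch of this pairing step, assuming both kinds of position information are given as (N, 2) and (M, 2) arrays of pixel coordinates (a representation this claim does not prescribe):

```python
import numpy as np

def pair_by_shortest_distance(transformed_positions, second_positions):
    """For each transformed position, find the second position at the shortest
    Euclidean distance and return index pairs (first_index, second_index)."""
    t = np.asarray(transformed_positions, dtype=float)   # (N, 2) transformed positions
    s = np.asarray(second_positions, dtype=float)        # (M, 2) second position information
    dists = np.linalg.norm(t[:, None, :] - s[None, :, :], axis=2)   # (N, M) distance matrix
    nearest = dists.argmin(axis=1)                        # shortest distance per first pedestrian
    return [(i, int(j)) for i, j in enumerate(nearest)]
```

The claim fixes only the shortest-distance criterion; whether the pairing is additionally forced to be one-to-one is left open here.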
3. The method according to claim 1, wherein the first image acquisition device is a monocular image acquisition device or a depth image acquisition device, and when the first image acquisition device is the monocular image acquisition device, obtaining the depth information of the first pedestrian comprises:
processing the first image by using a preset deep three-dimensional neural network model to obtain a corresponding third image;
and obtaining the depth information according to the parallax of the first image and the third image.
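A sketch of the second step, assuming the first image and the generated third image behave like a rectified stereo pair with a known focal length and virtual baseline, and using OpenCV block matching as a stand-in for whatever parallax computation the embodiment uses; the network that produces the third image is outside this sketch, and all parameter values are assumptions:

```python
import cv2
import numpy as np

def depth_from_parallax(first_gray, third_gray, focal_px, baseline_m):
    """Per-pixel depth from the parallax (disparity) between the first image and
    the network-generated third image, treated here as a rectified stereo pair.

    first_gray / third_gray: 8-bit single-channel images of the same size.
    focal_px, baseline_m: assumed calibration values (focal length in pixels,
    virtual baseline in metres); not specified by the claim.
    """
    matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = matcher.compute(first_gray, third_gray).astype(np.float32) / 16.0
    disparity[disparity <= 0] = np.nan                # mask invalid or occluded pixels
    return focal_px * baseline_m / disparity          # depth map in metres

# The depth of the first pedestrian could then be read out inside its detection box, e.g.:
# depth = float(np.nanmedian(depth_map[y0:y1, x0:x1]))
```

If the preset network outputs a disparity map directly rather than a third image, the block-matching step above would simply be replaced by that output.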
4. The method of claim 1, wherein mapping the first three-dimensional position information on a plane corresponding to the second image comprises:
and mapping the first three-dimensional position information on a plane corresponding to the second image according to the relative position relationship between the first image acquisition device and the second image acquisition device.
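One common way to obtain that relative position relationship, assuming each acquisition device's pose has been calibrated as a world-to-camera rotation and translation (a representation the claim does not mandate), is sketched below; the resulting R and t are what the pinhole projection of the first three-dimensional position information onto the second image plane consumes, as in the sketch given after the description above.

```python
import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Relative rotation and translation taking points from the first image
    acquisition device's camera frame into the second one's, given each
    device's calibrated world-to-camera pose (R, t)."""
    R_rel = R2 @ R1.T              # rotation from camera-1 coordinates to camera-2 coordinates
    t_rel = t2 - R_rel @ t1        # translation between the two camera frames
    return R_rel, t_rel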
5. The method according to any one of claims 1-4, wherein the method of pedestrian detection comprises:
acquiring a candidate frame of an image to be detected by adopting a candidate region generation algorithm;
inputting the image to be detected with the candidate frame into a convolutional neural network for feature extraction to obtain a plurality of features of the image corresponding to the candidate frame;
classifying the plurality of features to obtain object classes, and performing regression on the plurality of features to obtain candidate frame offsets;
and when the object type is a pedestrian, acquiring the position information of the pedestrian from the image to be detected according to the offset of the candidate frame.
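The generic pipeline recited here (candidate-region generation, convolutional feature extraction, classification plus candidate-frame offset regression) is what detectors of the Faster R-CNN family implement end to end, so a pretrained torchvision model can stand in for it in a sketch; the model choice, the score threshold, and the helper name detect_pedestrians are assumptions, not the claimed implementation.

```python
import torch
import torchvision

# Pretrained Faster R-CNN: region proposal network (candidate frames), CNN
# feature extraction, classification and candidate-frame offset regression.
_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
_model.eval()

def detect_pedestrians(image_rgb_uint8, score_thresh=0.7):
    """Return pedestrian boxes [x1, y1, x2, y2] detected in one RGB image (H, W, 3 uint8)."""
    tensor = torch.from_numpy(image_rgb_uint8).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        pred = _model([tensor])[0]
    keep = (pred["labels"] == 1) & (pred["scores"] >= score_thresh)  # COCO label 1 = person
    return pred["boxes"][keep].cpu().numpy()
```

Run on both images, such a detector could supply the first position information and the second position information used by the matching steps above.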
6. The method of claim 1, wherein the affine transformation coefficient is obtained by:
acquiring a plurality of groups of image pairs, wherein each group of image pairs comprises a fourth image and a fifth image; the fourth image is acquired by the first image acquisition device, and the fifth image is acquired by the second image acquisition device at the same time;
taking a fixture in the fourth image as a first calibration object and a fixture in the fifth image as a second calibration object, acquiring second three-dimensional position information of the first calibration object, and mapping the second three-dimensional position information of the first calibration object onto a plane corresponding to the fifth image to acquire second mapping position information;
constructing an affine transformation equation according to the positional relationship between the first image acquisition device and the second image acquisition device;
and obtaining an affine transformation coefficient corresponding to the affine transformation equation by adopting a least square method according to the second mapping position information of each fourth image and the position information of the second calibration object of the corresponding fifth image.
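One way to solve for the coefficients with a least square method, assuming the affine model x' = A [x, y, 1]^T with a 2x3 coefficient matrix A (the concrete parameterisation is not fixed by the claim), is sketched below using NumPy:

```python
import numpy as np

def fit_affine_coefficients(second_mapping_positions, second_calibration_positions):
    """Least-squares fit of a 2x3 affine matrix A so that, for each calibration
    pair, A @ [x, y, 1] of the second mapping position approximates the position
    of the second calibration object in the corresponding fifth image."""
    src = np.asarray(second_mapping_positions, dtype=float)       # (N, 2)
    dst = np.asarray(second_calibration_positions, dtype=float)   # (N, 2)
    X = np.hstack([src, np.ones((src.shape[0], 1))])              # (N, 3) homogeneous sources
    A_T, _residuals, _rank, _sv = np.linalg.lstsq(X, dst, rcond=None)
    return A_T.T                                                  # (2, 3) affine coefficients

# Applying the calibrated coefficients to a first mapping position p = (u, v):
# A = fit_affine_coefficients(mapped_calib_pts, target_calib_pts)
# transformed = A[:, :2] @ np.array([u, v]) + A[:, 2]
```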
7. A pedestrian matching apparatus, comprising:
the image acquisition module is used for acquiring a first image acquired by a first image acquisition device and a second image acquired by a second image acquisition device, wherein the first image and the second image are images of the same scene acquired at the same time;
the pedestrian detection module is used for respectively carrying out pedestrian detection on the first image and the second image to obtain first position information of a first pedestrian in the first image and second position information of a second pedestrian in the second image;
the three-dimensional space establishing module is used for acquiring the depth information of the first pedestrian, establishing a three-dimensional space according to the first position information and the depth information, and acquiring first three-dimensional position information;
the mapping module is used for mapping the first three-dimensional position information on a plane corresponding to the second image to obtain first mapping position information;
the pedestrian matching module is used for carrying out pedestrian matching according to the first mapping position information and the second position information;
the pedestrian matching module is specifically configured to obtain a calibrated affine transformation coefficient, perform affine transformation on the first mapping position information according to the affine transformation coefficient to obtain transformed position information, and perform pedestrian matching on the transformed position information and the second position information based on a shortest distance principle.
8. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-6.
9. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910771286.6A CN110443228B (en) | 2019-08-20 | 2019-08-20 | Pedestrian matching method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910771286.6A CN110443228B (en) | 2019-08-20 | 2019-08-20 | Pedestrian matching method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110443228A CN110443228A (en) | 2019-11-12 |
CN110443228B (en) | 2022-03-04 |
Family
ID=68436785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910771286.6A Active CN110443228B (en) | 2019-08-20 | 2019-08-20 | Pedestrian matching method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110443228B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111768436B (en) * | 2020-06-17 | 2022-10-18 | 哈尔滨理工大学 | Improved image feature block registration method based on fast-RCNN |
CN113449674B (en) * | 2021-07-12 | 2022-09-30 | 江苏商贸职业学院 | Pig face identification method and system |
CN116723264B (en) * | 2022-10-31 | 2024-05-24 | 荣耀终端有限公司 | Method, apparatus and storage medium for determining target location information |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622766A (en) * | 2012-03-01 | 2012-08-01 | 西安电子科技大学 | Multi-objective optimization multi-lens human motion tracking method |
CN103530638B (en) * | 2013-10-29 | 2016-08-17 | 无锡赛思汇智科技有限公司 | Method for pedestrian matching under multi-cam |
CN103729620B (en) * | 2013-12-12 | 2017-11-03 | 北京大学 | A kind of multi-view pedestrian detection method based on multi-view Bayesian network |
CN105518744B (en) * | 2015-06-29 | 2018-09-07 | 北京旷视科技有限公司 | Pedestrian recognition methods and equipment again |
CN106485735A (en) * | 2015-09-01 | 2017-03-08 | 南京理工大学 | Human body target recognition and tracking method based on stereovision technique |
US9652896B1 (en) * | 2015-10-30 | 2017-05-16 | Snap Inc. | Image based tracking in augmented reality systems |
CN108629791B (en) * | 2017-03-17 | 2020-08-18 | 北京旷视科技有限公司 | Pedestrian tracking method and device and cross-camera pedestrian tracking method and device |
CN107580199A (en) * | 2017-09-08 | 2018-01-12 | 深圳市伊码泰珂电子有限公司 | The target positioning of overlapping ken multiple-camera collaboration and tracking system |
CN108875505B (en) * | 2017-11-14 | 2022-01-21 | 北京旷视科技有限公司 | Pedestrian re-identification method and device based on neural network |
US10970425B2 (en) * | 2017-12-26 | 2021-04-06 | Seiko Epson Corporation | Object detection and tracking |
CN109063567B (en) * | 2018-07-03 | 2021-04-13 | 百度在线网络技术(北京)有限公司 | Human body recognition method, human body recognition device and storage medium |
CN109087349B (en) * | 2018-07-18 | 2021-01-26 | 亮风台(上海)信息科技有限公司 | Monocular depth estimation method, device, terminal and storage medium |
CN109190508B (en) * | 2018-08-13 | 2022-09-06 | 南京财经大学 | Multi-camera data fusion method based on space coordinate system |
CN109598743B (en) * | 2018-11-20 | 2021-09-03 | 北京京东尚科信息技术有限公司 | Pedestrian target tracking method, device and equipment |
CN109784166A (en) * | 2018-12-13 | 2019-05-21 | 北京飞搜科技有限公司 | The method and device that pedestrian identifies again |
2019-08-20: CN CN201910771286.6A patent/CN110443228B/en, status: Active
Also Published As
Publication number | Publication date |
---|---|
CN110443228A (en) | 2019-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10334168B2 (en) | Threshold determination in a RANSAC algorithm | |
US11830141B2 (en) | Systems and methods for 3D facial modeling | |
US9098911B2 (en) | Depth map generation from a monoscopic image based on combined depth cues | |
US8433157B2 (en) | System and method for three-dimensional object reconstruction from two-dimensional images | |
CN107392958B (en) | Method and device for determining object volume based on binocular stereo camera | |
CN110135455A (en) | Image matching method, device and computer readable storage medium | |
US9025862B2 (en) | Range image pixel matching method | |
CN108648194B (en) | Three-dimensional target identification segmentation and pose measurement method and device based on CAD model | |
US11651581B2 (en) | System and method for correspondence map determination | |
KR100953076B1 (en) | Multi-view matching method and device using foreground/background separation | |
CN110443228B (en) | Pedestrian matching method and device, electronic equipment and storage medium | |
Ramirez et al. | Open challenges in deep stereo: the booster dataset | |
CN115035235A (en) | Three-dimensional reconstruction method and device | |
Koch et al. | Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset | |
CN116051736A (en) | Three-dimensional reconstruction method, device, edge equipment and storage medium | |
KR101825218B1 (en) | Apparatus and method for generaing depth information | |
CN111738061A (en) | Binocular vision stereo matching method based on regional feature extraction and storage medium | |
Lee et al. | Single camera-based full depth map estimation using color shifting property of a multiple color-filter aperture | |
Yang et al. | Design flow of motion based single camera 3D mapping | |
CN108062741B (en) | Binocular image processing method, imaging device and electronic equipment | |
Januzi | Triple-camera setups for image-based depth estimation | |
Yang et al. | Joint Solution for Temporal-Spatial Synchronization of Multi-View Videos and Pedestrian Matching in Crowd Scenes. | |
Akuha Solomon Aondoakaa | Depth Estimation from a Single Holoscopic 3D Image and Image Up-sampling with Deep-learning | |
Kaur et al. | Creation and Implementation of Adaptive Algorithms For Visualization of Stereoscopic Face Mask |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||