CN113140005A

CN113140005A - Target object positioning method, device, equipment and storage medium

Info

Publication number: CN113140005A
Application number: CN202110474194.9A
Authority: CN
Inventors: 杨昆霖; 李昊鹏; 刘诗男; 侯军; 伊帅
Original assignee: Shanghai Sensetime Technology Development Co Ltd
Current assignee: Shanghai Sensetime Technology Development Co Ltd
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2021-07-20
Anticipated expiration: 2041-04-29
Also published as: CN113140005B

Abstract

The embodiment of the specification provides a target object positioning method and device, electronic equipment and a computer-readable storage medium. The method comprises the steps of obtaining at least two frames of target images including images to be detected from continuously collected multi-frame images, determining a second pixel in a feature map of each frame of target image according to the pixel position of the first pixel aiming at each first pixel in the feature map of the images to be detected, determining the fusion weight of each second pixel according to the similarity of the first pixel and the second pixel, fusing the features of the second pixel based on the fusion weight to obtain the feature of the pixel at the pixel position in the target feature map, and then determining the position of a target object in the images to be detected based on the target feature map. By fusing the information of the image to be detected and the adjacent image, the correlation between the adjacent images can be fully utilized, and the problem of reduced positioning precision caused by the interference of a moving object in the image is avoided.

Description

Target object positioning method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for locating a target object.

Background

In the fields of security protection, monitoring and the like, a target object in a video or an image is generally required to be positioned, then tracking, counting, behavior analysis and the like can be carried out on the target object based on a positioning result, and an accurate positioning result is a key for ensuring the accuracy of a subsequent processing result. Conventionally, when a target object in an image is located, a single frame image is often input to a neural network trained in advance, and the position of the target object in the image is predicted by the neural network. However, since there may be moving objects in the image, the image may be blurred, which may interfere with the positioning of the target object and affect the positioning accuracy.

Disclosure of Invention

The disclosure provides a target object positioning method, a target object positioning device and a storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided a target object positioning method, the method including:

acquiring at least two frames of target images from continuously acquired multi-frame images, wherein the at least two frames of target images comprise images to be detected;

for each first pixel in the feature map of the image to be detected, determining a second pixel in the feature map of the at least two frames of target images respectively based on the pixel position of the first pixel;

determining a fusion weight of the second pixel based on the similarity of the second pixel to the first pixel;

fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic diagram so as to determine the target characteristic diagram;

and determining the position of the target object in the image to be detected based on the positioning prediction of the target feature map.

In some embodiments, the higher the similarity of the second pixel to the first pixel, the greater the fusion weight of the second pixel.

In some embodiments, determining the fusion weight of the second pixel based on the similarity of the second pixel to the first pixel comprises:

acquiring a first vector characterizing the characteristics of the first pixel and a second vector characterizing the characteristics of the second pixel;

and normalizing the product of the first vector and the second vector to obtain the fusion weight of the second pixel.

In some embodiments, fusing the second pixels based on the fusion weight to obtain a feature of the pixel position in a target feature map, so as to determine the target feature map, includes:

obtaining a second vector characterizing the second pixel;

and performing weighted summation on the second vectors based on the fusion weight to obtain a third vector for characterizing the pixel of the pixel position in the target feature map so as to determine the target feature map.

In some embodiments, the second pixel comprises a pixel within a target pixel region, and the target pixel region is a region surrounding the pixel position or a neighboring region of the pixel position in the feature map of the at least two frames of target images.

In some embodiments, the at least two frames of target images are derived based on:

carrying out down-sampling processing on multi-frame images continuously acquired by an image acquisition device;

and extracting an image area including the target object in the image aiming at each frame of image obtained by down-sampling to obtain one frame of target image in the at least two frames of target images.

In some embodiments, determining the position of the target object in the image to be detected based on the positioning prediction of the target feature map includes:

determining a positioning probability map corresponding to the target image according to the target feature map, wherein the positioning probability map is used for indicating the probability that pixel points in the target image are key points of the target object, and the key points are used for positioning the target object;

determining the positions of the key points in the target image based on the positioning probability map to determine the positions of the target objects in the target image.

In some embodiments, determining the location of the keypoint in the target image based on the localization probability map comprises:

performing mean pooling on the positioning probability map to obtain a first probability map;

performing maximum pooling on the first probability map to obtain a second probability map;

and determining pixel points with the same probability and larger than a preset threshold value in the first probability map and the second probability map as the key points.

In some embodiments, the method is implemented by a pre-trained neural network trained based on:

acquiring at least two frames of sample images from continuously acquired multi-frame images, wherein the at least two frames of sample images comprise target sample images carrying annotation information, the annotation information is used for indicating whether pixel points in the target sample images are key points of a target object, and the key points are used for positioning the target object;

inputting the at least two frames of sample images into a neural network to implement the following steps by the neural network:

for each pixel in the feature map of the target sample image, respectively determining a target pixel in the feature maps of the at least two frames of sample images based on the pixel position of the pixel; determining a fusion weight of the target pixel based on the similarity of the target pixel and the pixel; fusing the target pixels based on the fusion weight to obtain the characteristics of the pixel positions in the sample target characteristic diagram so as to determine the sample target characteristic diagram; determining a sample positioning probability map corresponding to the sample image based on the sample target feature map, wherein the sample positioning probability map is used for indicating the probability that each pixel point in the target sample image is a key point of a target object;

and constructing a loss function based on the difference between the sample positioning probability graph and a real positioning probability graph corresponding to the target sample image, and training the neural network by taking the loss function as an optimization target, wherein the real positioning probability graph is determined based on the labeling information.

According to a second aspect of the embodiments of the present disclosure, there is provided a target object positioning apparatus, the apparatus comprising:

the device comprises an acquisition module, a detection module and a processing module, wherein the acquisition module is used for acquiring at least two frames of target images from continuously acquired multi-frame images, and the at least two frames of target images comprise images to be detected;

the target feature map determining module is used for determining a second pixel in the feature maps of the at least two frames of target images respectively according to the pixel position of each first pixel in the feature map of the image to be detected; determining a fusion weight of the second pixel based on the similarity of the second pixel to the first pixel; fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic diagram so as to determine the target characteristic diagram;

and the positioning module is used for determining the position of the target object in the image to be detected based on the positioning prediction of the target characteristic map.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, where the electronic device includes a processor, a memory, and computer instructions stored in the memory and executable by the processor, and when the processor executes the computer instructions, the method of the first aspect is implemented.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed, may implement the method of the first aspect described above.

In the embodiment of the disclosure, when a target object in an image is located, at least two frames of target images including the image to be detected may be obtained from a plurality of frames of images that are continuously collected, for each first pixel in the image to be detected, a second pixel is determined in a feature map of each frame of target image based on a pixel position where the first pixel is located, a fusion weight of each second pixel is determined according to a similarity between the first pixel and the second pixel, features of the second pixels are fused based on the fusion weight, features of the pixels at the pixel position in the target feature map are obtained, so as to obtain a target feature map, the target feature map is obtained by fusing information of the image to be detected and an adjacent image thereof, so as to locate the target object in the image to be detected, and association between the images may be fully utilized, thereby avoiding a problem of location accuracy degradation caused by interference of a moving object in the image, so as to improve the positioning precision.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 is a schematic diagram of predicting a center position of a human head through a neural network according to an embodiment of the present disclosure.

Fig. 2 is a schematic diagram of a target object positioning method according to an embodiment of the present disclosure.

Fig. 3(a) is a schematic diagram of a target feature map obtained by fusing multi-frame feature maps according to an embodiment of the present disclosure.

Fig. 3(b) is a schematic diagram of a target feature map obtained by fusing multi-frame feature maps according to an embodiment of the present disclosure.

Fig. 4 is a schematic diagram of a structural schematic diagram of a neural network according to an embodiment of the present disclosure.

Fig. 5 is a schematic diagram of a target feature map obtained by fusing multi-frame feature maps according to an embodiment of the present disclosure.

Fig. 6 is a schematic diagram of a structural schematic diagram of a neural network according to an embodiment of the present disclosure.

Fig. 7 is a schematic logical structure diagram of a target object locating apparatus according to an embodiment of the present disclosure.

Fig. 8 is a schematic diagram of a logical structure of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.

In order to make the technical solutions in the embodiments of the present disclosure better understood and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.

In the fields of security protection, monitoring and the like, a target object in a video or an image is generally required to be positioned, then tracking, counting, behavior analysis and the like can be carried out on the target object based on a positioning result, and an accurate positioning result is a key for ensuring the accuracy of a subsequent processing result. When the target object in the image is located, the target object can be located by locating the key points of the target object, for example, a neural network can be trained in advance, the neural network can output a key point location map corresponding to the image, for example, a pixel point which is a key point in the image is marked as 1, a pixel point which is not a key point in the image is marked as 0, and then the location of the target object in the image can be determined based on the key point location map. Taking crowd positioning as an example, crowd positioning can be realized by positioning the position of the center point of the head, as shown in fig. 1, an original image can be input into a neural network, the neural network can directly output a map of the center point of the head, the pixel point of the center point of the head in the map is marked as 1, and the pixel point of the center point of the head which is not the center point of the head in the map is marked as 0.

Currently, when positioning a target object in an image, a single frame of image is generally input into a neural network trained in advance, and the position of the target object in the image is predicted through the neural network. However, since there may be a moving object in the image, such as a moving person or an object, an undesirable phenomenon such as blurring of the image may occur, which may cause interference to the positioning of the target object and seriously affect the positioning accuracy.

In order to improve the positioning accuracy when positioning the target object in the image, the embodiments of the present disclosure provide a method for positioning a target object, when the target object in the image to be detected is positioned, the target object in the image to be detected can be positioned by combining one or more frames of adjacent images, fusing the characteristic graph of the adjacent image and the characteristic graph of the image to be detected to obtain a target characteristic graph, wherein, aiming at each pixel in the image to be detected, one or more pixels can be respectively determined in the neighborhood of the pixel position of the pixel in each frame of adjacent image, and determining fusion weight based on the similarity of the one or more pixels and the pixels of the image to be detected, and fusing the one or more pixels determined in the adjacent images of each frame based on the fusion weight to obtain the feature of the pixel position in the target feature map. The position of the target object in the image to be detected can then be determined based on the location prediction of the target feature map. By considering the time sequence relationship between the image to be detected and other adjacent images and fusing the information of the multi-frame images, the target characteristic diagram is obtained, the relevance between the images can be fully utilized, the problem that the positioning accuracy is reduced due to the interference of moving objects in the images is solved, and the positioning accuracy can be improved.

The detection method of the target object in the embodiment of the present disclosure may be executed by various electronic devices, for example, electronic devices such as a notebook computer, a server, a mobile phone, and a tablet.

The target object of the embodiment of the present disclosure may be various objects that need to be identified and located from the image, for example, the target object may be a person, a vehicle, an animal, and the like. By the method, the target object in the image can be positioned, and further the target object in the image can be subjected to subsequent processing such as counting, tracking and behavior analysis.

Specifically, the method, as shown in fig. 2, includes the following steps:

s202, acquiring at least two frames of target images from continuously acquired multi-frame images, wherein the at least two frames of target images comprise images to be detected;

s204, aiming at each first pixel in the feature map of the image to be detected, respectively determining a second pixel in the feature maps of the at least two frames of target images based on the pixel position of the first pixel;

s206, determining the fusion weight of the second pixel based on the similarity of the second pixel and the first pixel;

s208, fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic diagram so as to determine the target characteristic diagram;

and S2010, determining the position of the target object in the image to be detected based on the positioning prediction of the target feature map.

In step S202, first, at least two frames of target images may be obtained from multiple frames of images continuously acquired by the image acquisition device, for example, at least two frames of target images may be obtained from a video frame of a section of video acquired by the image acquisition device, where the at least two frames of target images include an image to be detected, the image to be detected is an image in which a target object in the image needs to be located, and the image to be detected may be any one of the at least two frames of target images and may be preset as needed, for example, the image to be detected may be a first frame of image acquired in the at least two frames of images, or an intermediate frame of image, or a last frame of image. The at least two frames of target images may be continuously acquired images or discontinuously acquired images, and certainly, in order to obtain a more accurate positioning result when using other images of the at least two frames of target images to assist the to-be-detected image to position the target object, the contents of the other images and the to-be-detected image are preferably not identical but have a certain difference, and meanwhile, the contents of the other images and the to-be-detected image cannot differ too much but mostly have a content consistency, so that the other images are preferably images separated from the to-be-detected image by a certain number of frames, and the number of the separated frames is preferably controlled within a certain range to ensure that the to-be-detected image is integrally close to the other images but has a certain difference.

In step S204, after at least two frames of target images are acquired, feature extraction may be performed on each frame of target image in the at least two frames of target images to obtain a feature map of each frame of target image, so as to obtain feature maps of the at least two frames of target images. The feature extraction of the target image may be implemented by using a preset neural network, and may also be implemented by using other manners, which is not limited in the embodiment of the present disclosure. Each frame of feature map may include features of multiple channels, and different channels may represent different types of features of an image, for example, color features, contour features, and the like. After obtaining the feature maps of the at least two frames of target images, for each pixel in the feature map of the image to be detected, which is hereinafter referred to as a first pixel, one or more pixels may be determined in each frame of target image respectively based on a pixel position where the first pixel is located, which is hereinafter referred to as a second pixel, where the second pixel may be a pixel with a higher association degree with the first pixel, for example, one or more second pixels may be determined in a neighboring area surrounding or neighboring the pixel position in each frame of target image, and the second pixel may be a pixel at the pixel position in the target image, or a plurality of pixels at the pixel position and its neighboring pixel positions.

In steps S206 and S208, after determining the second pixels in each frame of the target image based on the pixel positions, the fusion weights of the second pixels may be determined based on the similarity between the second pixels and the first pixels, wherein the similarity between the second pixels and the first pixels may be determined based on the similarity between the features of the second pixels and the first pixels. After determining the fusion weight of each second pixel, the second pixels may be fused based on the fusion weight to obtain the feature of the pixel at the pixel position in the fused target feature map. For example, assuming that the first pixel is a pixel in the first row and the first column in the image to be detected, the feature of the pixel in the first row and the first column in the target feature map may be determined based on the above steps. For other pixel positions in the image to be detected, the characteristics of the pixel positions in the target characteristic diagram can be determined in sequence by adopting the above mode, and therefore the whole target characteristic diagram is obtained through fusion.

For example, as shown in fig. 3(a), assume that three frames of target images are respectively an image a, an image B and an image C, where the image a is an image to be detected, and the feature maps corresponding to the three frames of target images are respectively a ', B ' and C '. For a first pixel (such as a gray pixel in the figure) in a first row and a first column in a feature map a ' of an image to be detected, second pixels may be determined in the feature maps a ', B ', and C ', respectively, where the second pixels may be pixels in the first row and the first column in the feature maps a ', B ', and C ', then fusion weights of the second pixels may be determined according to similarities between the second pixels and the first pixels, and then features of pixels in the first row and the first column in the feature maps a ', B ', and C ' are fused based on the respective corresponding fusion weights, so as to obtain features of pixels in the first row and the first column in a target feature map D '. The characteristics of the pixels at other pixel positions in the target feature map can also be determined by the method described above, so as to obtain the characteristics of the whole target feature map.

For another example, as shown in fig. 3(B), for a first pixel (a pixel P in the figure) in the first row and the first column in the feature map a ' of the image to be detected, one pixel region (e.g., a gray region in the figure) may be selected at the corresponding pixel position of the feature maps a ', B ', and C ', the pixels in the three pixel regions are all used as the second pixel, then the fusion weight of the pixels in the three pixel regions is determined based on the similarity between the pixels in the three pixel regions and the pixels in the first row and the first column in the image to be detected (i.e., a '), and then the features of the pixels in the three pixel regions are fused based on the fusion weight, so as to obtain the feature of the pixel in the same pixel position as the pixel P in the target feature map. The characteristics of the pixels at other pixel positions in the target feature map can also be determined by the method described above to obtain the characteristics of the target feature map of the whole frame.

Of course, there are many ways to specifically determine the second pixel, and the determination can be set based on actual requirements in specific applications.

In step S2010, after the target feature map is obtained, the target feature map may be subjected to positioning prediction, so as to determine a position of the target object in the image to be detected according to the target feature map, thereby positioning the target object.

When a target object in an image to be detected is positioned, a target characteristic diagram is obtained by fusing information of adjacent multi-frame images, an enhanced target characteristic diagram is obtained by fully utilizing the relevance between the adjacent images, and the target object is positioned based on the target characteristic diagram, so that the positioning result is more accurate.

In some embodiments, steps S202-S2010 may be performed by a pre-trained neural network, such as, after at least two frames of target images are obtained, the target images can be input into a pre-trained neural network, the characteristic diagram of the target images is extracted through the neural network to obtain at least two frames of target images, then, the neural network can respectively determine a second pixel in the feature map of the target image based on the pixel position of each first pixel in the feature map of the image to be detected, determining the fusion weight corresponding to the second pixel based on the similarity degree of the second pixel and the first phase pixel, fusing the characteristics of the second pixel based on the fusion weight to obtain the characteristics of the target characteristic diagram at the pixel position, obtaining the whole target characteristic diagram by a similar method, and the neural network may predict the position of the target object in the target image based on the obtained target feature map.

In some embodiments, the structure of the pre-trained neural network may be as shown in fig. 4, and includes a first sub-network, a second sub-network, and a third sub-network, where the first sub-network is used to perform feature extraction on target images (such as target image 1, target image 2, and target image 3 in the figure) to obtain feature maps (such as feature map 1, feature map 2, and feature map 3 in the figure) corresponding to each frame of the target images, the second sub-network is used to determine, for each first pixel of the feature maps of the image to be detected (target image 2 in the figure), a second pixel in the feature maps of the target images based on the pixel position of the first pixel, and a fusion weight corresponding to a second pixel in the feature maps of each target image based on the similarity between the first pixel and the second pixel, the features of the second pixels are fused based on the fusion weight, and obtaining the characteristics of the target characteristic diagram at the pixel position, and repeating the steps to obtain the whole target characteristic diagram. The third sub-network is configured to predict a position of the target object in the target image based on the target feature map, for example, the third sub-network may obtain a positioning probability map corresponding to the image to be detected based on the target feature map, where the positioning probability map is used to indicate a probability that each pixel point in the image to be detected is a key point of the target object, where the key point may be used to position the target object, and then determine the position of the target object in the image to be detected according to the positioning probability map.

In some embodiments, the neural network may be trained by: at least two frames of sample images can be obtained from continuously acquired multi-frame images, the at least two frames of sample images comprise target sample images carrying annotation information, the annotation information can be used for indicating whether pixel points in the target sample images are key points of the target object, and the key points can be used for positioning the target object. Based on the labeling information, a real positioning probability graph of the neural network can be obtained and used for indicating the real probability that each pixel point in the target sample image is a key point. Then inputting the at least two frames of sample images into a neural network, the neural network can determine a target pixel in the feature maps of the at least two frames of sample images respectively based on the pixel position of the pixel for each pixel in the feature maps of the target sample images, then can determine a fusion weight of the target pixel according to the similarity between the target pixel and the pixel, and fuse the target pixel based on the determined fusion weight to obtain the feature of the pixel position in the sample target feature map to determine the sample target feature map, then can determine a sample positioning probability map corresponding to the sample image based on the sample target feature map, wherein the sample positioning probability map is used for indicating the prediction probability that each pixel point in the target sample image is a key point of the target object, then can construct a loss based on the difference between the sample positioning probability map and the true positioning probability map corresponding to the target sample image, for example, the cross entropy loss of the two can be taken as a loss, and the neural network can be trained by taking the loss as an optimization target.

In order to predict a more accurate positioning result according to the target feature map obtained by fusion, when the target feature map is obtained by fusing at least two frame feature maps, a final target feature map can be determined based on a local attention mechanism. When the target feature map is determined based on the local attention mechanism, the feature of a certain pixel position in the target feature map can be obtained according to the feature fusion of a neighboring pixel region of the certain pixel position in the feature map of the image to be detected and a plurality of pixels in the neighboring region of the pixel position in the feature map of other target images. The feature fusion is carried out by considering the adjacent regions of all pixels, so that the target feature map not only fuses the time sequence information among multiple frames of images, but also fuses the spatial correlation information among adjacent pixel points of the same frame of image, the positioning precision is improved, and meanwhile, compared with the global attention mechanism, the redundancy can be reduced, and the calculation amount is reduced. Based on this, in step S204, when determining the second pixel in the feature map of the target image based on the pixel position where the first pixel is located, in some embodiments, the target pixel area may be determined in the feature map of each target image based on the pixel position, and the pixel in the target pixel area may be taken as the second pixel. The target pixel area is a pixel area surrounding the pixel position or a neighboring area of the pixel position, so as to ensure that the relevance of the second pixel and the first pixel is high.

For example, as shown in fig. 5, for any pixel position (e.g., pixel position P in the image) in the feature map of the image to be detected, one or more target pixel regions (e.g., gray regions in the image) may be determined in the feature map of each frame of the target image based on the pixel position, then fusion weights of pixels in the target pixel regions are determined based on similarities between the pixels in the target pixel regions and the pixels at the pixel position in the image to be detected, and features of pixel points in the target pixel regions in the at least two frames of feature maps are fused based on the fusion weights of the pixel points, so as to obtain a feature corresponding to the pixel position in the target feature map.

In some embodiments, when determining a target pixel region in the feature map of each frame of the target image based on the pixel position, as shown in fig. 5, one target pixel region may be determined in the feature map centering on the pixel position. For example, an N × N square area may be determined with the pixel position as the center, or a circular area may also be determined with the pixel position as the center, as long as the target pixel area is a neighboring area surrounding the pixel position, which may be specifically set according to actual requirements. Of course, the target pixel region may not be centered on the pixel position, and may include only the pixel position, for example.

In step S204, when determining the fusion weight of the second pixel based on the similarity between the second pixel and the first pixel, in some embodiments, a second vector characterizing the feature of the second pixel and a first vector characterizing the feature of the first pixel may be obtained first, and then the product of the second vector and the first vector is normalized to obtain the fusion weight of the second pixel. For example, the product of the first vector and the second vector may be determined, and then the product may be input to the softmax function to output the fusion weight.

In some embodiments, the higher the similarity of the second pixel to the first pixel, the greater the fusion weight of the second pixel. Since the higher the similarity between the second pixel and the first pixel, the more likely the second pixel and the first pixel are to represent the same object in the three-dimensional space, the greater the fusion weight corresponding to the second pixel should be.

In step S208, when the second pixels are fused based on the fusion weight to obtain the feature of the pixel position in the target feature map, in some embodiments, second vectors characterizing the feature of the second pixels may be obtained, and then the second vectors are weighted and summed according to the fusion weight corresponding to each second pixel to obtain third vectors characterizing the feature of the pixel position in the target feature map.

Certainly, in some embodiments, since the target feature map prediction and positioning result may be completed through a neural network, in order to reduce network parameters of the neural network as much as possible, the number of channels corresponding to the target feature map obtained by fusion may be kept consistent with the number of channels corresponding to each frame of feature map in the feature maps of the at least two frames of target images, thereby avoiding an increase in the number of channels of the target feature map and an increase in network parameters of the neural network.

In some embodiments, a plurality of frames of images continuously acquired by the image acquisition device may be downsampled to obtain the at least two frames of target images, for example, assuming that the image acquisition device acquires 60 frames of images per second, then the 60 frames of images may be downsampled to obtain 5 frames of images, and then the 5 frames of images are used as the target images, and the middle frames of the 5 frames of images are used as the images to be detected. The target images are obtained by carrying out downsampling processing on the images continuously acquired by multiple frames, so that most scenes between the target images can be ensured to be similar, and meanwhile, some differences are avoided, but the differences are not completely the same. In some embodiments, in order to reduce the amount of computation, save the computation resources, and improve the efficiency of locating the target object in the image, when the target image is obtained from multiple frames of images continuously acquired by the image acquisition device, the region including the target object may be extracted from the original image acquired by the image acquisition device to obtain the target image. Therefore, the region which does not comprise the target object in the original image can be cut off, and the calculation amount in the positioning process is reduced.

In some embodiments, in step S2010, when the position of the target object in the target image is determined according to the target feature map, a positioning probability map corresponding to the target image may be determined according to the target feature map, where the positioning probability map is used to indicate a probability that a pixel point in the target image is a key point of the target object, and the key point may be one point of the target object that can identify or represent the target object, for example, taking the target object as a task as an example, and the key point may be a head center point, a body center point, or the like. The location of the keypoints in the target image can then be determined based on the localization probability map to determine the location of the target object in the target image.

Of course, since the positioning probability map predicted by the neural network may have certain noise, the prediction probability of individual pixel points in the target image is high, and the individual pixel points may be misjudged as key points, in some implementations, in order to suppress the noise in the positioning probability map, the positioning probability map may be pooled to reduce the noise interference. For example, the positioning probability map may be averaged and pooled first to obtain a first probability map, then the positioning probability map may be averaged and pooled maximum sequentially to obtain a second probability map (certainly, the first probability map may also be directly pooled maximum to obtain the second probability map), then, it is determined that a pixel point with the same probability in the first probability map and the second probability map is determined as a target pixel point, and it is determined whether the prediction probability of the target pixel point is greater than a preset threshold, and if so, the target pixel point is determined as a key point.

For example, for a positioning probability map corresponding to a target image output by a neural network, average pooling processing may be performed on the positioning probability map by using a convolution kernel with a certain size and step length (for example, a convolution kernel with a size of 3 × 3 and a step length of 1) to obtain an average pooled first probability map, then maximum pooling processing is performed on the average pooled first probability map by using a convolution kernel with a certain size and step length (for example, a convolution kernel with a size of 3 × 3 and a step length of 1) to obtain a maximum pooled second probability map, then the first probability map and the second probability map are compared, a point with the same probability in the two maps is determined as a target pixel point, that is, a peak pixel point, and then it is determined whether the target pixel point is greater than a preset threshold, and if the target pixel point is greater than the preset threshold, the point is considered as a key point. By the method, the influence of noise can be eliminated, and the peak pixel point can be accurately determined, so that the finally determined key point is more accurate.

After the position of the target object in the target image is determined according to the position of the key point, the positioning result may be output in the form of a key point positioning map, for example, a pixel point that is a key point in the target image may be represented as 1, and a pixel point that is not a key point may be represented as 0, so as to obtain a key point positioning map, and the target object in the target image may be further subjected to subsequent processing such as counting, tracking, and the like according to the key point positioning map.

To further explain the positioning method of the target object in the embodiment of the present application, the following is explained with reference to a specific embodiment.

In the field of video monitoring, people in a monitored video or image generally need to be positioned so as to perform subsequent processing such as counting, tracking, behavior analysis and the like on the people, an accurate positioning result is a key for ensuring the accuracy of a subsequent processing result, and at present, a single-frame image to be detected is generally input into a pre-trained neural network, and the position of the people in the image is predicted through the neural network. When there is a moving object in the image, for example, a person moves, the image may be blurred, and the accuracy of the final prediction result may be affected. Based on this, the embodiment of the application provides a positioning method based on a video, which is characterized in that multi-frame images in a video segment are input into a neural network, then feature maps of the multi-frame images are fused based on a local attention mechanism, and the positions of people are predicted based on the fused feature maps, so that the positioning accuracy is improved.

Specifically, the method comprises a neural network training phase and a neural network prediction phase.

The neural network training phase comprises the following steps:

1. a plurality of sections of crowd videos are collected, the video scenes are diversified as much as possible, and the videos can include places with large people flow, such as squares, markets, subway stations, tourist attractions and the like. After the video is collected, 5 frames per second of down-sampling processing are carried out on the video, the video frames obtained by down-sampling are cut, and interested crowd areas are reserved. And then labeling the cut video frames, and labeling the position of the head center in each frame of video frames.

As for each figure in the video frame, the user only marks one pixel point as the head central point, the number of the head central points in the image is less, and the training of the convolutional neural network is not facilitated. In order to obtain a better neural network training result, one or more adjacent pixel points of the head center point labeled by the user can be labeled as the head center point to obtain a real positioning diagram Y for training the neural network. For example, for each video frame I (where the height and width of the video frame I are H and W, respectively), the center of the head labeled by the user in the video frame is

(a_iThe coordinate of the center point of the head is assumed, and the number of the heads in the figure is n). The real localization map Y for training the convolutional neural network can be determined according to formula (1), formula (2) and formula (3) (where the height and width of the real localization map are H, W, respectively),

wherein:

x is the coordinate in the image, representing the convolution operation, and K is the convolution kernel, e.g., K ═ 0,1, 0; 1,1, 1; 0,1,0], n is the head number, ai is the head center point, δ (·) is the multivariate impulse function, namely:

2. three adjacent frames in the video frames obtained by the method are input into a convolutional neural network, wherein the convolutional neural network has a structure as shown in fig. 6 and comprises a feature extraction module, a local attention module and a positioning prediction module. In the input three adjacent frames of images, one frame of the three adjacent frames of images can be preset as an image to be detected, for example, an intermediate frame is used as the image to be detected. The feature extraction module can be a VGG-16 network front 13 layer pre-trained on ImageNet, and three feature maps with 512 channels and the size of original drawing 1/8 can be obtained after feature extraction is carried out on three adjacent video frames by the feature extraction module.

3. The three feature maps are input to the local attention module, wherein the local attention module typically has three inputs: query map, key map, and value map. Wherein, the query graph is a characteristic graph corresponding to the image to be detected, and each characteristic graph in the three characteristic graphs can be constructedThe key map and the value map are in a pair, that is, the key map and the value map are the same frame feature map, so that 3 pairs of the key map and the value map can be formed in total. Suppose the query graph is Q ∈ R^h×w×cThe key map and the value map are

Kⁱ,Vⁱ∈R^h×w×cWherein h represents the height of the feature map, w represents the width of the feature map, c represents the number of channels of the feature map, N represents the logarithm of the key map and the value map, and there are three feature maps, i is 3, i is 1-3, R represents the feature map for each pixel position (x, y) in the query map Q, and a square neighborhood N (x, y) { (a, b) | x-a | ≦ k, and | y-b | ≦ k } is generated, where (a, b) is the coordinate of the pixel point in the square neighborhood N, and k is the radius of the square neighborhood. Then, the local attention output of the fused feature map at the pixel position (x, y) is calculated by the following formula (4),

therein, can

To characterize a transposed vector of vectors for the features of a pixel in the query graph with pixel coordinates (x, y),

to characterize a vector of the features of a pixel of pixel coordinate (a, b) in the key map, the similarity of two pixels can be determined by inner-product the two vectors, and then further converting the inner-product into a fusion weight by softmax function.

For a vector characterizing a pixel with pixel coordinates (a, b) in the key map, by pairing

Weighted summation is carried out, namely a characteristic target feature map can be obtainedThe medium pixel coordinate is a vector of the features of the pixel of (x, y).

And the local attention output of all pixel positions of all the fused feature maps can be obtained by repeatedly using the formula. The local attention module may output a fused feature map of the same size as the original feature map (1/8 original image).

4. And inputting the fused feature map into a positioning prediction module. The positioning prediction module firstly uses a three-layer convolutional neural network (the convolutional kernel size is 3, the void rate is 2, and the number of channels is 512) to further extract the features of the fused feature map, then uses three transposition convolutions (the convolutional kernel size is 4, the step length is 2, and the number of channels is 256, 128 and 64 respectively) to convert the fused feature map into the original map size, accesses a common convolutional layer (the convolutional kernel size is 3, the void rate is 2, and the number of channels is 256, 128 and 64 respectively) after each transposition convolution to realize feature extraction, and finally uses 1 × 1 convolution to convert the number of channels of the feature map into 1 to obtain a positioning probability map

Assuming a predicted positioning probability map as

The real positioning map is Y. The location cross entropy loss can be calculated according to equation (5)

Where λ is the positive sample weight, responsible for balancing the positive and negative samples, which may be set to 100.

5. Optimizing the network parameters by using random gradient descent after obtaining the loss function, and assuming that the network parameters are theta at the ith step_iThen, the network parameter θ at the step i +1 is calculated by the following formula (6)_i+1：

Where γ is the learning rate and is set to 0.0001. The above steps are repeated until the network parameters are not changed.

The neural network prediction stage is as follows:

after preprocessing such as downsampling and cutting out interested crowd areas of a crowd video to be detected, inputting three adjacent frames in video frames obtained through preprocessing into a trained convolutional neural network, and outputting a predicted positioning probability map of an intermediate frame by the convolutional neural network.

Then, the following non-maximum suppression steps are carried out on the positioning probability map to obtain a final key point positioning map:

firstly carrying out average pooling on the predicted probability graph with the kernel size of 3 and the step size of 1 to suppress noise, and then carrying out maximum pooling operation on the pooled probability graph with the kernel size of 3 and the step size of 1; and then comparing the mean pooling image with the maximum pooling image, taking pixel points with the same probability in the two frames of images as target pixel points, finally comparing the target pixel points with a preset threshold, setting the pixel larger than the preset threshold as 1, and otherwise, setting the pixel to be 0 to obtain a final key point positioning image. Based on the key point location map, the position of the crowd in the image can be determined.

By utilizing the multi-frame video image to position the crowd, the time sequence information in the video image can be mined, and the crowd positioning precision is higher than that based on a single image in the prior art. Meanwhile, by using a local self-attention mechanism, the spatio-temporal information of the feature maps of the multi-frame video images is fused, the local correlation among pixels in the video images is captured, more sufficient information can be mined, and a better positioning effect can be obtained even under the condition that a moving object exists in the video images.

The embodiment of the present disclosure further provides a target object positioning apparatus, as shown in fig. 7, the target object positioning apparatus 70 includes:

an obtaining module 71, configured to obtain at least two frames of target images from continuously acquired multiple frames of images, where the at least two frames of target images include an image to be detected;

a target feature map determining module 72, configured to determine, for each first pixel in the feature map of the image to be detected, a second pixel in the feature maps of the at least two frames of target images based on a pixel position where the first pixel is located; determining a fusion weight of the second pixel based on the similarity of the second pixel to the first pixel; fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic diagram so as to determine the target characteristic diagram;

and the positioning module 73 is configured to determine the position of the target object in the image to be detected based on the positioning prediction of the target feature map.

In some embodiments, the target feature map determining module, when determining the fusion weight of the second pixel based on the similarity between the second pixel and the first pixel, is specifically configured to:

In some embodiments, the target feature map determining module is configured to fuse the second pixels based on the fusion weight to obtain the feature of the pixel position in the target feature map, and when determining the target feature map, is specifically configured to:

obtaining a second vector characterizing the second pixel;

In some embodiments, the positioning module, when determining the position of the target object in the image to be detected based on the positioning prediction of the target feature map, is specifically configured to:

In some embodiments, the positioning module, when determining the location of the keypoint in the target image based on the positioning probability map, is specifically configured to:

An embodiment of the present disclosure further provides an electronic device, as shown in fig. 8, where the electronic device includes a processor 81, a memory 82, and a computer instruction stored in the memory 82 and executable by the processor 81, and when the processor 81 executes the computer instruction, the method in any of the above embodiments is implemented.

The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method of any of the foregoing embodiments.

Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, and the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more software and/or hardware when implementing the embodiments of the present disclosure. And part or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The foregoing is only a specific embodiment of the embodiments of the present disclosure, and it should be noted that, for those skilled in the art, a plurality of modifications and decorations can be made without departing from the principle of the embodiments of the present disclosure, and these modifications and decorations should also be regarded as the protection scope of the embodiments of the present disclosure.

Claims

1. A method for locating a target object, the method comprising:

2. The method of claim 1, wherein the higher the similarity of the second pixel to the first pixel, the greater the fusion weight of the second pixel.

3. The method of claim 1 or 2, wherein determining the fusion weight of the second pixel based on the similarity of the second pixel to the first pixel comprises:

4. The method according to any one of claims 1-3, wherein fusing the second pixels based on the fusion weight to obtain the feature of the pixel position in the target feature map to determine the target feature map comprises:

obtaining a second vector characterizing the second pixel;

5. The method according to any of claims 1-4, wherein the second pixel comprises a pixel within a target pixel region, the target pixel region being a region surrounding the pixel position or a neighboring region of the pixel position in the feature map of the at least two frames of target images.

6. The method according to any of claims 1-5, wherein the at least two frames of target images are obtained based on:

7. The method according to any one of claims 1 to 6, wherein determining the position of the target object in the image to be detected based on the positioning prediction of the target feature map comprises:

8. The method of claim 7, wherein determining the location of the keypoint in the target image based on the localization probability map comprises:

9. The method of claim 1, wherein the method is implemented by a pre-trained neural network trained based on:

determining a loss based on a difference between the sample positioning probability map and a true positioning probability map corresponding to the target sample image, and optimizing the neural network based on the loss, wherein the true positioning probability map is determined based on the labeling information.

10. An apparatus for locating a target object, the apparatus comprising:

11. An electronic device, comprising a processor, a memory, and a computer program stored in the memory for execution by the processor, wherein the processor, when executing the computer program, implements the method of any of claims 1-9.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed, implements the method of any one of claims 1-9.