
CN108875750B - Object detection method, device and system and storage medium - Google Patents

Object detection method, device and system and storage medium

Info

Publication number
CN108875750B
Authority
CN
China
Prior art keywords
scene
network
region
object candidate
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710740825.0A
Other languages
Chinese (zh)
Other versions
CN108875750A (en)
Inventor
王志成
俞刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN201710740825.0A priority Critical patent/CN108875750B/en
Publication of CN108875750A publication Critical patent/CN108875750A/en
Application granted granted Critical
Publication of CN108875750B publication Critical patent/CN108875750B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides an object detection method, device and system and a storage medium. The method comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a scene network in an object detection network to obtain a scene feature map related to scene information of the image to be detected; inputting the image to be detected into an object network in the object detection network to obtain an object feature map related to object information of the image to be detected, and determining a first number of object candidate regions indicating object positions in the object feature map; for each of the first number of object candidate regions, mapping the object candidate region onto the scene feature map to determine a scene region feature corresponding to the object candidate region; combining the features of the object candidate region in the object feature map with the corresponding scene region features; and inputting the combined features into a classification network in the object detection network to obtain an object detection result. The invention can improve the accuracy of object detection.

Description

Object detection method, device and system and storage medium
Technical Field
The present invention relates to the field of image processing, and more particularly, to an object detection method, apparatus and system, and a storage medium.
Background
Object detection is an important problem in the field of computer vision and has wide application, for example in detecting specific objects (people or things) in autonomous driving, robotics, or security scenarios. Current object detection methods are mainly improved algorithms based on the region-based convolutional neural network (RCNN) and single-stage algorithms, and in the training phase these algorithms are trained using only object detection databases (e.g., Pascal VOC, COCO, ImageNet-det). In practice, however, a person's recognition of an object is implicitly influenced by the scene. For example, when scene information is taken into account, a white spherical object appearing in a swimming pool would most likely be taken for a swimming cap, whereas the same object appearing on a volleyball court would more likely be identified as a volleyball. Current object detection methods usually express their understanding of scene information merely by enlarging the receptive field, which is clearly only an approximation that falls far short of true scene understanding; as a result, scene understanding cannot fully contribute to object detection, which limits the improvement of detection accuracy to a certain extent.
Disclosure of Invention
The present invention has been made in view of the above problems. The invention provides an object detection method, device and system and a storage medium.
According to an aspect of the present invention, there is provided an object detection method. The object detection method includes: acquiring an image to be detected; inputting an image to be detected into a scene network in an object detection network to obtain a scene characteristic diagram related to scene information of the image to be detected; inputting an image to be detected into an object network in an object detection network to obtain an object feature map associated with object information of the image to be detected, and determining a first number of object candidate regions indicating object positions in the object feature map; for each of the first number of object candidate regions, mapping the object candidate region onto a scene feature map to determine a scene region feature corresponding to the object candidate region; for each of the first number of object candidate regions, combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and for each of the first number of object candidate regions, inputting the combined features into a classification network in an object detection network to obtain an object detection result.
For example, for each of the first number of object candidate regions, mapping the object candidate region onto the scene feature map to determine a scene region feature corresponding to the object candidate region comprises: calculating the overlapping degree of the object candidate region and each scene region in the pre-divided scene regions in the scene characteristic diagram; selecting the scene area with the maximum overlapping degree as the associated scene area of the object candidate area; and extracting the feature of the associated scene area from the scene feature map as the scene area feature corresponding to the object candidate area.
For example, for each of the first number of object candidate regions, mapping the object candidate region onto the scene feature map to determine a scene region feature corresponding to the object candidate region comprises: scaling the object candidate region to obtain a scaled region; determining a region in the scene characteristic map, which is consistent with the position of the zoomed region, as an associated scene region of the object candidate region; and extracting the feature of the associated scene area from the scene feature map as the scene area feature corresponding to the object candidate area.
Illustratively, for each of the first number of object candidate regions, inputting the combined features into a classification network of the object detection networks to obtain the object detection result comprises: for each of the first number of object candidate regions, inputting the combined features into a classification network to obtain coordinates of each object candidate region in the second number of object candidate regions output by the classification network and a confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category; filtering the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and determining the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category as the object detection result.
For example, for each of the first number of object candidate regions, combining the feature of the object candidate region in the object feature map with the corresponding scene region feature to obtain a combined feature includes: inputting the characteristics of the object candidate region and the corresponding scene region characteristics in the object characteristic diagram into a splicing network in the object detection network to obtain combined characteristics.
Illustratively, the scene network and/or the object network is a full convolutional network.
Illustratively, the classification network is a fully connected network or a convolutional network.
Illustratively, the object detection method further comprises: and training an object detection network by using the sample object images marked with the object classes in the object classification database.
According to another aspect of the present invention, there is provided an object detecting apparatus comprising: the image acquisition module is used for acquiring an image to be detected; the scene network module is used for inputting the image to be detected into a scene network in the object detection network so as to obtain a scene characteristic diagram related to scene information of the image to be detected; an object network module, configured to input an image to be detected into an object network in an object detection network, to obtain an object feature map related to object information of the image to be detected, and to determine a first number of object candidate regions indicating object positions in the object feature map; the mapping module is used for mapping each object candidate region of the first number onto the scene characteristic map so as to determine the scene region characteristic corresponding to the object candidate region; a combining module, configured to combine, for each of the first number of object candidate regions, a feature of the object candidate region in the object feature map with a corresponding scene region feature to obtain a combined feature; and a detection result obtaining module for inputting the combined features into a classification network in the object detection network for each of the first number of object candidate regions to obtain an object detection result.
Illustratively, the mapping module includes: an overlap degree operator module for calculating, for each of a first number of object candidate regions, an overlap degree of the object candidate region with each of pre-divided scene regions in the scene feature map; a selection submodule, configured to, for each of a first number of object candidate regions, select a scene region with the largest degree of overlap as an associated scene region of the object candidate region; and a first feature extraction sub-module, configured to, for each of the first number of object candidate regions, extract, from the scene feature map, a feature of the associated scene region as a scene region feature corresponding to the object candidate region.
Illustratively, the mapping module includes: a scaling sub-module for scaling, for each of a first number of object candidate regions, the object candidate region to obtain a scaled region; the associated scene area determining submodule is used for determining an area, consistent with the position of the zoomed area, in the scene characteristic map as an associated scene area of the object candidate area for each of the first number of object candidate areas; and a second feature extraction sub-module, configured to, for each of the first number of object candidate regions, extract, from the scene feature map, a feature of the associated scene region as a scene region feature corresponding to the object candidate region.
Illustratively, the detection result obtaining module includes: a classification network input sub-module, configured to input the combined features into a classification network for each of the first number of object candidate regions, so as to obtain coordinates of each object candidate region in the second number of object candidate regions output by the classification network and a confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category; a filtering submodule, configured to filter the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and a detection result determining submodule for determining the coordinates of each object candidate region of the third number of object candidate regions and the confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category as an object detection result.
Illustratively, the bonding module includes: and the splicing submodule is used for inputting the characteristics of the object candidate regions in the object characteristic diagram and the characteristics of the corresponding scene regions into a splicing network in the object detection network for each of the first number of object candidate regions so as to obtain the combined characteristics.
Illustratively, the scene network and/or the object network is a full convolutional network.
Illustratively, the classification network is a fully connected network or a convolutional network.
Illustratively, the object detection device further includes: and the training module is used for training the object detection network by utilizing the sample object images marked with the object categories in the object classification database.
According to another aspect of the invention, there is provided an object detection system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for execution by the processor to perform the steps of: acquiring an image to be detected; inputting an image to be detected into a scene network in an object detection network to obtain a scene characteristic diagram related to scene information of the image to be detected; inputting an image to be detected into an object network in an object detection network to obtain an object feature map associated with object information of the image to be detected, and determining a first number of object candidate regions indicating object positions in the object feature map; for each of the first number of object candidate regions, mapping the object candidate region onto a scene feature map to determine a scene region feature corresponding to the object candidate region; for each of the first number of object candidate regions, combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and for each of the first number of object candidate regions, inputting the combined features into a classification network in an object detection network to obtain an object detection result.
For example, the step of mapping the object candidate region onto the scene feature map for each of a first number of object candidate regions to determine a scene region feature corresponding to the object candidate region, when executed by the processor, comprises: calculating the overlapping degree of the object candidate region and each scene region in the pre-divided scene regions in the scene characteristic diagram; selecting the scene area with the maximum overlapping degree as the associated scene area of the object candidate area; and extracting the feature of the associated scene area from the scene feature map as the scene area feature corresponding to the object candidate area.
For example, the step of mapping the object candidate region onto the scene feature map for each of a first number of object candidate regions to determine a scene region feature corresponding to the object candidate region, when executed by the processor, comprises: scaling the object candidate region to obtain a scaled region; determining a region in the scene characteristic map, which is consistent with the position of the zoomed region, as an associated scene region of the object candidate region; and extracting the feature of the associated scene area from the scene feature map as the scene area feature corresponding to the object candidate area.
Illustratively, the step of inputting the combined features into a classification network of the object detection network for each of the first number of object candidate regions to obtain object detection results, the computer program instructions being for execution by the processor to include: for each of the first number of object candidate regions, inputting the combined features into a classification network to obtain coordinates of each object candidate region in the second number of object candidate regions output by the classification network and a confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category; filtering the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and determining the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category as the object detection result.
Illustratively, the step of combining the features of the object candidate regions in the object feature map with the corresponding scene region features to obtain combined features for each of the first number of object candidate regions, which is executed by the processor, comprises: inputting the characteristics of the object candidate region and the corresponding scene region characteristics in the object characteristic diagram into a splicing network in the object detection network to obtain combined characteristics.
Illustratively, the scene network and/or the object network is a full convolutional network.
Illustratively, the classification network is a fully connected network or a convolutional network.
Illustratively, the computer program instructions when executed by the processor are further for performing the steps of: and training an object detection network by using the sample object images marked with the object classes in the object classification database.
According to another aspect of the present invention there is provided a storage medium having stored thereon program instructions operable when executed to perform the steps of: acquiring an image to be detected; inputting an image to be detected into a scene network in an object detection network to obtain a scene characteristic diagram related to scene information of the image to be detected; inputting an image to be detected into an object network in an object detection network to obtain an object feature map associated with object information of the image to be detected, and determining a first number of object candidate regions indicating object positions in the object feature map; for each of the first number of object candidate regions, mapping the object candidate region onto a scene feature map to determine a scene region feature corresponding to the object candidate region; for each of the first number of object candidate regions, combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and for each of the first number of object candidate regions, inputting the combined features into a classification network in an object detection network to obtain an object detection result.
For example, the step of mapping the object candidate region onto the scene feature map for each of a first number of object candidate regions to determine a scene region feature corresponding to the object candidate region, which is performed by the program instructions when executed, comprises: calculating the overlapping degree of the object candidate region and each scene region in the pre-divided scene regions in the scene characteristic diagram; selecting the scene area with the maximum overlapping degree as the associated scene area of the object candidate area; and extracting the feature of the associated scene area from the scene feature map as the scene area feature corresponding to the object candidate area.
For example, the step of mapping the object candidate region onto the scene feature map for each of a first number of object candidate regions to determine a scene region feature corresponding to the object candidate region, which is performed by the program instructions when executed, comprises: scaling the object candidate region to obtain a scaled region; determining a region in the scene characteristic map, which is consistent with the position of the zoomed region, as an associated scene region of the object candidate region; and extracting the feature of the associated scene area from the scene feature map as the scene area feature corresponding to the object candidate area.
Illustratively, the step of inputting the combined features into a classification network of the object detection network for each of the first number of object candidate regions for execution by the program instructions when executed to obtain the object detection result comprises: for each of the first number of object candidate regions, inputting the combined features into a classification network to obtain coordinates of each object candidate region in the second number of object candidate regions output by the classification network and a confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category; filtering the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and determining the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category as the object detection result.
For example, the step of combining the features of the object candidate regions in the object feature map with the corresponding scene region features to obtain combined features for each of the first number of object candidate regions, which is executed by the program instructions when executed, includes: inputting the characteristics of the object candidate region and the corresponding scene region characteristics in the object characteristic diagram into a splicing network in the object detection network to obtain combined characteristics.
Illustratively, the scene network and/or the object network is a full convolutional network.
Illustratively, the classification network is a fully connected network or a convolutional network.
Illustratively, the program instructions are further operable when executed to perform the steps of: and training an object detection network by using the sample object images marked with the object classes in the object classification database.
According to the object detection method, device and system and the storage medium of the embodiments of the invention, the features of the object candidate regions are associated and combined with the corresponding scene region features, and the scene information is genuinely used in object detection, so that the accuracy of object detection can be improved.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing an object detection method and apparatus in accordance with embodiments of the invention;
FIG. 2 shows a schematic flow diagram of an object detection method according to an embodiment of the invention;
FIG. 3 shows a schematic diagram of an object detection procedure according to one embodiment of the invention;
FIG. 4a illustrates an object-scene probability map statistically derived for the probability of each of a plurality of classes of objects appearing in each class of scene, according to one embodiment of the invention;
FIG. 4b illustrates a scene-object probability map statistically obtained for the probability that each of a plurality of classes of scenes contains each class of objects, according to one embodiment of the invention;
FIG. 5 shows a schematic block diagram of an object detection apparatus according to an embodiment of the present invention; and
FIG. 6 shows a schematic block diagram of an object detection system according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of embodiments of the invention and not all embodiments of the invention, with the understanding that the invention is not limited to the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention described herein without inventive step, shall fall within the scope of protection of the invention.
In order to solve the above-mentioned problems, embodiments of the present invention provide an object detection method and apparatus. The object detection method described herein applies scene understanding to object detection on the basis of conditional probability theory: a scene network serves as the carrier of scene information, and its output is fused with that of the object network before classification. In this way, the detected object information is fused with the scene information, so that scene understanding is genuinely applied to object detection. The fusion process is in fact a process of applying the conditional probability between objects and scenes to object detection. The object detection method provided by the embodiments of the invention can be applied in various fields that use object detection technology, such as detecting specific objects (people or things) in autonomous driving, robotics, or security scenarios.
First, an example electronic device 100 for implementing an object detection method and apparatus according to an embodiment of the present invention is described with reference to fig. 1.
As shown in FIG. 1, electronic device 100 includes one or more processors 102, one or more memory devices 104, an input device 106, an output device 108, and an image capture device 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. On which one or more computer program instructions may be stored that may be executed by processor 102 to implement client-side functionality (implemented by the processor) and/or other desired functionality in embodiments of the invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, etc.
The image capture device 110 may capture images (including video frames) and store the captured images in the storage device 104 for use by other components. The image capture device 110 may be a surveillance camera. It should be understood that the image capture device 110 is merely an example, and the electronic device 100 may not include the image capture device 110. In this case, the image to be detected may be captured by other devices having image capturing capability, and the captured image may be transmitted to the electronic apparatus 100.
Exemplary electronic devices for implementing the object detection methods and apparatus according to embodiments of the present invention may be implemented on devices such as personal computers or remote servers, for example.
Next, an object detection method according to an embodiment of the present invention will be described with reference to fig. 2. FIG. 2 shows a schematic flow diagram of an object detection method 200 according to one embodiment of the invention. As shown in fig. 2, the object detection method 200 includes the following steps.
In step S210, an image to be detected is acquired.
The image to be detected may be any image that requires object detection. Object detection may include detection of the location of an object and its category to which it belongs. The image to be detected can be an original image acquired by an image acquisition device, or can be an image obtained after the original image is preprocessed. In addition, the image to be detected may be a single still image or a certain video frame in the video stream.
The image to be detected may be sent by a client device (such as a security device including a monitoring camera) to the electronic device 100 for object detection by the processor 102 of the electronic device 100, or may be acquired by an image acquisition device 110 included in the electronic device 100 and transmitted to the processor 102 for object detection.
FIG. 3 shows a schematic diagram of an object detection procedure according to one embodiment of the invention. Referring to fig. 3, after the image to be detected is received, it is sent to two networks simultaneously. The two networks correspond to a scene network and an object network (Object-net), respectively. The image to be detected may be input into the scene network and the object network separately for processing.
In step S220, the image to be detected is input into a scene network in the object detection network to obtain a scene feature map related to scene information of the image to be detected.
By way of example and not limitation, the scene network (Scene-net) may be a full convolutional network; for example, the scene network may be implemented using the fully convolutional part of a VGG or ResNet model. After the image to be detected is input into the scene network, one or more feature maps are finally output after multiple convolution, pooling and similar operations; these are the scene feature maps. In one example, the scene network outputs 512 scene feature maps.
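As a non-limiting illustration, the sketch below builds such a scene network from the convolutional part of VGG16; the choice of torchvision's VGG16 and the 512×512 input size are assumptions for this example only, not requirements of the invention.

```python
import torch
import torchvision

# A possible scene network: the convolutional part of VGG16. For a 3 x H x W input
# it yields 512 feature maps of size H/32 x W/32. This is only one way to realize
# the "full convolutional network" mentioned above.
vgg = torchvision.models.vgg16()               # structure only, no pretrained weights needed here
scene_net = vgg.features                       # drop the fully connected classifier head

image_to_detect = torch.randn(1, 3, 512, 512)  # dummy image to be detected
scene_feature_maps = scene_net(image_to_detect)
print(scene_feature_maps.shape)                # torch.Size([1, 512, 16, 16])
```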
In step S230, an image to be detected is input to an object network in the object detection network to obtain an object feature map related to object information of the image to be detected, and a first number of object candidate regions indicating object positions in the object feature map are determined.
By way of example and not limitation, the object network may be a full convolutional network, e.g., the object network may be implemented using a full convolutional network portion in a VGG or ResNet model.
After the image to be detected is input into the object network, the object network may process it in a manner similar to a Region Proposal Network (RPN), and finally output one or more feature maps, that is, the object feature maps. In one example, the object network outputs 512 object feature maps. The number of object feature maps and the number of scene feature maps may be the same or different. In addition, the object network also outputs the coordinates of a number of (e.g., 200) object candidate regions (proposals). In this way, a first number of object candidate regions is determined. The first number refers to the number of object candidate regions whose coordinates are output by the object network, and its value may be changed as needed; the invention does not limit this. Illustratively, an object candidate region may be represented by a rectangular box.
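The sketch below is a deliberately simplified, hypothetical stand-in for such an object network: a small convolutional backbone plus an RPN-like head that outputs an object feature map and the coordinates of a fixed number of candidate regions. A real implementation would use anchors, bounding-box regression targets and proposal filtering; those details are omitted here.

```python
import torch
import torch.nn as nn

class SimpleObjectNet(nn.Module):
    """Toy stand-in for the object branch: a conv backbone plus an RPN-like head
    that scores one box per feature-map cell. Real RPNs use multiple anchors,
    bounding-box regression and proposal filtering; this is only a structural sketch."""
    def __init__(self, channels=512, num_candidates=200):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.objectness = nn.Conv2d(channels, 1, 1)   # one objectness score per cell
        self.box_coords = nn.Conv2d(channels, 4, 1)   # (x1, y1, x2, y2) per cell
        self.num_candidates = num_candidates

    def forward(self, image):
        feats = self.backbone(image)                  # object feature map(s)
        scores = self.objectness(feats).flatten(1)    # (N, H*W)
        boxes = self.box_coords(feats).flatten(2)     # (N, 4, H*W)
        k = min(self.num_candidates, scores.shape[1])
        topk = scores.topk(k, dim=1).indices          # keep the k highest-scoring cells
        candidates = boxes.gather(2, topk.unsqueeze(1).expand(-1, 4, -1))
        return feats, candidates.permute(0, 2, 1)     # feature map, (N, k, 4) candidate boxes
```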
In step S240, for each of the first number of object candidate regions, the object candidate region is mapped onto the scene feature map to determine a scene region feature corresponding to the object candidate region.
In one example, step S240 may include: for each of the first number of object candidate regions, calculating a degree of overlap of the object candidate region with each of the pre-divided scene regions in the scene feature map; for each of the first number of object candidate regions, selecting a scene region with the largest degree of overlap as an associated scene region of the object candidate region; and for each of the first number of object candidate regions, extracting features of the associated scene region from the scene feature map as scene region features corresponding to the object candidate regions.
By way of example and not limitation, at least two sub-regions (referred to as scene regions) may be partitioned in the scene feature map. It is noted that there may be overlap between different scene areas. The scene areas of different scene feature maps are divided in the same manner.
With continued reference to fig. 3, in the example shown there each scene feature map is divided into five scene regions: four scene regions obtained by dividing the map with a cross, and one scene region centered on the center of the feature map. The scene region that overlaps a given object candidate region to the greatest extent is selected from the five scene regions as the associated scene region of that object candidate region; in the example shown in fig. 3 this is the central scene region (shown by a thick line). It should be noted that the object network may detect many object candidate regions; fig. 3 only shows one of them as an example. Illustratively, the degree of overlap may be represented by the intersection over union (IoU) of the object candidate region and the scene region.
After the associated scene area is found, the feature value corresponding to the associated scene area (i.e. the feature of the associated scene area) is extracted from the scene feature map to obtain the scene area feature corresponding to the object candidate area.
The way of dividing the scene regions, in particular the division shown in fig. 3, has been verified in practice to be a reasonable division of the scene; after the associated scene region is determined and the scene region features are obtained on this basis, the accuracy of the final object detection result is higher.
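A minimal sketch of this mapping is given below, assuming the five-region division of fig. 3 and axis-aligned boxes expressed in feature-map coordinates; the helper names are illustrative only.

```python
import torch

def divide_scene_regions(height, width):
    """Five scene regions as in fig. 3: four quadrants plus a centered region,
    each given as (x1, y1, x2, y2) in feature-map coordinates."""
    hw, hh = width // 2, height // 2
    qw, qh = width // 4, height // 4
    return [
        (0, 0, hw, hh), (hw, 0, width, hh),             # top-left, top-right
        (0, hh, hw, height), (hw, hh, width, height),   # bottom-left, bottom-right
        (qw, qh, width - qw, height - qh),              # centered region
    ]

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def associated_scene_feature(candidate_box, scene_feature_map):
    """Pick the pre-divided scene region with the largest IoU and return its features."""
    _, h, w = scene_feature_map.shape
    regions = divide_scene_regions(h, w)
    x1, y1, x2, y2 = max(regions, key=lambda r: iou(candidate_box, r))
    return scene_feature_map[:, y1:y2, x1:x2]
```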
Of course, the present invention is not limited to the above examples, and other reasonable ways may be adopted to determine the associated scene area and further obtain the scene area characteristics. Another way of determining the associated scene area is described below.
In another example, step S240 may include: for each of a first number of object candidate regions, scaling the object candidate region to obtain a scaled region; for each of the first number of object candidate regions, determining a region in the scene feature map, which is consistent with the position of the zoomed region, as an associated scene region of the object candidate region; and for each of the first number of object candidate regions, extracting features of the associated scene region from the scene feature map as scene region features corresponding to the object candidate regions.
For example, after a certain object candidate region is determined, its area may be enlarged by 1.5 times to obtain a scaled region. It will be appreciated that the coordinates of the scaled region can be calculated from those of the object candidate region. The coordinates of the scaled region are then applied directly to the scene feature map to obtain the corresponding region, i.e. the associated scene region. The scene region features are then extracted in the same way as in the previous example, which is not repeated here. This way of obtaining the associated scene region is simple to implement, and the region obtained by scaling around the position of the object candidate region fairly accurately covers the scene information surrounding it, so the features of the object candidate region are fused with the scene region features most relevant to it, which helps to further improve the accuracy of object detection.
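The following sketch illustrates this second option, assuming a 1.5× enlargement in area as in the example above and a scene feature map of the same size as the object feature map; all names are placeholders.

```python
def scaled_scene_feature(candidate_box, scene_feature_map, area_scale=1.5):
    """Enlarge the candidate box (here 1.5x in area, matching the example above),
    keep its center, clip it to the map, and return the features of the region at
    the same position in the scene feature map."""
    _, h, w = scene_feature_map.shape
    x1, y1, x2, y2 = candidate_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    s = area_scale ** 0.5                       # scale each side by sqrt(area_scale)
    bw, bh = (x2 - x1) * s, (y2 - y1) * s
    nx1, ny1 = max(0, int(cx - bw / 2)), max(0, int(cy - bh / 2))
    nx2, ny2 = min(w, int(cx + bw / 2)), min(h, int(cy + bh / 2))
    return scene_feature_map[:, ny1:ny2, nx1:nx2]
```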
Preferably, the scene feature map is consistent with the object feature map in size, so as to find the relevant scene region corresponding to the object candidate region and having a proper size and position based on the object candidate region.
In step S250, for each of the first number of object candidate regions, the features of the object candidate region in the object feature map are combined with the corresponding scene region features to obtain combined features.
In one example, step S250 may include: for each of the first number of object candidate regions, inputting the features of the object candidate region in the object feature map and the corresponding scene region features into a stitching network in the object detection network to obtain combined features.
The combination may be performed by splicing, i.e., concatenation. The features of the object candidate region and the corresponding scene region features are concatenated and input into the classification network. If the first number is greater than 1, each object candidate region is processed separately: its features are concatenated with its corresponding scene region features and then input into the classification network.
It will be appreciated by those skilled in the art that the splicing may be implemented using a splicing network (e.g., a concat layer) that supports merging along a certain dimension (the num dimension or the channel dimension). The purpose of splicing is to combine multiple inputs into one output. Appropriate pooling operations may be performed during the splicing process.
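A minimal sketch of such a combination step is shown below, assuming channel-wise concatenation after pooling both region features to a common spatial size; the pooled size of 7×7 is an assumption, not prescribed by the invention.

```python
import torch
import torch.nn.functional as F

def combine_features(object_region_feat, scene_region_feat, pooled_size=7):
    """Pool both region features to a common spatial size, then concatenate them
    along the channel dimension (the splicing / concat operation described above)."""
    obj = F.adaptive_max_pool2d(object_region_feat, pooled_size)   # (C_obj, 7, 7)
    scn = F.adaptive_max_pool2d(scene_region_feat, pooled_size)    # (C_scn, 7, 7)
    return torch.cat([obj, scn], dim=0)                            # (C_obj + C_scn, 7, 7)
```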
In step S260, for each of the first number of object candidate regions, the combined features are input into a classification network in the object detection network to obtain an object detection result.
Illustratively, the classification network may be a fully connected network or a convolutional network. For example, the classification network may be made up of one or more fully connected layers, of one or more convolutional layers, or of one or more fully connected layers plus one or more convolutional layers. Those skilled in the art will appreciate that the classification network may also include other types of layers and other combinations of layers, which are not enumerated here.
The output of the classification network may include the coordinates of one or more object candidate regions and, for each object candidate region, a confidence (score) that the object candidate region belongs to each predetermined category. With continued reference to FIG. 3, the output of the classification network includes two parts: the "object candidate region" part gives the coordinates of each object candidate region, from which its location can be determined. The object candidate regions may be indicated by rectangular boxes on the original image to be detected.
The output of the classification network is expressed in numerical form. For example, the coordinates of each object candidate region may be represented by a four-dimensional vector [x1, y1, x2, y2], where [x1, y1] are the abscissa and ordinate of the upper-left corner of the object candidate region (a rectangular box) and [x2, y2] are the abscissa and ordinate of its lower-right corner. The confidence indicates the probability that the object in the corresponding object candidate region belongs to a certain category, and may be represented by a floating-point number. The number of predetermined categories may be set as desired. For example, assuming there are 10 predetermined categories, 10 confidences are obtained for each object candidate region, respectively representing the probabilities that the object candidate region belongs to the 10 predetermined categories.
In one example, the object detection result may include coordinates of each object candidate region of the second number of object candidate regions output by the classification network and a confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category. That is, the output of the classification network may be directly taken as the object detection result. In general, the second number and the first number have the same value.
In another example, the output of the classification network may be further processed to obtain object detection results. Exemplarily, step S260 may include: for each of the first number of object candidate regions, inputting the combined features into a classification network to obtain coordinates of each object candidate region in the second number of object candidate regions output by the classification network and a confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category; filtering the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and determining the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object candidate region corresponding to each object candidate region belongs to each predetermined category as the object detection result.
In the output result of the classification network, there may be a plurality of object candidate regions related to the same object, that is, there are unnecessary useless object candidate regions. Redundant object candidate regions can be excluded by adopting a non-maximum suppression (NMS) algorithm so as to improve the accuracy and the credibility of the object detection result. The implementation of the NMS algorithm is understood by those skilled in the art and will not be described in detail here.
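For illustration, this filtering step can be realized with a standard NMS routine, for instance the one provided by torchvision; the threshold value below is an assumption.

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, iou_threshold=0.5):
    """Suppress redundant candidate regions with standard non-maximum suppression.
    boxes: (N, 4) tensor of (x1, y1, x2, y2); scores: (N,) tensor of confidences."""
    keep = nms(boxes, scores, iou_threshold)       # indices of the surviving boxes
    return boxes[keep], scores[keep]

# Example: two heavily overlapping boxes collapse to the higher-scoring one.
boxes = torch.tensor([[10., 10., 50., 50.], [12., 12., 52., 52.], [100., 100., 140., 140.]])
scores = torch.tensor([0.9, 0.6, 0.8])
print(filter_detections(boxes, scores))
```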
The scene network, the object network, and the classification network described herein are network structures mainly composed of layers (e.g., convolutional layers, pooling layers, fully-connected layers, etc.), so all the network structures involved in the object detection method 200 can be regarded as one overall network model, referred to herein as the object detection network. The object detection network may include the scene network, the object network and the classification network, and may further include a network structure for combining the features of the object candidate regions with the corresponding scene region features (such as the above-mentioned splicing network). The entire object detection network may be trained in advance, and in actual use the trained object detection network is used to implement the object detection method 200 to determine whether an object exists in the image to be detected, the position of the object, and the confidence that the object belongs to each predetermined category.
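Putting the pieces together, a minimal end-to-end sketch of the inference flow of such an object detection network might look as follows; it assumes PyTorch-style tensors, uses the candidate box itself to index the scene feature map for brevity (the patent also describes scaling the box or picking the most-overlapping pre-divided region), and all module and function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def roi_feature(feature_map, box, output_size=7):
    """Crop box = (x1, y1, x2, y2) from a (C, H, W) feature map and pool it to a fixed size."""
    x1, y1, x2, y2 = [int(round(float(v))) for v in box]
    crop = feature_map[:, y1:max(y1 + 1, y2), x1:max(x1 + 1, x2)]
    return F.adaptive_max_pool2d(crop, output_size)

def detect_one_image(image, scene_net, object_net, classifier):
    scene_feats = scene_net(image)              # scene feature map, e.g. (512, H, W)
    object_feats, boxes = object_net(image)     # object feature map and iterable of (x1, y1, x2, y2)
    detections = []
    for box in boxes:
        obj_feat = roi_feature(object_feats, box)
        scn_feat = roi_feature(scene_feats, box)              # simplified scene-region mapping
        combined = torch.cat([obj_feat, scn_feat], dim=0)     # channel-wise combination
        detections.append(classifier(combined.flatten()))     # coordinates + per-class confidences
    return detections
```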
The order of execution of the steps of the object detection method 200 is not limited to the order shown in fig. 2, and may have other reasonable orders of execution. For example, step S230 may be performed before step S220. Preferably, step S230 is executed synchronously with step S220.
The object detection scheme according to the embodiment of the invention integrates the theoretical basis of conditional probability. The theory of conditional probability is briefly described below.
According to one example, the detection data in the Pascal VOC 2007 dataset are used as training data. For each annotated object, the region in which the object is located is passed through a scene classification network (whose structure may be the same as or similar to the scene network described herein) to determine which scene class that region belongs to, and the associated scene classes and object counts are then accumulated. FIG. 4a illustrates an object-scene probability map obtained by counting, for each of a plurality of object classes, the probability of that object appearing in each scene class, according to one embodiment of the invention; FIG. 4b illustrates a scene-object probability map obtained by counting, for each of a plurality of scene classes, the probability of that scene containing each object class, according to one embodiment of the invention. In fig. 4a and 4b, the rows represent 401 scene categories (including airports, bus stations, etc.) and the columns represent the 20 object categories of the Pascal VOC 2007 dataset; the intersection of each row and column represents the degree of association between the corresponding scene and object, with darker colors indicating higher association. As can be seen from fig. 4a and 4b, there is a strong correlation between scene categories and object categories. For example, the bus category is most strongly associated with the bus stop scene, and the other highly associated pairs are similarly reasonable.
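A simple way to compute such probability maps from (scene class, object class) pairs is sketched below; the array sizes match the 401 scene categories and 20 object categories mentioned above, and the data format is an assumption for illustration.

```python
import numpy as np

def cooccurrence_maps(samples, num_scenes=401, num_objects=20):
    """Build the two probability maps from (scene_id, object_id) pairs, where each pair
    records the scene class assigned to the region of one annotated object.
    Normalising the count matrix over objects gives P(object | scene); normalising
    over scenes gives P(scene | object)."""
    counts = np.zeros((num_scenes, num_objects))
    for scene_id, object_id in samples:
        counts[scene_id, object_id] += 1
    p_object_given_scene = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    p_scene_given_object = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)
    return p_object_given_scene, p_scene_given_object
```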
Therefore, when the object type is predicted, the scene features extracted by the scene network and the features of the objects can be spliced and then classified, and the detection mode integrates the theoretical basis of the conditional probability between the objects and the scene.
According to the object detection method provided by the embodiment of the invention, the features of the object candidate regions are associated and combined with the corresponding scene region features, and the scene information is genuinely used in object detection, so that the accuracy of object detection can be improved.
Illustratively, the object detection method according to embodiments of the present invention may be implemented in a device, apparatus, or system having a memory and a processor.
The object detection method according to the embodiment of the invention can be deployed at an image acquisition end, for example, in the field of unmanned driving, and can be deployed at a road vision recognition acquisition end of a driving system. Alternatively, the object detection method according to the embodiment of the present invention may also be distributively deployed at the server side (or cloud side) and the client side. For example, an image to be detected may be collected at a client, the client transmits the collected image to a server (or a cloud), and the server (or the cloud) detects an object.
According to an embodiment of the present invention, the object detection method 200 may further include: and training a scene network by using the sample scene images marked with the scene categories in the scene classification database.
In one example, the scene network may be trained in advance by using a scene classification database, such training is referred to as initialization of the scene network, and the obtained scene network has initialization parameters. The scene classification database may be the places2 database. Of course, other suitable databases for scene classification may be used to train the scene network, and the invention is not limited thereto.
The scene network with the initialization parameters can be further trained in the subsequent training process of the whole object detection network. Of course, the scene network may not participate in the training of other network structures. That is to say, in the whole training process of the object detection network, the object network and the classification network and the network structure (such as the above mentioned splicing layer) for combining the features of the object candidate region and the corresponding scene region features may be mainly trained, and the parameters in the scene network are kept fixed and are not updated. Of course, this training method is merely an example and is not limited, and the training of the network structure of each portion in the entire object detection network may be implemented in other suitable manners, which is not limited by the present invention.
According to an embodiment of the present invention, the object detection method 200 may further include: and training an object detection network by using the sample object images marked with the object classes in the object classification database.
For example, detection (detection) data in the Pascal VOC 2007 dataset can be used as training data, the training data includes a large number of sample object images, and the object type and the object position in each sample object image are labeled and recorded by using corresponding labeling data. The confidence that the object in the sample object image belongs to each predetermined category and the position of the object can be calculated by using a scene network, an object network, a classification network, a splicing network and the like with initialization parameters, and parameters (or weights) adopted in the object detection network are adjusted by using a back propagation algorithm until the training converges.
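One possible shape of such a training step is sketched below, with the scene network kept frozen as described earlier; the module and loss names are placeholders and this is not intended as the definitive training procedure of the invention.

```python
import torch

def train_step(batch, scene_net, object_net, fusion_net, classifier, optimizer, loss_fn):
    """One illustrative training step: the scene network is kept frozen (one of the
    options described above) while the other parts are updated by back-propagation."""
    images, targets = batch
    with torch.no_grad():                          # freeze the pretrained scene branch
        scene_feats = scene_net(images)
    object_feats, candidates = object_net(images)
    combined = fusion_net(object_feats, scene_feats, candidates)
    predictions = classifier(combined)
    loss = loss_fn(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```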
According to another aspect of the present invention, an object detecting apparatus is provided. FIG. 5 shows a schematic block diagram of an object detection apparatus 500 according to one embodiment of the present invention.
As shown in fig. 5, the object detecting apparatus 500 according to the embodiment of the present invention includes an image acquiring module 510, a scene network module 520, an object network module 530, a mapping module 540, a combining module 550, and a detection result acquiring module 560. The various modules may each perform the various steps/functions of the object detection method described above in connection with fig. 2-4. Only the main functions of the respective components of the object detection apparatus 500 will be described below, and details that have been described above will be omitted.
The image obtaining module 510 is used for obtaining an image to be detected. The image acquisition module 510 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The scene network module 520 is configured to input the image to be detected into a scene network in the object detection network to obtain a scene characteristic map related to scene information of the image to be detected. The scene network module 520 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The object network module 530 is configured to input an image to be detected into an object network in an object detection network to obtain an object feature map related to object information of the image to be detected, and determine a first number of object candidate regions indicating object positions in the object feature map. The object network module 530 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The mapping module 540 is configured to map, for each of the first number of object candidate regions, the object candidate region onto the scene feature map to determine a scene region feature corresponding to the object candidate region. The mapping module 540 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The combining module 550 is configured to combine, for each of the first number of object candidate regions, the feature of the object candidate region in the object feature map with the corresponding scene region feature to obtain a combined feature. The combining module 550 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The detection result obtaining module 560 is configured to, for each of the first number of object candidate regions, input the combined features into a classification network in an object detection network to obtain an object detection result. The detection result obtaining module 560 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
Illustratively, the mapping module 540 includes: an overlap degree operator module for calculating, for each of a first number of object candidate regions, an overlap degree of the object candidate region with each of pre-divided scene regions in the scene feature map; a selection submodule, configured to, for each of a first number of object candidate regions, select a scene region with the largest degree of overlap as an associated scene region of the object candidate region; and a first feature extraction sub-module, configured to, for each of the first number of object candidate regions, extract, from the scene feature map, a feature of the associated scene region as a scene region feature corresponding to the object candidate region.
Illustratively, the mapping module 540 includes: a scaling sub-module, configured to scale, for each of the first number of object candidate regions, the object candidate region to obtain a scaled region; an associated scene region determining sub-module, configured to determine, for each of the first number of object candidate regions, the region in the scene feature map whose position coincides with the scaled region as the associated scene region of the object candidate region; and a second feature extraction sub-module, configured to, for each of the first number of object candidate regions, extract, from the scene feature map, the feature of the associated scene region as the scene region feature corresponding to the object candidate region.
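For illustration only, a sketch of this scaling-based variant follows; the scale factor of 2.0, the centre-preserving scaling, and the average pooling are assumptions and not values taken from the disclosure.

```python
import numpy as np

def scaled_scene_region_feature(candidate_box, scene_feature_map, scale=2.0):
    """Scale the candidate box about its centre, clip the scaled region to the
    scene feature map (shape (C, H, W)), and average-pool that region."""
    x1, y1, x2, y2 = candidate_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    _, H, W = scene_feature_map.shape
    sx1, sy1 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
    sx2, sy2 = min(W, int(cx + w / 2)), min(H, int(cy + h / 2))
    return scene_feature_map[:, sy1:sy2, sx1:sx2].mean(axis=(1, 2))
```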
Illustratively, the detection result obtaining module 560 includes: a classification network input sub-module, configured to input the combined features into the classification network for each of the first number of object candidate regions, so as to obtain the coordinates of each object candidate region in a second number of object candidate regions output by the classification network and the confidence that the object corresponding to each object candidate region belongs to each predetermined category; a filtering sub-module, configured to filter the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and a detection result determining sub-module, configured to determine the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object corresponding to each object candidate region belongs to each predetermined category as the object detection result.
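For illustration only, the non-maximum suppression step performed by the filtering sub-module could look like the following sketch; the IoU threshold of 0.5 is an assumed value, and the indices returned correspond to the reduction from the second number to the third number of candidate regions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,).
    Returns the indices of the boxes kept after non-maximum suppression."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Overlap of the current best box with all remaining boxes.
        xx1, yy1 = np.maximum(x1[i], x1[order[1:]]), np.maximum(y1[i], y1[order[1:]])
        xx2, yy2 = np.minimum(x2[i], x2[order[1:]]), np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]
    return keep
```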
Illustratively, the combining module 550 includes: a splicing sub-module, configured to input, for each of the first number of object candidate regions, the feature of the object candidate region in the object feature map and the corresponding scene region feature into a splicing network in the object detection network to obtain the combined feature.
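For illustration only, the splicing network could be modelled as channel-wise concatenation followed by a small learned projection, as in the PyTorch sketch below; the feature dimensions, the extra projection layer, and the class name are assumptions, not features recited in the claims.

```python
import torch
import torch.nn as nn

class SpliceFeatures(nn.Module):
    """Concatenate the object-region feature with the corresponding
    scene-region feature and project the result with one learned layer."""
    def __init__(self, obj_dim=1024, scene_dim=512, out_dim=1024):
        super().__init__()
        self.project = nn.Linear(obj_dim + scene_dim, out_dim)

    def forward(self, obj_feat, scene_feat):
        # obj_feat: (num_candidates, obj_dim); scene_feat: (num_candidates, scene_dim)
        combined = torch.cat([obj_feat, scene_feat], dim=1)
        return torch.relu(self.project(combined))
```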
Illustratively, the scene network and/or the object network is a fully convolutional network.
Illustratively, the classification network is a fully connected network or a convolutional network.
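For illustration only, the two options mentioned above could be realised as the following PyTorch heads; the layer sizes, the single box-regression branch, and the number of classes are assumptions.

```python
import torch.nn as nn

def fully_connected_head(in_dim=1024, num_classes=80):
    """Predicts per-class confidences and a 4-value box refinement
    from a flattened combined feature."""
    return nn.ModuleDict({
        "cls": nn.Linear(in_dim, num_classes),
        "box": nn.Linear(in_dim, 4),
    })

def convolutional_head(in_channels=256, num_classes=80):
    """Same outputs, computed with 1x1 convolutions on a pooled feature map."""
    return nn.ModuleDict({
        "cls": nn.Conv2d(in_channels, num_classes, kernel_size=1),
        "box": nn.Conv2d(in_channels, 4, kernel_size=1),
    })
```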
Illustratively, the object detection apparatus 500 further includes a training module (not shown) for training the object detection network by using sample object images labeled with object classes in an object classification database.
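For illustration only, a highly simplified training-loop sketch is given below; the dataset interface, the compute_loss method, the batch size, and the optimiser settings are all assumptions and do not reflect the actual training scheme of the disclosure.

```python
import torch
from torch.utils.data import DataLoader

def train_object_detection_network(detector, dataset, epochs=10, lr=1e-3):
    """detector is assumed to expose a compute_loss(image, target) method that
    returns a scalar loss for one annotated sample; target is assumed to hold
    the labelled boxes and object classes of a sample object image."""
    loader = DataLoader(dataset, batch_size=2, shuffle=True,
                        collate_fn=lambda batch: batch)
    optimizer = torch.optim.SGD(detector.parameters(), lr=lr, momentum=0.9)
    detector.train()
    for _ in range(epochs):
        for batch in loader:
            loss = torch.stack([detector.compute_loss(img, tgt)
                                for img, tgt in batch]).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```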
FIG. 6 shows a schematic block diagram of an object detection system 600 according to one embodiment of the invention. Object detection system 600 includes an image capture device 610, a storage device 620, and a processor 630.
The image capture device 610 is used to capture the image to be detected. The image capture device 610 is optional, and the object detection system 600 may not include it; in this case, the image to be detected may be captured by another image capture device and transmitted to the object detection system 600.
The storage device 620 stores computer program instructions for implementing the respective steps in the object detection method according to an embodiment of the invention.
The processor 630 is configured to run the computer program instructions stored in the storage device 620 to perform the corresponding steps of the object detection method according to the embodiment of the present invention, and is configured to implement the image acquisition module 510, the scene network module 520, the object network module 530, the mapping module 540, the combining module 550, and the detection result obtaining module 560 in the object detection apparatus 500 according to the embodiment of the present invention.
In one embodiment, the computer program instructions, when executed by the processor 630, are for performing the steps of: acquiring an image to be detected; inputting an image to be detected into a scene network in an object detection network to obtain a scene characteristic diagram related to scene information of the image to be detected; inputting an image to be detected into an object network in an object detection network to obtain an object feature map associated with object information of the image to be detected, and determining a first number of object candidate regions indicating object positions in the object feature map; for each of the first number of object candidate regions, mapping the object candidate region onto a scene feature map to determine a scene region feature corresponding to the object candidate region; for each of the first number of object candidate regions, combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and for each of the first number of object candidate regions, inputting the combined features into a classification network in an object detection network to obtain an object detection result.
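For illustration only, the data flow of these steps at inference time can be sketched as follows; the sub-networks and operations are passed in as callables with assumed interfaces, so only the ordering of operations mirrors the description.

```python
import torch

@torch.no_grad()
def detect(image, scene_net, object_net, propose, roi_pool, map_to_scene, fuse, classify):
    """Data-flow sketch of the listed steps; every argument after `image` is an
    assumed callable standing in for the corresponding sub-network or operation."""
    scene_map = scene_net(image)          # scene feature map
    object_map = object_net(image)        # object feature map
    candidates = propose(object_map)      # first number of object candidate regions
    detections = []
    for box in candidates:
        obj_feat = roi_pool(object_map, box)       # feature of the candidate in the object map
        scene_feat = map_to_scene(scene_map, box)  # corresponding scene region feature
        combined = fuse(obj_feat, scene_feat)      # combined feature
        detections.append(classify(combined))      # refined coordinates + per-class confidence
    return detections
```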
Illustratively, the step, performed when the computer program instructions are run by the processor 630, of mapping the object candidate region onto the scene feature map for each of the first number of object candidate regions to determine the scene region feature corresponding to the object candidate region comprises: calculating the degree of overlap between the object candidate region and each of the pre-divided scene regions in the scene feature map; selecting the scene region with the largest degree of overlap as the associated scene region of the object candidate region; and extracting the feature of the associated scene region from the scene feature map as the scene region feature corresponding to the object candidate region.
For example, the step, performed when the computer program instructions are run by the processor 630, of mapping the object candidate region onto the scene feature map for each of the first number of object candidate regions to determine the scene region feature corresponding to the object candidate region comprises: scaling the object candidate region to obtain a scaled region; determining the region in the scene feature map whose position coincides with the scaled region as the associated scene region of the object candidate region; and extracting the feature of the associated scene region from the scene feature map as the scene region feature corresponding to the object candidate region.
Illustratively, the step, performed when the computer program instructions are run by the processor 630, of inputting the combined features into the classification network in the object detection network for each of the first number of object candidate regions to obtain the object detection result comprises: for each of the first number of object candidate regions, inputting the combined features into the classification network to obtain the coordinates of each object candidate region in a second number of object candidate regions output by the classification network and the confidence that the object corresponding to each object candidate region belongs to each predetermined category; filtering the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and determining the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object corresponding to each object candidate region belongs to each predetermined category as the object detection result.
Illustratively, the step, performed when the computer program instructions are run by the processor 630, of combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features for each of the first number of object candidate regions comprises: for each of the first number of object candidate regions, inputting the features of the object candidate region in the object feature map and the corresponding scene region features into a splicing network in the object detection network to obtain the combined features.
Illustratively, the scene network and/or the object network is a fully convolutional network.
Illustratively, the classification network is a fully connected network or a convolutional network.
Illustratively, the computer program instructions, when executed by the processor 630, are further operable to perform the step of: training the object detection network by using sample object images labeled with object classes in an object classification database.
Further, according to an embodiment of the present invention, there is also provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are used for executing the respective steps of the object detection method according to an embodiment of the present invention and for implementing the respective modules in the object detection apparatus according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the program instructions, when executed by a computer or a processor, may cause the computer or the processor to implement the respective functional modules of the object detection apparatus according to the embodiment of the present invention, and/or may perform the object detection method according to the embodiment of the present invention.
In one embodiment, the program instructions are operable when executed to perform the steps of: acquiring an image to be detected; inputting an image to be detected into a scene network in an object detection network to obtain a scene characteristic diagram related to scene information of the image to be detected; inputting an image to be detected into an object network in an object detection network to obtain an object feature map associated with object information of the image to be detected, and determining a first number of object candidate regions indicating object positions in the object feature map; for each of the first number of object candidate regions, mapping the object candidate region onto a scene feature map to determine a scene region feature corresponding to the object candidate region; for each of the first number of object candidate regions, combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and for each of the first number of object candidate regions, inputting the combined features into a classification network in an object detection network to obtain an object detection result.
For example, the step of mapping the object candidate region onto the scene feature map for each of the first number of object candidate regions to determine the scene region feature corresponding to the object candidate region, which is performed by the program instructions when executed, comprises: calculating the degree of overlap between the object candidate region and each of the pre-divided scene regions in the scene feature map; selecting the scene region with the largest degree of overlap as the associated scene region of the object candidate region; and extracting the feature of the associated scene region from the scene feature map as the scene region feature corresponding to the object candidate region.
For example, the step of mapping the object candidate region onto the scene feature map for each of the first number of object candidate regions to determine the scene region feature corresponding to the object candidate region, which is performed by the program instructions when executed, comprises: scaling the object candidate region to obtain a scaled region; determining the region in the scene feature map whose position coincides with the scaled region as the associated scene region of the object candidate region; and extracting the feature of the associated scene region from the scene feature map as the scene region feature corresponding to the object candidate region.
Illustratively, the step of inputting the combined features into the classification network in the object detection network for each of the first number of object candidate regions to obtain the object detection result, which is performed by the program instructions when executed, comprises: for each of the first number of object candidate regions, inputting the combined features into the classification network to obtain the coordinates of each object candidate region in a second number of object candidate regions output by the classification network and the confidence that the object corresponding to each object candidate region belongs to each predetermined category; filtering the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and determining the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object corresponding to each object candidate region belongs to each predetermined category as the object detection result.
For example, the step of combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features for each of the first number of object candidate regions, which is performed by the program instructions when executed, comprises: for each of the first number of object candidate regions, inputting the features of the object candidate region in the object feature map and the corresponding scene region features into a splicing network in the object detection network to obtain the combined features.
Illustratively, the scene network and/or the object network is a fully convolutional network.
Illustratively, the classification network is a fully connected network or a convolutional network.
Illustratively, the program instructions are further operable when executed to perform the step of: training the object detection network by using sample object images labeled with object classes in an object classification database.
The modules in the object detection system according to embodiments of the present invention may be implemented by a processor of an electronic device implementing object detection according to embodiments of the present invention running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer readable storage medium of a computer program product according to embodiments of the present invention are run by a computer.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some of the modules in an object detection apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
The above description is merely of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. An object detection method comprising:
acquiring an image to be detected;
inputting the image to be detected into a scene network in an object detection network to obtain a scene feature map related to scene information of the image to be detected;
inputting the image to be detected into an object network in the object detection network to obtain an object feature map related to object information of the image to be detected, and determining a first number of object candidate regions for indicating object positions in the object feature map;
for each of the first number of object candidate regions,
mapping the object candidate region to the scene feature map to determine a scene region feature corresponding to the object candidate region;
combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and
inputting the combined features into a classification network in the object detection network to obtain an object detection result.
2. The object detection method according to claim 1, wherein the mapping, for each of the first number of object candidate regions, the object candidate region onto the scene feature map to determine the scene region feature corresponding to the object candidate region comprises:
calculating the degree of overlap between the object candidate region and each of the pre-divided scene regions in the scene feature map;
selecting the scene region with the largest degree of overlap as the associated scene region of the object candidate region; and
extracting the feature of the associated scene region from the scene feature map as the scene region feature corresponding to the object candidate region.
3. The object detection method according to claim 1, wherein the mapping, for each of the first number of object candidate regions, the object candidate region onto the scene feature map to determine the scene region feature corresponding to the object candidate region comprises:
scaling the object candidate region to obtain a scaled region;
determining the region in the scene feature map whose position coincides with the scaled region as the associated scene region of the object candidate region; and
extracting the feature of the associated scene region from the scene feature map as the scene region feature corresponding to the object candidate region.
4. The object detection method of claim 1, wherein said inputting the combined features into a classification network of the object detection network for each of the first number of object candidate regions to obtain an object detection result comprises:
for each of the first number of object candidate regions, inputting the combined features into the classification network to obtain coordinates of each of a second number of object candidate regions output by the classification network and a confidence that the object corresponding to each object candidate region belongs to each predetermined category;
filtering the second number of object candidate regions by using a non-maximum suppression algorithm to obtain a third number of object candidate regions; and
determining the coordinates of each object candidate region in the third number of object candidate regions and the confidence that the object corresponding to each object candidate region belongs to each predetermined category as the object detection result.
5. The object detection method according to claim 1, wherein, for each of the first number of object candidate regions, combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features comprises:
inputting the feature of the object candidate region in the object feature map and the corresponding scene region feature into a splicing network in the object detection network to obtain the combined feature.
6. The object detection method according to claim 1, wherein the scene network and/or the object network is a fully convolutional network.
7. The object detection method of claim 1, wherein the classification network is a fully connected network or a convolutional network.
8. The object detection method according to claim 1, wherein the object detection method further comprises:
training the object detection network by using sample object images labeled with object classes in an object classification database.
9. An object detecting device comprising:
the image acquisition module is used for acquiring an image to be detected;
the scene network module is used for inputting the image to be detected into a scene network in an object detection network so as to obtain a scene feature map related to scene information of the image to be detected;
an object network module, configured to input the image to be detected into an object network in the object detection network, to obtain an object feature map related to object information of the image to be detected, and to determine a first number of object candidate regions indicating object positions in the object feature map;
a mapping module, configured to map, for each of the first number of object candidate regions, the object candidate region onto the scene feature map to determine a scene region feature corresponding to the object candidate region;
a combining module, configured to combine, for each of the first number of object candidate regions, a feature of the object candidate region in the object feature map with a corresponding scene region feature to obtain a combined feature; and
a detection result obtaining module, configured to, for each of the first number of object candidate regions, input the combined features into a classification network in the object detection network to obtain an object detection result.
10. An object detection system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are operable to perform the steps of:
acquiring an image to be detected;
inputting the image to be detected into a scene network in an object detection network to obtain a scene feature map related to scene information of the image to be detected;
inputting the image to be detected into an object network in the object detection network to obtain an object feature map related to object information of the image to be detected, and determining a first number of object candidate regions for indicating object positions in the object feature map;
for each of the first number of object candidate regions,
mapping the object candidate region to the scene feature map to determine a scene region feature corresponding to the object candidate region;
combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and
inputting the combined features into a classification network in the object detection network to obtain an object detection result.
11. A storage medium having stored thereon program instructions which when executed are for performing the steps of:
acquiring an image to be detected;
inputting the image to be detected into a scene network in an object detection network to obtain a scene feature map related to scene information of the image to be detected;
inputting the image to be detected into an object network in the object detection network to obtain an object feature map related to object information of the image to be detected, and determining a first number of object candidate regions for indicating object positions in the object feature map;
for each of the first number of object candidate regions,
mapping the object candidate region to the scene feature map to determine a scene region feature corresponding to the object candidate region;
combining the features of the object candidate region in the object feature map with the corresponding scene region features to obtain combined features; and
inputting the combined features into a classification network in the object detection network to obtain an object detection result.
CN201710740825.0A 2017-08-25 2017-08-25 Object detection method, device and system and storage medium Active CN108875750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710740825.0A CN108875750B (en) 2017-08-25 2017-08-25 Object detection method, device and system and storage medium

Publications (2)

Publication Number Publication Date
CN108875750A CN108875750A (en) 2018-11-23
CN108875750B true CN108875750B (en) 2021-08-10

Family

ID=64325500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710740825.0A Active CN108875750B (en) 2017-08-25 2017-08-25 Object detection method, device and system and storage medium

Country Status (1)

Country Link
CN (1) CN108875750B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242117A (en) * 2018-11-28 2020-06-05 佳能株式会社 Detection device and method, image processing device and system
CN109784159A (en) * 2018-12-11 2019-05-21 北京航空航天大学 The processing method of scene image, apparatus and system
CN109949297B (en) * 2019-03-20 2023-02-21 天津工业大学 Lung nodule detection method based on recection and fast R-CNN
CN110084240A (en) * 2019-04-24 2019-08-02 网易(杭州)网络有限公司 A kind of Word Input system, method, medium and calculate equipment
CN110781805B (en) * 2019-10-23 2024-05-07 北京鉴微知著智能科技有限公司 Target object detection method, device, computing equipment and medium
CN110852258A (en) * 2019-11-08 2020-02-28 北京字节跳动网络技术有限公司 Object detection method, device, equipment and storage medium
CN113326796B (en) * 2021-06-17 2022-11-29 北京百度网讯科技有限公司 Object detection method, model training method and device and electronic equipment
CN113869211B (en) * 2021-09-28 2024-07-02 杭州福柜科技有限公司 Automatic image labeling and labeling quality automatic evaluation method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2431919A1 (en) * 2010-09-16 2012-03-21 Thomson Licensing Method and device of determining a saliency map for an image
CN105678267A (en) * 2016-01-08 2016-06-15 浙江宇视科技有限公司 Scene recognition method and device
CN105760488A (en) * 2016-02-17 2016-07-13 北京大学 Image expressing method and device based on multi-level feature fusion
CN106778835A (en) * 2016-11-29 2017-05-31 武汉大学 The airport target by using remote sensing image recognition methods of fusion scene information and depth characteristic

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Better Exploiting OS-CNNs for Better Event Recognition in Images";L. Wang等;《2015 IEEE International Conference on Computer Vision Workshop (ICCVW)》;20160215;全文 *
"Object Detection Based on Scene Understanding and Enhanced Proposals";Zhicheng Wang等;《PCM 2016: Advances in Multimedia Information Processing》;20161127;全文 *
"一种基于多特征参数融合的弱小目标检测算法(英文)";张双垒等;《红外技术》;20150831;第37卷(第8期);全文 *

Also Published As

Publication number Publication date
CN108875750A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN108875750B (en) Object detection method, device and system and storage medium
CN109815843B (en) Image processing method and related product
CN107808111B (en) Method and apparatus for pedestrian detection and attitude estimation
CN108875537B (en) Object detection method, device and system and storage medium
CN106447721B (en) Image shadow detection method and device
CN106650662B (en) Target object shielding detection method and device
WO2021017998A1 (en) Method and system for positioning text position, and method and system for training model
WO2019218824A1 (en) Method for acquiring motion track and device thereof, storage medium, and terminal
CN107844794B (en) Image recognition method and device
CN106845352B (en) Pedestrian detection method and device
US9767363B2 (en) System and method for automatic detection of spherical video content
WO2021051601A1 (en) Method and system for selecting detection box using mask r-cnn, and electronic device and storage medium
CN109816745B (en) Human body thermodynamic diagram display method and related products
CN108932456B (en) Face recognition method, device and system and storage medium
CN108009466B (en) Pedestrian detection method and device
CN108875481B (en) Method, device, system and storage medium for pedestrian detection
WO2018210047A1 (en) Data processing method, data processing apparatus, electronic device and storage medium
CN111259758A (en) A two-stage remote sensing image object detection method for dense areas
CN106971178A (en) Pedestrian detection and the method and device recognized again
CN106650743B (en) Image strong reflection detection method and device
CN111047626A (en) Target tracking method, device, electronic device and storage medium
CN108876804A (en) It scratches as model training and image are scratched as methods, devices and systems and storage medium
CN114140745A (en) Method, system, device and medium for detecting personnel attributes of construction site
CN109117882B (en) Method, device and system for acquiring user track and storage medium
CN108875500B (en) Pedestrian re-identification method, device and system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant