CN111144408A - Image recognition method, image recognition device, electronic equipment and storage medium - Google Patents
- Publication number
- CN111144408A (application CN201911347935.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- activation
- target
- layer
- activation value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The embodiment of the present application discloses an image recognition method, which comprises the following steps: obtaining a first image to be recognized; inputting the first image into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model; extracting a second image where the target object is located in the first image based on the plurality of feature maps; and inputting the second image into a trained classification model to obtain a recognition result associated with the target object in the first image, and outputting the recognition result. The embodiment of the present application also discloses an image recognition apparatus, an electronic device, and a storage medium.
Description
Technical Field
The present application relates to the field of image recognition, and in particular, to an image recognition method, an image recognition apparatus, an electronic device, and a storage medium.
Background
There are countless things in the world that we do not recognize. Currently, a user can photograph an unknown object using the capture function of an electronic device to obtain an image containing the object, and the electronic device can recognize the image to obtain a recognition result for the object contained in the image. However, this method of recognizing the entire captured image suffers from low recognition accuracy.
Disclosure of Invention
The embodiments of the present application provide an image recognition method, an image recognition apparatus, an electronic device, and a storage medium, so as to solve the problem of low recognition accuracy in the related-art approach of recognizing the entire captured image.
The technical scheme of the application is realized as follows:
an image recognition method, the method comprising:
obtaining a first image to be identified;
inputting the first image into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model;
extracting a second image where a target object is located in the first image based on the plurality of feature maps;
and inputting the second image into a classification model obtained through training to obtain a recognition result associated with the target object in the first image, and outputting the recognition result.
Optionally, the extracting, based on the plurality of feature maps, a second image where the target object is located in the first image includes:
adding the pixels in the plurality of feature maps pixel by pixel along the channel direction to obtain a summation activation map;
determining a target position corresponding to the target object in the summation activation map;
and determining the second image corresponding to the target position in the first image, and extracting the second image.
Optionally, the determining a target position corresponding to the target object in the summation activation map includes:
searching for a plurality of positions where a plurality of pixels whose activation values in the summation activation map are greater than a target activation value are located;
determining a first position corresponding to a minimum activation value and a second position corresponding to a maximum activation value among the plurality of positions in the first direction;
determining a third position corresponding to a minimum activation value and a fourth position corresponding to a maximum activation value among the plurality of positions in the second direction; an included angle between the second direction and the first direction is a right angle; the target position includes the first position, the second position, the third position, and the fourth position.
Optionally, the searching for the plurality of positions where the pixels whose activation values in the summation activation map are greater than the target activation value are located includes:
obtaining the maximum activation value among all the activation values corresponding to all the pixels in the summation activation map;
multiplying the maximum activation value by a preset parameter to obtain the target activation value;
finding the plurality of positions in the summation activation map whose activation values are greater than the target activation value.
Optionally, the determining the second image corresponding to the target position in the first image includes:
obtaining a position mapping relationship between each pixel in the summation activation map and each pixel in the first image;
and determining the second image corresponding to the target position in the first image based on the position mapping relationship.
Optionally, the target layer is a layer of the plurality of convolutional layers whose layer number is smaller than a target threshold.
Optionally, the target layer is the second layer of the plurality of convolutional layers, and the mapping relationship indicates that the position of the same pixel in the summation activation map is the same as its position in the first image.
An image recognition device, the image recognition device comprising:
an obtaining unit configured to obtain a first image to be recognized;
the first processing unit is used for inputting the first image into a convolutional neural network model to obtain a plurality of characteristic graphs corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model;
the second processing unit is used for extracting a second image where the target object is located in the first image based on the plurality of feature maps;
and the third processing unit is used for inputting the second image into the trained classification model, obtaining a recognition result associated with the target object in the first image, and outputting the recognition result.
An electronic device, the electronic device comprising: a processor, a memory, and a communication bus;
the communication bus is used for realizing communication connection between the processor and the memory;
the processor is configured to execute an image recognition program stored in the memory to implement the steps of the image recognition method as described above.
A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the image recognition method as described above.
The embodiments of the present application provide an image recognition method, an image recognition apparatus, an electronic device, and a storage medium. A first image to be recognized is obtained; the first image is input into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model; based on the plurality of feature maps, a second image where the target object is located is extracted from the first image, that is, the first image is preprocessed and the second image containing the body part with the most salient semantic information is extracted; the second image is input into the trained classification model to obtain a recognition result associated with the target object in the first image, and the recognition result is output. In this way, the influence of background and noise outside the body region on the final recognition result is eliminated before recognition, and only the body region is recognized, which solves the problem of low recognition accuracy in methods that recognize the entire captured image, improves the recognition accuracy, and improves the degree of intelligence of the electronic device.
Drawings
Fig. 1 is a schematic flowchart of an image recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another image recognition method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of another image recognition method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of an image recognition method according to another embodiment of the present application;
fig. 5 is a schematic diagram of a first image according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of activation conditions of a plurality of feature maps corresponding to a second layer according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
It should be appreciated that reference throughout this specification to "an embodiment of the present application" or "an embodiment described previously" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in the embodiments of the present application" or "in the embodiments" in various places throughout this specification are not necessarily all referring to the same embodiments. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
An embodiment of the present application provides an image recognition method, which is applied to an electronic device. As shown in fig. 1, the method includes the following steps:
Step 101: obtaining a first image to be recognized.
Here, the electronic device may be a smart terminal, for example, a mobile terminal device with wireless communication capability such as a mobile phone, a tablet computer, or a notebook computer, or a smart terminal device that is not easily portable, such as a desktop computer. The electronic device is used for image recognition.
The first image may be an image acquired by the electronic device in real time; the first image can also be an image extracted by the electronic equipment from a video stream shot in real time; of course, the first image may also be a pre-captured image, for example, an image pre-captured by the electronic device or an image pre-captured by another device obtained by the electronic device. Here, the first image is used as an object to be recognized, and the source of the first image is not particularly limited in this embodiment of the application.
The image recognition method provided by the embodiment of the application can be applied to an article recognition scene, a face recognition scene, a license plate recognition scene and the like, and certainly, the image recognition method provided by the embodiment of the application can also be applied to other scenes such as a focus detection scene, and the application scene is not specifically limited by the embodiment of the application.
Step 102: inputting the first image into the convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in the plurality of convolutional layers of the convolutional neural network model.
A Convolutional Neural Network (CNN) is a class of feedforward neural networks that contains convolution computations and has a deep structure, and is one of the representative algorithms of deep learning. Convolutional neural networks have a feature learning (representation learning) capability and can perform shift-invariant classification of input information according to their hierarchical structure, and are therefore also called Shift-Invariant Artificial Neural Networks (SIANN).
A convolutional neural network includes an input layer, hidden layers, and an output layer. The input layer is responsible for receiving information from outside the network and fanning out the input signal; it is generally not counted when calculating the number of layers of the neural network. The layers other than the input layer and the output layer are called hidden layers; that is, a hidden layer neither directly receives external signals nor directly sends signals to the outside. The output layer is responsible for outputting the calculation result of the neural network.
Here, the hidden layers of a convolutional neural network include three common types of structure: convolutional layers, pooling layers, and fully connected layers. The function of a convolutional layer is to perform feature extraction on input data such as an input image. A convolutional layer contains a plurality of convolution kernels; each element of a convolution kernel corresponds to a weight coefficient and a bias vector, similar to a neuron of a feedforward neural network.
Here, the convolutional layer parameters include the convolution kernel size, the stride, and the padding; together they determine the size of the convolutional layer's output feature map and are hyper-parameters of the convolutional neural network.
The convolution kernel size can be specified as any value smaller than the input image size; the larger the convolution kernel, the more complex the input features that can be extracted.
The stride defines the distance the convolution kernel moves between two successive positions as it sweeps over the feature map. When the stride is 1, the convolution kernel sweeps over the elements of the feature map one by one; when the stride is n, it skips n-1 pixels in the next scan.
Further, as can be seen from the cross-correlation computation of the convolution kernel, the size of the feature map gradually shrinks as convolutional layers are stacked; padding is therefore a method of artificially enlarging the feature map before it passes through the convolution kernel, so as to offset the shrinkage caused by the computation.
In the embodiment of the present application, the convolutional neural network model is adjusted by setting the convolution kernel size, the stride, and the padding parameters.
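As an illustrative sketch (the function and variable names are only for illustration), the standard relationship between these hyper-parameters and the output feature-map size can be written as follows:

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Spatial size (height or width) of a convolutional layer's output feature map."""
    # Standard relation: out = floor((in - kernel + 2 * padding) / stride) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

# A 3x3 kernel with stride 1 and padding 1 preserves the spatial size,
# as in the VGG16 convolutional layers discussed below.
assert conv_output_size(224, kernel_size=3, stride=1, padding=1) == 224
```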
In practical application, the electronic device inputs the first image into the convolutional neural network model to obtain a plurality of feature maps of each convolutional layer in the plurality of convolutional layers of the convolutional neural network model, and further extracts a plurality of feature maps corresponding to the target layer from the plurality of feature maps of each convolutional layer in the plurality of convolutional layers.
In some embodiments, the convolutional neural network model may be a Visual Geometry Group (VGG) model. VGG models can be divided into several configurations according to the convolution kernel size and the number of convolutional layers; in the embodiment of the present application, either of two configurations, VGG16 or VGG19, may be selected.
Taking VGG16 as an example, VGG16 includes 13 convolutional layers, 3 fully connected layers, and 5 pooling layers. The convolutional layers and the fully connected layers carry weight coefficients and are therefore also called weight layers; their total number is 13 + 3 = 16, which is the source of the 16 in VGG16. A salient feature of VGG16 is its simplicity, embodied in three aspects. First, the convolutional layers all use the same convolution kernel parameters and are all denoted conv3-XXX, where conv3 indicates that the convolution kernel size used by the layer is 3, i.e., both width and height are 3; 3 × 3 is a very small kernel size, and combined with the other parameters (stride 1, padding 1) it allows each convolutional layer to keep the same width and height as the previous layer. XXX represents the number of channels of the convolutional layer. Second, the pooling layers all use the same pooling kernel parameters. Third, the model is formed by stacking a number of convolutional layers and pooling layers, so a deep network structure is relatively easy to form.
In other embodiments of the present application, the target layer is a layer whose layer number is smaller than a target threshold among the plurality of convolutional layers. Illustratively, the target threshold takes a value in the range [1, 5]. For example, the electronic device inputs the first image into the convolutional neural network model to obtain a plurality of feature maps corresponding to the first layer of the plurality of convolutional layers of the convolutional neural network model. For another example, the electronic device inputs the first image into the convolutional neural network model to obtain a plurality of feature maps corresponding to the second layer of the plurality of convolutional layers. For another example, the electronic device inputs the first image into the convolutional neural network model to obtain a plurality of feature maps corresponding to the fifth layer of the plurality of convolutional layers.
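As an illustrative sketch, the feature maps of such a target layer can be obtained, for example, with an ImageNet-pretrained VGG16 from torchvision and a forward hook; treating model.features[2] as the second convolutional layer and the input file name are assumptions of this sketch rather than requirements of the method.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# ImageNet-pretrained VGG16 (older torchvision versions use pretrained=True instead).
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

feature_maps = {}

def save_target_layer(module, inputs, output):
    # For the second 3x3 convolution the output has shape (1, 64, 224, 224).
    feature_maps["target"] = output.detach().squeeze(0)

# In torchvision's VGG16, features[0] and features[2] are the first two convolutions.
model.features[2].register_forward_hook(save_target_layer)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

first_image = Image.open("first_image.jpg").convert("RGB")  # hypothetical input file
with torch.no_grad():
    model(preprocess(first_image).unsqueeze(0))

target_feature_maps = feature_maps["target"]  # shape (c, h, w) = (64, 224, 224)
```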
Step 103: extracting a second image where the target object is located in the first image based on the plurality of feature maps.
Here, after obtaining a plurality of feature maps corresponding to the target layer, the electronic device determines a main body region corresponding to the target object from the first image, that is, a second image that is a partial image where the target object is located, based on the plurality of feature maps, and further extracts the second image where the target object is located from the first image, thereby cropping the second image from the first image.
Step 104: inputting the second image into the trained classification model to obtain a recognition result associated with the target object in the first image, and outputting the recognition result.
After obtaining the second image, the electronic device inputs the second image into the trained classification model, obtains a recognition result associated with the target object in the first image, and outputs the recognition result. That is to say, in the process of recognizing the target object in the first image in the embodiment of the present application, the first image to be recognized is cropped and the partial image corresponding to the body region with the most salient semantic information, that is, the second image, is extracted. The body region is effectively located, and the influence of background and noise outside the body region on the final recognition result is better eliminated. This avoids the problem that sending the entire first image to the classification model introduces more environmental background and noise and increases the difficulty of recognition, eliminates the interference from other objects or environmental factors when multiple objects are present, and improves the recognition effect.
According to the image recognition method provided by the embodiment of the present application, a first image to be recognized is obtained; the first image is input into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model; based on the plurality of feature maps, a second image where the target object is located is extracted from the first image, that is, the first image is preprocessed and the second image containing the body part with the most salient semantic information is extracted; the second image is input into the trained classification model to obtain a recognition result associated with the target object in the first image, and the recognition result is output. In this way, the influence of background and noise outside the body region on the final recognition result is eliminated before recognition, and only the body region is recognized, which solves the problem of low recognition accuracy in methods that recognize the entire captured image, improves the recognition accuracy, and improves the degree of intelligence of the electronic device.
An embodiment of the present application provides an image recognition method, which is applied to an electronic device. As shown in fig. 2, the method includes the following steps:
Step 201: obtaining a first image to be recognized.
Step 202: inputting the first image into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model.
Step 203: adding the pixels in the plurality of feature maps pixel by pixel along the channel direction to obtain a summation activation map.
Here, the summation activation map can be understood as an overall feature map obtained by fusing all the feature maps of the target layer at a 1:1 ratio.
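Continuing the earlier sketch, and assuming the tensor target_feature_maps of shape (c, h, w) from the previous snippet, the summation activation map can be computed as follows:

```python
# Sum over the channel axis: each pixel of the resulting (h, w) map is the sum of
# the activation values of all c feature maps at that position.
summed_activation = target_feature_maps.sum(dim=0)  # shape (h, w)
```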
Step 204: determining a target position corresponding to the target object in the summation activation map.
Here, the target position includes vertex coordinates of the region where the target object is located in the summation activation map.
In this embodiment of the application, the step 204 of determining the target position corresponding to the target object in the summation activation map may include the following steps:
Step 204a: searching for a plurality of positions where pixels whose activation values in the summation activation map are greater than a target activation value are located.
Here, the plurality of positions where the activation values of the pixels in the summation activation map are greater than the target activation value include the distribution points of the target object in the summation activation map.
In some embodiments of the present application, the step 204a of searching for the plurality of positions where the pixels whose activation values in the summation activation map are greater than the target activation value are located may include the following steps:
Step 1: obtaining the maximum activation value among all the activation values corresponding to all the pixels in the summation activation map.
Step 2: multiplying the maximum activation value by a preset parameter to obtain the target activation value.
Here, the preset parameter is related to the size of the activated region selected in the summation activation map, i.e., the region where the target object is located; for example, the preset parameter takes a value between 0.5 and 1, and the smaller the value of the preset parameter, the larger the area of the activated region, while the larger the value, the smaller the area. Illustratively, the preset parameter may take the value 0.65, so as to ensure that the second image containing the target region is extracted.
Step 3: finding the plurality of positions in the summation activation map whose activation values are greater than the target activation value.
Step 204b: determining a first position corresponding to a minimum activation value and a second position corresponding to a maximum activation value among the plurality of positions in the first direction.
Step 204c: determining a third position corresponding to a minimum activation value and a fourth position corresponding to a maximum activation value among the plurality of positions in the second direction.
Here, the included angle between the second direction and the first direction is a right angle, and the target position includes the first position, the second position, the third position, and the fourth position.
Here, in the case where the electronic device determines the first position, the second position, the third position, and the fourth position, a rectangular region may be constructed based on the first position, the second position, the third position, and the fourth position, and a mapping relationship may exist between the position of each pixel in the rectangular region and the position of each pixel in a region where the second image to be extracted in the first image is located.
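A possible sketch of steps 204a to 204c, following the concrete embodiment described later (where the extreme coordinates of the activated positions bound the body region); the summation activation map is assumed to be a 2D PyTorch tensor and the preset parameter defaults to 0.65:

```python
import torch

def locate_target(summed_activation: torch.Tensor, preset: float = 0.65):
    """Return (xmin, ymin, xmax, ymax) bounding the strongly activated region."""
    # Steps 1-2: target activation value = preset parameter * maximum activation value.
    target_value = preset * summed_activation.max()
    # Step 3: positions whose activation values exceed the target activation value.
    ys, xs = torch.nonzero(summed_activation >= target_value, as_tuple=True)
    # Steps 204b/204c: extreme positions along the two perpendicular directions.
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

xmin, ymin, xmax, ymax = locate_target(summed_activation)
```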
In this embodiment of the application, the determining the second image corresponding to the target position in the first image in step 205 may include the following steps:
Step 205a: obtaining a position mapping relationship between each pixel in the summation activation map and each pixel in the first image.
Here, the position mapping relationship between each pixel in the summation activation map and each pixel in the first image is related to the values of stride and padding. The electronic device may determine a position mapping relationship between each pixel in the summation activation map and each pixel in the first image based on the values of stride and padding.
Step 205b: determining the second image corresponding to the target position in the first image based on the position mapping relationship.
In some embodiments of the present application, the target layer is the second layer of the plurality of convolutional layers, and the mapping relationship indicates that the position of the same pixel in the summation activation map is the same as its position in the first image.
When the target layer is the second layer of the plurality of convolutional layers, stride and padding both take the value 1, and in this case the mapping relationship represents that the position of the same pixel in the summation activation map is the same as its position in the first image; the electronic device then determines the second image corresponding to the target position in the first image based on this position mapping relationship.
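As an illustrative sketch, one common way to approximate such a position mapping is to map a feature-map coordinate to the centre of its receptive field; the formula below is an assumption for illustration rather than the only possible mapping, and with the second layer's 3 × 3 kernel, stride 1, and padding 1 it reduces to the identity described above:

```python
def map_to_input(coord: int, stride: int = 1, padding: int = 1, kernel: int = 3) -> int:
    """Approximate centre of the receptive field of a feature-map coordinate."""
    return coord * stride - padding + (kernel - 1) // 2

# With a 3x3 kernel, stride 1 and padding 1 the mapping is the identity, so
# summation-activation-map positions coincide with first-image positions.
assert map_to_input(100) == 100
```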
Step 206: inputting the second image into the trained classification model to obtain a recognition result associated with the target object in the first image, and outputting the recognition result.
Here, the recognition result includes, but is not limited to, attribute information of the target object, for example, a category to which the target object belongs or a name of the target object. The classification model comprises a deep learning model, and the electronic equipment can obtain a recognition result associated with the target object in the first image after inputting the second image into the trained classification model and output the recognition result.
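A minimal sketch of step 206, continuing the previous snippets; classifier stands for the trained classification model and is an assumption of this sketch rather than a specific API:

```python
import numpy as np
import torch
from PIL import Image

# Map the target position back to first-image coordinates (identity here) and crop
# the second image (the body region) out of the resized first image.
x0, y0, x1, y1 = (map_to_input(v) for v in (xmin, ymin, xmax, ymax))
second_image = np.asarray(first_image.resize((224, 224)))[y0:y1 + 1, x0:x1 + 1]

# Classify only the cropped body region; `classifier` is the trained model (assumed).
with torch.no_grad():
    logits = classifier(preprocess(Image.fromarray(second_image)).unsqueeze(0))
recognition_result = logits.argmax(dim=1).item()  # e.g. a class index to be output
```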
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
An embodiment of the present application provides an image recognition method, which is shown in fig. 3 and 4, and includes the following steps:
Here, the VGG16 pre-training model is a convolutional neural network model trained on ImageNet.
Illustratively, referring to fig. 5, when a user is outdoors, the user photographs a bird on a branch with an electronic device, and the electronic device obtains the first image; the electronic device then sends the first image to the VGG16 pre-training model for calculation to obtain a plurality of feature maps corresponding to each of the plurality of convolutional layers of the convolutional neural network model.
The dimensions of the plurality of feature maps corresponding to the second layer are h × w × c = 224 × 224 × 64. Here, h denotes the height of a feature map, w denotes its width, and c denotes the number of channels.
For example, referring to fig. 6, fig. 6 shows activation conditions of a plurality of feature maps corresponding to the second layer, where an activation value of a region corresponding to a black grid in each feature map is greater than activation values of other blank regions; this area is described as the area where the target object is located.
Step 304: the electronic device finds all elements in the summation activation map whose activation values are greater than or equal to 0.65 × MaxY, where MaxY is the maximum activation value in the summation activation map, and finds the minimum and maximum coordinates of these elements in the x and y directions to obtain xmin, xmax, ymin, and ymax.
Here, (xmin, ymax), (xmax, ymin) are the upper left and lower right coordinates of the body region, and the body positioning is realized.
Step 305: the electronic device maps the (xmin, ymax) and (xmax, ymin) positions of the feature map to the original image to obtain (x′min, y′max) and (x′max, y′min), and crops out the body region from the original image using these coordinates.
It should be noted that, since the convolution kernels of the first and second convolutional layers of VGG16 have size 3 with stride 1 and padding 1, the points of the summation activation map correspond one-to-one to the points of the first image, so x′min = xmin, x′max = xmax, y′min = ymin, and y′max = ymax.
Step 306: the electronic device sends the cropped body region to the classification model trained on the target data set for classification and recognition, obtains a recognition result associated with the target object in the first image, and outputs the recognition result.
In the image recognition method provided by the embodiment of the present application, the recognition process draws on the attention mechanism: in order to make reasonable use of limited visual information processing resources, a specific part of the visual region is selected and then focused on. For example, when a person is reading, usually only a small number of the words to be read are attended to and processed. Certain feature regions are selectively enhanced or suppressed according to the feature distribution. In summary, the attention mechanism involves two main aspects: deciding which part of the input needs to be focused on, and allocating limited information processing resources to the important part. In the embodiment of the present application, the body region is the region that needs attention.
Therefore, in the embodiment of the present application, the image to be recognized is preprocessed using a convolutional neural network model pre-trained on ImageNet, and the body part with the most salient semantic information, which has relatively stronger activation values on the feature maps, is extracted. The body region is thus effectively located, the influence of background and noise outside the body region on the final recognition result is largely eliminated, and the recognition effect is improved.
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
An embodiment of the present application provides an image recognition apparatus, which can be applied to an image recognition method provided in the embodiment corresponding to fig. 1 to 2, and as shown in fig. 7, the image recognition apparatus 4 includes:
an obtaining unit 41 for obtaining a first image to be recognized.
The first processing unit 42 is configured to input the first image into the convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model.
And a second processing unit 43, configured to extract a second image where the target object is located in the first image based on the plurality of feature maps.
And the third processing unit 44 is configured to input the second image into the trained classification model, obtain a recognition result associated with the target object in the first image, and output the recognition result.
In other embodiments of the present application, the second processing unit 43 is further configured to add the pixels in the plurality of feature maps pixel by pixel along the channel direction to obtain a summation activation map; determine a target position corresponding to the target object in the summation activation map; and determine the second image corresponding to the target position in the first image and extract the second image.
In other embodiments of the present application, the second processing unit 43 is further configured to search for a plurality of positions where pixels whose activation values in the summation activation map are greater than the target activation value are located; determine a first position corresponding to a minimum activation value and a second position corresponding to a maximum activation value among the plurality of positions in a first direction; and determine a third position corresponding to a minimum activation value and a fourth position corresponding to a maximum activation value among the plurality of positions in a second direction; an included angle between the second direction and the first direction is a right angle; and the target position includes the first position, the second position, the third position, and the fourth position.
In other embodiments of the present application, the second processing unit 43 is further configured to obtain the maximum activation value among all the activation values corresponding to all the pixels in the summation activation map; multiply the maximum activation value by a preset parameter to obtain the target activation value; and find the plurality of positions in the summation activation map whose activation values are greater than the target activation value.
In other embodiments of the present application, the second processing unit 43 is further configured to obtain a position mapping relationship between each pixel in the summation activation map and each pixel in the first image; and determine the second image corresponding to the target position in the first image based on the position mapping relationship.
In other embodiments of the present application, the target layer is a layer having a number of layers smaller than the target threshold value among the plurality of convolutional layers.
In other embodiments of the present application, the target layer is a second layer of the plurality of convolutional layers, and the mapping relationship indicates that the position of the same pixel in the summation activation map is the same as the position in the first image.
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
Based on the foregoing embodiments, an embodiment of the present application provides an electronic device, which can be applied to the image recognition method provided in the embodiments corresponding to fig. 1-2. As shown in fig. 8, the electronic device 5 (the electronic device 5 in fig. 8 corresponds to the image recognition apparatus 4 in fig. 7) includes: a processor 51, a memory 52, and a communication bus 53, wherein:
the communication bus 53 is used to realize a communication connection between the processor 51 and the memory 52.
The processor 51 is configured to execute an image recognition program stored in the memory 52 to implement the steps of:
obtaining a first image to be identified;
inputting the first image into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model;
extracting a second image where the target object is located in the first image based on the plurality of feature maps;
and inputting the second image into the trained classification model to obtain a recognition result associated with the target object in the first image, and outputting the recognition result.
In other embodiments of the present application, the processor 51 is configured to execute an image recognition program stored in the memory 52 to implement the following steps:
adding the pixels in the plurality of feature maps pixel by pixel along the channel direction to obtain a summation activation map;
determining a target position corresponding to the target object in the summation activation map;
and determining a second image corresponding to the target position in the first image, and extracting the second image.
In other embodiments of the present application, the processor 51 is configured to execute an image recognition program stored in the memory 52 to implement the following steps:
searching for a plurality of positions where pixels whose activation values in the summation activation map are greater than the target activation value are located;
determining a first position corresponding to a minimum activation value and a second position corresponding to a maximum activation value in a plurality of positions in the first direction;
determining a third position corresponding to the minimum activation value and a fourth position corresponding to the maximum activation value in a plurality of positions in the second direction; an included angle between the second direction and the first direction is a right angle; the target position includes a first position, a second position, a third position, and a fourth position.
In other embodiments of the present application, the processor 51 is configured to execute an image recognition program stored in the memory 52 to implement the following steps:
obtaining the maximum activation value among all the activation values corresponding to all the pixels in the summation activation map;
multiplying the maximum activation value by a preset parameter to obtain the target activation value;
and finding the plurality of positions in the summation activation map whose activation values are greater than the target activation value.
In other embodiments of the present application, the processor 51 is configured to execute an image recognition program stored in the memory 52 to implement the following steps:
obtaining a position mapping relationship between each pixel in the summation activation map and each pixel in the first image;
and determining a second image corresponding to the target position in the first image based on the position mapping relation.
In other embodiments of the present application, the target layer is a layer having a number of layers smaller than a target threshold value among the plurality of convolutional layers.
In other embodiments of the present application, the target layer is a second layer of the plurality of convolutional layers, and the mapping relationship indicates that the position of the same pixel in the summation activation map is the same as the position in the first image.
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
Based on the foregoing embodiments, embodiments of the present application provide a computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of:
obtaining a first image to be identified;
inputting the first image into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model;
extracting a second image where the target object is located in the first image based on the plurality of feature maps;
and inputting the second image into the trained classification model to obtain a recognition result associated with the target object in the first image, and outputting the recognition result.
In other embodiments of the present application, the one or more programs are executable by the one or more processors and further implement the steps of:
adding the pixels in the plurality of feature maps pixel by pixel along the channel direction to obtain a summation activation map;
determining a target position corresponding to the target object in the summation activation map;
and determining a second image corresponding to the target position in the first image, and extracting the second image.
In other embodiments of the present application, the one or more programs are executable by the one or more processors and further implement the steps of:
searching for a plurality of positions where pixels whose activation values in the summation activation map are greater than the target activation value are located;
determining a first position corresponding to a minimum activation value and a second position corresponding to a maximum activation value in a plurality of positions in the first direction;
determining a third position corresponding to the minimum activation value and a fourth position corresponding to the maximum activation value in a plurality of positions in the second direction; an included angle between the second direction and the first direction is a right angle; the target position includes a first position, a second position, a third position, and a fourth position.
In other embodiments of the present application, the one or more programs are executable by the one or more processors and further implement the steps of:
obtaining the maximum activation value among all the activation values corresponding to all the pixels in the summation activation map;
multiplying the maximum activation value by a preset parameter to obtain the target activation value;
and finding the plurality of positions in the summation activation map whose activation values are greater than the target activation value.
In other embodiments of the present application, the one or more programs are executable by the one or more processors and further implement the steps of:
obtaining a position mapping relationship between each pixel in the summation activation map and each pixel in the first image;
and determining a second image corresponding to the target position in the first image based on the position mapping relation.
In other embodiments of the present application, the target layer is a layer of the plurality of convolutional layers having a number of layers less than a target threshold.
In other embodiments of the present application, the target layer is the second layer of the plurality of convolutional layers, and the mapping relationship characterizes that the position of the same pixel in the summation activation map is the same as its position in the first image.
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
The computer storage medium/memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); it may also be any of various terminals that include one or any combination of the above-mentioned memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing module, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit. Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. An image recognition method, characterized in that the method comprises:
obtaining a first image to be identified;
inputting the first image into a convolutional neural network model to obtain a plurality of feature maps corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model;
extracting a second image where a target object is located in the first image based on the plurality of feature maps;
and inputting the second image into a classification model obtained through training to obtain a recognition result associated with the target object in the first image, and outputting the recognition result.
2. The method according to claim 1, wherein the extracting a second image in which a target object is located in the first image based on the plurality of feature maps comprises:
adding the pixels in the plurality of feature maps pixel by pixel along the channel direction to obtain a summation activation map;
determining a target position corresponding to the target object in the summation activation map;
and determining the second image corresponding to the target position in the first image, and extracting the second image.
3. The method of claim 2, wherein the determining the target position corresponding to the target object in the summation activation map comprises:
searching for a plurality of positions where a plurality of pixels whose activation values in the summation activation map are greater than a target activation value are located;
determining a first position corresponding to a minimum activation value and a second position corresponding to a maximum activation value among the plurality of positions in the first direction;
determining a third position corresponding to a minimum activation value and a fourth position corresponding to a maximum activation value among the plurality of positions in the second direction; an included angle between the second direction and the first direction is a right angle; and the target position includes the first position, the second position, the third position, and the fourth position.
4. The method of claim 3, wherein the searching for the plurality of positions where the pixels whose activation values in the summation activation map are greater than the target activation value are located comprises:
obtaining the maximum activation value among all the activation values corresponding to all the pixels in the summation activation map;
multiplying the maximum activation value by a preset parameter to obtain the target activation value;
and finding the plurality of positions in the summation activation map whose activation values are greater than the target activation value.
5. The method of any of claims 2 to 4, wherein the determining the second image of the first image corresponding to the target position comprises:
obtaining a position mapping relationship between each pixel in the summation activation map and each pixel in the first image;
and determining the second image corresponding to the target position in the first image based on the position mapping relationship.
6. The method of claim 5, wherein the target layer is a layer of the plurality of convolutional layers whose layer number is less than a target threshold.
7. The method of claim 5 or 6, wherein the target layer is the second layer of the plurality of convolutional layers, and the mapping relationship characterizes that the position of the same pixel in the summation activation map is the same as its position in the first image.
8. An image recognition apparatus, characterized in that the image recognition apparatus comprises:
an obtaining unit configured to obtain a first image to be recognized;
the first processing unit is used for inputting the first image into a convolutional neural network model to obtain a plurality of characteristic graphs corresponding to a target layer in a plurality of convolutional layers of the convolutional neural network model;
the second processing unit is used for extracting a second image where the target object is located in the first image based on the plurality of feature maps;
and the third processing unit is used for inputting the second image into the trained classification model, obtaining a recognition result associated with the target object in the first image, and outputting the recognition result.
9. An electronic device, characterized in that the electronic device comprises: a processor, a memory, and a communication bus;
the communication bus is used for realizing communication connection between the processor and the memory;
the processor is configured to execute an image recognition program stored in the memory to implement the steps of the image recognition method according to any one of claims 1 to 7.
10. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the image recognition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911347935.6A CN111144408A (en) | 2019-12-24 | 2019-12-24 | Image recognition method, image recognition device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111144408A true CN111144408A (en) | 2020-05-12 |
Family
ID=70519647
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911347935.6A Pending CN111144408A (en) | 2019-12-24 | 2019-12-24 | Image recognition method, image recognition device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111144408A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180144209A1 (en) * | 2016-11-22 | 2018-05-24 | Lunit Inc. | Object recognition method and apparatus based on weakly supervised learning |
US20190065817A1 (en) * | 2017-08-29 | 2019-02-28 | Konica Minolta Laboratory U.S.A., Inc. | Method and system for detection and classification of cells using convolutional neural networks |
WO2019101021A1 (en) * | 2017-11-23 | 2019-05-31 | 腾讯科技(深圳)有限公司 | Image recognition method, apparatus, and electronic device |
CN108229379A (en) * | 2017-12-29 | 2018-06-29 | 广东欧珀移动通信有限公司 | Image-recognizing method, device, computer equipment and storage medium |
US10223611B1 (en) * | 2018-03-08 | 2019-03-05 | Capital One Services, Llc | Object detection using image classification models |
CN109271878A (en) * | 2018-08-24 | 2019-01-25 | 北京地平线机器人技术研发有限公司 | Image-recognizing method, pattern recognition device and electronic equipment |
CN109948700A (en) * | 2019-03-19 | 2019-06-28 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating characteristic pattern |
CN110009626A (en) * | 2019-04-11 | 2019-07-12 | 北京百度网讯科技有限公司 | Method and apparatus for generating image |
CN110263809A (en) * | 2019-05-16 | 2019-09-20 | 华南理工大学 | Pond characteristic pattern processing method, object detection method, system, device and medium |
Non-Patent Citations (1)
Title |
---|
李阳等 (Li Yang et al.): "基于对象位置线索的弱监督图像语义分割方法" (Weakly supervised image semantic segmentation method based on object position cues) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783642A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Image identification method and device, electronic equipment and storage medium |
CN111783642B (en) * | 2020-06-30 | 2023-10-13 | 北京百度网讯科技有限公司 | Image recognition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108710847B (en) | Scene recognition method and device and electronic equipment | |
CN109255352B (en) | Target detection method, device and system | |
CN112750140B (en) | Information mining-based disguised target image segmentation method | |
Xie et al. | Multilevel cloud detection in remote sensing images based on deep learning | |
CN111476709B (en) | Face image processing method and device and electronic equipment | |
CN111160375B (en) | Three-dimensional key point prediction and deep learning model training method, device and equipment | |
CN111126140B (en) | Text recognition method, text recognition device, electronic equipment and storage medium | |
CN109214366B (en) | Local target re-identification method, device and system | |
CN109671020B (en) | Image processing method, device, electronic equipment and computer storage medium | |
CN112446270A (en) | Training method of pedestrian re-identification network, and pedestrian re-identification method and device | |
CN111179419B (en) | Three-dimensional key point prediction and deep learning model training method, device and equipment | |
CN110580487A (en) | Neural network training method, neural network construction method, image processing method and device | |
CN108875487B (en) | Training of pedestrian re-recognition network and pedestrian re-recognition based on training | |
CN112084917A (en) | Living body detection method and device | |
CN112348117A (en) | Scene recognition method and device, computer equipment and storage medium | |
CN110991443A (en) | Key point detection method, image processing method, key point detection device, image processing device, electronic equipment and storage medium | |
CN112836625A (en) | Face living body detection method and device and electronic equipment | |
CN114049512A (en) | Model distillation method, target detection method and device and electronic equipment | |
CN112232140A (en) | Crowd counting method and device, electronic equipment and computer storage medium | |
CN113869282A (en) | Face recognition method, hyper-resolution model training method and related equipment | |
CN117197405A (en) | Augmented reality method, system and storage medium for three-dimensional object | |
CN112084952B (en) | Video point location tracking method based on self-supervision training | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN113947613B (en) | Target area detection method, device, equipment and storage medium | |
CN111144408A (en) | Image recognition method, image recognition device, electronic equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240906 |