
CN110991533B - Image recognition method, recognition device, terminal device and readable storage medium - Google Patents

Image recognition method, recognition device, terminal device and readable storage medium

Info

Publication number
CN110991533B
CN110991533B (application number CN201911219591.0A)
Authority
CN
China
Prior art keywords
image
identified
feature
determining
indication information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911219591.0A
Other languages
Chinese (zh)
Other versions
CN110991533A (en)
Inventor
贾玉虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201911219591.0A priority Critical patent/CN110991533B/en
Publication of CN110991533A publication Critical patent/CN110991533A/en
Application granted granted Critical
Publication of CN110991533B publication Critical patent/CN110991533B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image recognition method, a recognition device, a terminal device and a readable storage medium. The method comprises the following steps: acquiring an image to be identified, and determining a global depth feature of the image to be identified; determining position indication information based on the image to be identified, wherein the position indication information is used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified; determining a depth feature of the image area indicated by the position indication information in the image to be identified, to obtain a local depth feature of the image to be identified; and determining, based on the global depth feature and the local depth feature, whether the category of the image to be identified is a target category. The method and the device avoid training a deep learning model with a large amount of training data and a long training time, and thereby speed up the development cycle of the terminal device to a certain extent.

Description

Image recognition method, recognition device, terminal device and readable storage medium
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to an image recognition method, a recognition device, terminal equipment and a readable storage medium.
Background
Currently, when identifying the category of an image, a deep learning model (for example, AlexNet, VGGNet or ResNet) is often deployed in a terminal device; the deep learning model is used to extract a global depth feature of the image to be identified, and the category of the image is then determined based on that global depth feature.
When the images to be identified are visually similar, the deep learning model must extract depth features that reflect image details in order to distinguish their categories. Ensuring that the model can extract such detail-rich depth features requires a large amount of training data and a long training time, which inevitably prolongs the development cycle of the terminal device.
Disclosure of Invention
In view of this, the embodiments of the present application provide an image recognition method, a recognition device, a terminal device, and a readable storage medium, which can distinguish the categories of similar images without training a deep learning model with a large amount of training data and a long training time, and which can speed up the development cycle of the terminal device to a certain extent.
A first aspect of an embodiment of the present application provides an image recognition method, including:
acquiring an image to be identified, and determining global depth characteristics of the image to be identified based on a first deep learning model;
determining position indication information based on the image to be identified, wherein the position indication information is used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified;
determining depth characteristics of an image area indicated by the position indication information in the image to be identified based on a second deep learning model so as to obtain local depth characteristics of the image to be identified;
and determining whether the category of the image to be identified is a target category based on the global depth feature and the local depth feature, wherein the target category is the category of images that contain the target object and whose scene is a preset scene.
A second aspect of an embodiment of the present application provides an image recognition apparatus, including:
the global feature module is used for acquiring an image to be identified and determining global depth features of the image to be identified based on the first deep learning model;
the position determining module is used for determining position indication information based on the image to be identified, wherein the position indication information is used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified;
the local feature module is used for determining the depth features of the image area indicated by the position indication information in the image to be identified based on a second deep learning model so as to obtain the local depth features of the image to be identified;
the identification module is used for determining whether the category of the image to be identified is a target category based on the global depth feature and the local depth feature, wherein the target category is the category of images that contain the target object and whose scene is a preset scene.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect when the processor executes the computer program.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above in the first aspect.
A fifth aspect of embodiments of the present application provides a computer program product comprising a computer program which, when executed by one or more processors, implements the steps of the method as described above in the first aspect.
From the above, the present application provides an image recognition method. First, the global depth feature of an image to be identified is determined based on a first deep learning model. Next, position indication information is determined, which indicates the area where the target object may be located if the image to be identified contains the target object. Then, the depth feature of the image area indicated by the position indication information is determined based on a second deep learning model (which may be the same as the first deep learning model) and taken as the local depth feature of the image to be identified. Finally, whether the category of the image to be identified is a target category is determined based on the global depth feature and the local depth feature, wherein the target category is the category of images that contain the target object and whose scene is a preset scene.
Therefore, the image recognition method provided by the application determines whether the category of the image to be recognized is the target category based on both the global depth feature and the depth feature of the area where the target object is likely to be located, rather than relying on the global depth feature alone. Moreover, even when images are visually similar, the image areas indicated by the position indication information usually differ markedly between images of the target category and images not of the target category. Consequently, the global depth feature does not need to capture the detail information of the image to be identified, and the depth feature of the indicated image area does not need to capture much detail either, so a large amount of training data and a long training time are not required to train the first and second deep learning models. The image recognition method provided by the application can therefore distinguish the categories of similar images without such heavy training, and can speed up the development cycle of the terminal device to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the following description will briefly introduce the drawings that are needed in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present application.
Fig. 1 is a flowchart of an image recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training process for performing the neural network model of step S102;
FIG. 3 is a schematic diagram of a process for obtaining candidate windows for indicating location indication information according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a P-Net network according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an R-Net network according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another image recognition method according to the second embodiment of the present application;
fig. 7 is a schematic structural diagram of an image recognition device according to a third embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The method provided by the embodiment of the application can be applied to a terminal device, and the terminal device includes, but is not limited to: smart phones, tablet computers, notebooks, desktop computers, cloud servers, etc.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when", "once", "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining", "in response to determining", "upon detecting the [described condition or event]" or "in response to detecting the [described condition or event]".
In addition, in the description of the present application, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
In order to illustrate the technical solutions described above, the following description is made by specific embodiments.
Example 1
Referring to fig. 1, the method for identifying an image according to the first embodiment of the present application is described below, and the method includes:
in step S101, an image to be identified is acquired, and global depth features of the image to be identified are determined based on a first deep learning model;
currently, a convolutional neural network (Convolutional Neural Networks, CNN) model is generally used to learn the features of an image, that is, the entire image is input into the CNN model, so as to obtain the global depth features of the image output by the CNN model. Common CNN models are AlexNet model, VGGNet model, google Inception Net model and ResNet model. The specific model architecture is prior art and will not be described in detail herein.
In this step S101, the global depth feature of the image to be identified may be obtained using an AlexNet model, VGGNet model, google Inception Net model, or ResNet model, which are commonly used in the prior art.
In addition, experiments show that the global depth feature obtained after downsampling the image to be identified is close to the global depth feature obtained by inputting the image to be identified into the first deep learning model directly, without downsampling. Therefore, to reduce the amount of computation, the image to be identified may first be downsampled and then input into the first deep learning model. That is, step S101 may include: downsampling the image to be identified, and inputting the downsampled image into the first deep learning model to obtain the global depth feature of the image to be identified output by the first deep learning model.
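As an illustration only (not part of the patent), step S101 might be sketched as follows in PyTorch, assuming a torchvision ResNet-18 backbone and a 224×224 downsampled input; any of the CNN models mentioned above could be substituted.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Assumed backbone for the "first deep learning model"; the patent only requires
# some pretrained CNN (AlexNet, VGGNet, Google Inception Net, ResNet, ...).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # keep the 512-d pooled feature, drop the classifier
backbone.eval()

def global_depth_feature(image: torch.Tensor, size: int = 224) -> torch.Tensor:
    """image: (3, H, W) tensor in [0, 1]; returns a 512-d global depth feature."""
    # Downsample first to reduce the amount of computation, as suggested in step S101.
    x = F.interpolate(image.unsqueeze(0), size=(size, size),
                      mode="bilinear", align_corners=False)
    with torch.no_grad():
        return backbone(x).squeeze(0)
```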
In step S102, position indication information is determined based on the image to be identified, the position indication information being used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified;
in step S102, it is necessary to estimate the possible position of the target object if the image to be identified includes the target object. It should be understood by those skilled in the art that, whether or not the image to be recognized actually includes the target object, the step S102 needs to give the position indication information.
According to the habit of a user to acquire an image to be identified by using a terminal device, a target object of interest is usually located in the middle area of the image to be identified, and therefore, the position information of the middle area of the image to be identified can be used as the position indication information.
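A minimal sketch of this centre-area heuristic (the half-width, half-height box is an assumed choice; the patent only says "middle area"):

```python
def center_region(width: int, height: int, ratio: float = 0.5):
    """Return (x1, y1, x2, y2) of a centred box covering `ratio` of each image side."""
    w, h = int(width * ratio), int(height * ratio)
    x1, y1 = (width - w) // 2, (height - h) // 2
    return x1, y1, x1 + w, y1 + h
```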
In addition, in the embodiment of the present application, the above-mentioned location indication information may be obtained by training a neural network model in advance (that is, the neural network model is used to estimate a location where a target object may exist in an image input to the neural network model), and a general process of training the neural network model is discussed below with reference to fig. 2.
Fig. 2 shows a schematic diagram of a training process of the neural network model X, which can be used to determine the possible positions of flowers in an image of a plant scene through the training process shown in fig. 2.
As shown in fig. 2, N sample images including flowers and having a plant scene can be obtained in advance, where each sample image corresponds to a label, each sample image is input into a neural network model X, and parameters of the neural network model X are continuously adjusted according to an output result of the neural network model X and the labels corresponding to each sample image respectively until the neural network model X can accurately identify the positions of the flowers in each sample image.
Through the training process shown in fig. 2, the trained neural network model X can identify the possible positions of flowers in an image of a plant scene. However, it should be understood by those skilled in the art that the neural network model X will still give position indication information when the image input into the trained model is a plant scene image without flowers, or even when the input image is not a plant scene image at all.
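For illustration, a training loop of the kind shown in fig. 2 might look like the sketch below; the smooth-L1 loss, the Adam optimiser and the assumption that model X directly regresses box coordinates are all choices made for the example, not details from the patent.

```python
import torch

def train_model_x(model_x, loader, epochs: int = 10, lr: float = 1e-3):
    """loader yields (images, boxes): sample images and their labelled flower boxes."""
    opt = torch.optim.Adam(model_x.parameters(), lr=lr)
    loss_fn = torch.nn.SmoothL1Loss()
    for _ in range(epochs):
        for images, boxes in loader:
            opt.zero_grad()
            pred = model_x(images)        # model X predicts where flowers may be
            loss = loss_fn(pred, boxes)   # compare with the labels of the sample images
            loss.backward()
            opt.step()                    # continuously adjust the parameters
    return model_x
```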
In addition, the possible position of the target object in the image to be identified may be determined by cascading a suggested network (P-Net) with an improved network (R-Net); for example, after training, the cascaded P-Net and R-Net can determine the possible position of a flower in the input image. Specifically, the position indication information may be determined by the method shown in fig. 3. That is, step S102 may include the following steps:
Step S1021: inputting the image to be identified into a trained suggested network P-Net, the P-Net outputting a candidate window for indicating the position indication information;
Step S1022: correcting the candidate window output by the P-Net based on a bounding-box regression algorithm (Bounding box regression) and a non-maximum suppression algorithm (NMS);
Step S1023: inputting the image to be identified and the corrected candidate window into a trained improved network R-Net to obtain a candidate window, output by the R-Net, that is to be corrected again;
Step S1024: correcting the candidate window output by the R-Net again based on the Bounding box regression and NMS algorithms to obtain a final candidate window for indicating the position indication information.
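The cascade of steps S1021–S1024 might be sketched as follows, assuming the networks return candidate boxes, regression offsets and scores, and using torchvision's `nms`; the helper names and interfaces are illustrative only.

```python
import torch
from torchvision.ops import nms

def refine(boxes, offsets, scores, iou_thr: float = 0.5):
    """Bounding box regression followed by non-maximum suppression (NMS)."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    corrected = boxes + offsets * torch.stack([w, h, w, h], dim=1)  # shift each window
    keep = nms(corrected, scores, iou_thr)                          # drop overlapping windows
    return corrected[keep], scores[keep]

def locate(image, p_net, r_net):
    boxes, offsets, scores = p_net(image)             # S1021: P-Net proposes candidate windows
    boxes, scores = refine(boxes, offsets, scores)    # S1022: first correction
    boxes, offsets, scores = r_net(image, boxes)      # S1023: R-Net re-evaluates the windows
    final_boxes, _ = refine(boxes, offsets, scores)   # S1024: final candidate window(s)
    return final_boxes
```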
Fig. 4 and 5 of the present embodiments discuss a specific P-Net and R-Net network architecture.
As shown in fig. 4, a specific P-Net network architecture is used. The input is a 3-channel 12×12 image. First, 10 convolution kernels of 3×3 followed by 2×2 max pooling (stride = 2) generate 10 feature maps of 5×5. Next, 16 convolution kernels of 3×3×10 generate 16 feature maps of 3×3. Then, 32 convolution kernels of 3×3×16 generate 32 feature maps of 1×1. From these 32 1×1 feature maps, 2 feature maps of 1×1 are generated for classification by 2 convolution kernels of 1×1×32; through 7 convolution kernels of 1×1×32, 9 feature maps of 1×1 are generated for regression-box judgement.
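A PyTorch module consistent with this description might look as follows; the 4-output regression head is an assumed, commonly used size and not a figure from the patent.

```python
import torch.nn as nn

class PNet(nn.Module):
    """Suggested network P-Net: fully convolutional, 3-channel 12x12 input."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, 3), nn.PReLU(),        # 10 feature maps of 10x10
            nn.MaxPool2d(2, 2, ceil_mode=True),     # 10 feature maps of 5x5
            nn.Conv2d(10, 16, 3), nn.PReLU(),       # 16 feature maps of 3x3
            nn.Conv2d(16, 32, 3), nn.PReLU(),       # 32 feature maps of 1x1
        )
        self.cls = nn.Conv2d(32, 2, 1)   # 2 maps of 1x1 for classification
        self.reg = nn.Conv2d(32, 4, 1)   # regression-box head (assumed 4 outputs)

    def forward(self, x):
        x = self.features(x)
        return self.cls(x), self.reg(x)
```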
As shown in fig. 5, a specific R-Net network architecture is used. The input is a 3-channel 24×24 image. First, 28 convolution kernels of 3×3 followed by 3×3 max pooling (stride = 2) generate 28 feature maps of 11×11. Next, 48 convolution kernels of 3×3×28 followed by 3×3 max pooling (stride = 2) generate 48 feature maps of 4×4. Then, 64 convolution kernels of 2×2×48 generate 64 feature maps of 3×3. Finally, the 3×3×64 feature maps are flattened into a 128-dimensional fully connected layer, which is followed by a fully connected layer for the regression-box classification problem and a fully connected layer for the bounding-box position regression problem.
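A corresponding R-Net sketch is given below; as with P-Net, the 2-way classification and 4-output regression heads are assumptions made for the example.

```python
import torch.nn as nn

class RNet(nn.Module):
    """Improved network R-Net: 3-channel 24x24 input."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 28, 3), nn.PReLU(),
            nn.MaxPool2d(3, 2, ceil_mode=True),     # 28 feature maps of 11x11
            nn.Conv2d(28, 48, 3), nn.PReLU(),
            nn.MaxPool2d(3, 2, ceil_mode=True),     # 48 feature maps of 4x4
            nn.Conv2d(48, 64, 2), nn.PReLU(),       # 64 feature maps of 3x3
        )
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(64 * 3 * 3, 128), nn.PReLU())
        self.cls = nn.Linear(128, 2)   # regression-box classification
        self.reg = nn.Linear(128, 4)   # bounding-box position regression

    def forward(self, x):
        x = self.fc(self.features(x))
        return self.cls(x), self.reg(x)
```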
In step S103, determining depth features of the image area indicated by the position indication information in the image to be identified based on a second deep learning model, so as to obtain local depth features of the image to be identified;
the specific implementation process of the step S103 is substantially the same as that of the step S101, except that the image according to the step S101 is the whole image to be identified, the image according to the step S103 is a partial image area in the image to be identified, that is, the image area indicated by the position indication information may be input into the second deep learning model, so as to obtain the depth feature output by the second deep learning model.
As in step S101, in order to reduce the amount of computation, the image area indicated by the position indication information may be downsampled first, and the depth feature of the downsampled image area may then be obtained as the local depth feature of the image to be identified.
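Continuing the earlier sketch, step S103 might crop the indicated region and reuse the same backbone; the reuse of `global_depth_feature` from the previous example is an assumption made for illustration.

```python
def local_depth_feature(image, box, size: int = 224):
    """image: (3, H, W) tensor; box: (x1, y1, x2, y2) from the position indication."""
    x1, y1, x2, y2 = (int(v) for v in box)
    region = image[:, y1:y2, x1:x2]          # image area indicated by the candidate window
    # Downsample the region and extract its depth feature, as in step S101.
    return global_depth_feature(region, size)
```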
In addition, in order to reduce the storage space occupied on the terminal device, the second deep learning model may be the same model as the first deep learning model; as those skilled in the art will readily understand, using the same model also further shortens the development cycle of the terminal device.
In step S104, based on the global depth feature and the local depth feature, it is determined whether the class of the image to be identified is a target class, where the target class is the class of images that contain the target object and whose scene is a preset scene;
In this embodiment of the present application, step S104 may be performed using a recognition model (for example, a support vector machine (SVM) classifier). That is, the global depth feature and the local depth feature are input into the classifier, and the class of the image to be recognized is determined based on the classifier (for example, the classifier may output which of the preset classes the image to be recognized belongs to), so as to determine whether the image to be recognized is of the target class.
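A minimal sketch of step S104 with a scikit-learn SVM as the recognition model; the RBF kernel and the training interface are assumptions, since the patent only requires some trained recognition model.

```python
import numpy as np
from sklearn.svm import SVC

def train_recognizer(global_feats, local_feats, labels):
    """Each argument lists one entry per training image; labels are category ids."""
    X = np.hstack([np.asarray(global_feats), np.asarray(local_feats)])
    return SVC(kernel="rbf").fit(X, labels)

def is_target_category(clf, global_feat, local_feat, target_label) -> bool:
    fused = np.concatenate([global_feat, local_feat]).reshape(1, -1)
    return clf.predict(fused)[0] == target_label   # which preset class the image belongs to
```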
With this method, the different categories of visually similar images can be accurately identified; for example, the images to be identified may all be images of potted-plant scenes, some of which contain flowers and some of which do not.
It should be noted that the first embodiment of the present application uses a potted-plant scene as an example, but those skilled in the art will understand that the application scenario of the image recognition method is not limited to potted-plant recognition; the method may be applied to any scenario in which the images to be recognized are relatively similar. Specifically, the image recognition method provided in the first embodiment determines whether the category of the image to be recognized is the target category based on both the global depth feature and the depth feature of the area where the target object may be located, instead of relying solely on the global depth feature. The global depth feature therefore does not need to represent the detail information of the image to be identified; and because the image areas indicated by the position indication information usually differ markedly between images of the target category and images not of the target category, the depth feature of the indicated image area does not need to represent much detail either. As a result, a large amount of training data and a long training time are not required to train the first and second deep learning models, and the image recognition method provided by the application can speed up the development cycle of the terminal device to a certain extent.
Example two
Referring to fig. 6, another image recognition method provided in the second embodiment of the present application is described below, and the method includes:
in step S201, an image to be identified is acquired, and global depth features of the image to be identified are determined based on a first deep learning model;
In step S202, position indication information is determined based on the image to be identified, the position indication information being used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified;
in step S203, determining depth features of the image area indicated by the position indication information in the image to be identified based on a second deep learning model, so as to obtain local depth features of the image to be identified;
the specific implementation manner of the steps S201 to S203 is identical to that of the steps S101 to S103 in the first embodiment, and may be specifically referred to the description of the first embodiment, and will not be repeated here.
In step S204, an artificial feature of the image to be identified is determined, and whether the category of the image to be identified is a target category is determined based on the artificial feature, the global depth feature and the local depth feature, wherein the target category is the category of images that contain the target object and whose scene is a preset scene;
unlike the first embodiment, this second embodiment further relies on the artificial features of the image to be identified to determine the category of the image to be identified. The artificial features may be color histogram features, texture descriptor features, spatial envelope features, scale invariant feature transforms, and/or directional gradient histogram features, etc.
Several artificial features of the present solution are described in detail below:
1) Color histogram features: color histogram features can be applied to image retrieval and scene classification; they are simple, efficient and easy to compute, and their main advantage is that they are invariant to translation and to rotation about the viewing axis. However, color histogram features are sensitive to small illumination variations and to quantization errors.
2) Texture descriptor features: common texture descriptor features include gray level co-occurrence matrix, gabor features, local binary pattern features, etc., which are very effective in identifying texture scene images, especially those with repetitive arrangement characteristics.
3) Spatial envelope features: the spatial envelope features provide a global description of the spatial structure, representing the dominant dimensions and directions of the scene. Specifically, in the standard spatial envelope feature, the image is first convolved with a set of steerable pyramid filters and then divided into a 4×4 grid, and an orientation histogram is extracted for each grid cell. Because of their simplicity and efficiency, spatial envelope features are widely used for scene representation.
4) Scale invariant feature transform: the scale-invariant feature transform describes sub-regions by the gradient information around detected keypoints. The standard scale-invariant feature transform, also known as the sparse scale-invariant feature transform, is a combination of keypoint detection and histogram-based gradient representation. It generally has four steps: scale-space extremum search, sub-pixel keypoint refinement, dominant orientation assignment and feature description. In addition to the sparse scale-invariant feature transform, dense variants exist, such as Speeded-Up Robust Features (SURF). The scale-invariant feature transform is highly distinctive and invariant to scale, rotation and illumination variations.
5) Directional gradient histogram features: the directional gradient histogram feature represents an object by calculating the distribution of gradient intensities and directions in a spatially distributed sub-region, which has been accepted as one of the best features to capture the edge or local shape information of the object.
Which artificial feature to select can be determined according to the application scenario of the image recognition; each of the above artificial features improves the recognition rate in its suitable scenario. Generally speaking, the depth features obtained with a deep learning model already reflect the texture of the image to some extent, so, for better recognition of the image category, the artificial feature in step S204 may be chosen to be a feature other than a texture descriptor feature, such as a color histogram feature.
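As one concrete possibility, a color histogram feature of the kind recommended above could be computed as in the sketch below (the 8-bins-per-channel layout is an assumed choice):

```python
import numpy as np

def color_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """image: (H, W, 3) uint8 RGB array; returns a normalised 3*bins-dimensional feature."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(hists).astype(np.float64)
    return hist / max(hist.sum(), 1.0)   # invariant to translation and in-plane rotation
```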
It should be understood by those skilled in the art that, in the second embodiment of the present application, the step of acquiring the artificial feature is performed in step S204, but the present application is not limited to the specific execution sequence of "acquiring the artificial feature".
In a second embodiment of the present application, the determining whether the category of the image to be identified is the target category based on the artificial feature, the global depth feature, and the local depth feature may include:
splicing the artificial feature, the global depth feature and the local depth feature to obtain a feature vector;
and inputting the feature vector into a trained recognition model to obtain a recognition result which is output by the recognition model and is used for indicating the type of the image to be recognized.
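These two sub-steps might be sketched as follows, reusing the helper names introduced in the earlier examples (all of which are assumptions, not names from the patent):

```python
import numpy as np

def fused_feature_vector(artificial, global_feat, local_feat) -> np.ndarray:
    """Splice the artificial, global depth and local depth features into one vector."""
    return np.concatenate([np.ravel(artificial),
                           np.ravel(global_feat),
                           np.ravel(local_feat)])

# Usage sketch: feed the spliced vector to the trained recognition model.
# vec = fused_feature_vector(color_histogram(img_rgb), global_feat, local_feat)
# category = recognition_model.predict(vec.reshape(1, -1))[0]
```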
Compared with the first embodiment, this embodiment additionally relies on the artificial feature of the image to be identified, so the category of the image to be identified can, to a certain extent, be identified more accurately.
Example III
The third embodiment of the application provides an image recognition device. For convenience of explanation, only a portion relevant to the present application is shown, and as shown in fig. 7, the image recognition apparatus 300 includes:
the global feature module 301 is configured to obtain an image to be identified, and determine global depth features of the image to be identified based on a first deep learning model;
a position determining module 302, configured to determine, based on the image to be identified, position indication information, where the position indication information is used to indicate: if the image to be identified contains a target object, the position of the target object in the image to be identified;
a local feature module 303, configured to determine depth features of an image area indicated by the position indication information in the image to be identified based on a second deep learning model, so as to obtain local depth features of the image to be identified;
the identifying module 304 is configured to determine whether the class of the image to be identified is a target class based on the global depth feature and the local depth feature, where the target class is a class including the target object, and the scene is an image class under a preset scene.
Optionally, the location determining module 302 includes:
the P-Net unit is used for inputting the image to be identified into a trained suggested network P-Net, and the P-Net outputs a candidate window for indicating the position indication information;
a correction unit, configured to correct the candidate window output by the P-Net based on a boundary window regression algorithm Bounding box regression and a non-maximum suppression algorithm NMS;
the R-Net unit is used for inputting the image to be identified and the candidate window corrected by the Bounding box regression and NMS algorithms into a trained improved network R-Net to obtain a candidate window, output by the R-Net, that is to be corrected again;
and the re-correction unit is used for re-correcting the candidate window output by the R-Net based on the Bounding box regression and NMS algorithm to obtain a final candidate window for indicating the position indication information.
Optionally, the global feature module 301 is specifically configured to:
and downsampling the image to be identified, and inputting the downsampled image to the first deep learning model to obtain the global depth characteristics of the image to be identified, which are output by the first deep learning model.
Optionally, the image recognition apparatus 300 further includes:
the artificial feature module is used for determining the artificial feature of the image to be identified;
accordingly, the identification module 304 is specifically configured to:
and determining whether the category of the image to be identified is a target category based on the artificial feature, the global depth feature and the local depth feature.
Optionally, the identification module 304 includes:
the splicing unit is used for splicing the artificial feature, the global depth feature and the local depth feature to obtain a feature vector;
the recognition unit is used for inputting the feature vector into the trained recognition model to obtain a recognition result, output by the recognition model, that indicates the category of the image to be recognized.
Optionally, the above artificial feature module is specifically configured to:
and determining the color histogram characteristics of the image to be identified.
It should be noted that, because the content of the information interaction and the execution process between the devices/units is based on the same concept as the first embodiment and the second embodiment of the method, specific functions and technical effects thereof may be referred to in the corresponding method embodiment section, and will not be described herein.
Example IV
Fig. 8 is a schematic diagram of a terminal device provided in a fourth embodiment of the present application. As shown in fig. 8, the terminal device 400 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented when the processor 401 executes the computer program 403 described above. Alternatively, the processor 401 may implement the functions of the modules/units in the above-described embodiments of the apparatus when executing the computer program 403.
Illustratively, the computer program 403 may be divided into one or more modules/units, which are stored in the memory 402 and executed by the processor 401 to complete the present application. The one or more modules/units may be a series of instruction segments of a computer program capable of performing a specific function, which instruction segments are used to describe the execution of the computer program 403 in the terminal device 400. For example, the computer program 403 may be divided into a global feature module, a location determination module, a local feature module, and an identification module, where each module specifically functions as follows:
acquiring an image to be identified, and determining global depth characteristics of the image to be identified based on a first deep learning model;
determining position indication information based on the image to be identified, wherein the position indication information is used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified;
determining depth characteristics of an image area indicated by the position indication information in the image to be identified based on a second deep learning model so as to obtain local depth characteristics of the image to be identified;
and determining whether the category of the image to be identified is a target category based on the global depth feature and the local depth feature, wherein the target category is the category of images that contain the target object and whose scene is a preset scene.
The terminal device may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 8 is merely an example of the terminal device 400 and is not intended to limit it; the terminal device 400 may include more or fewer components than shown, combine certain components, or have different components. For example, the terminal device may also include input-output devices, network access devices, buses, etc.
The processor 401 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 402 may be an internal storage unit of the terminal device 400, for example, a hard disk or a memory of the terminal device 400. The memory 402 may be an external storage device of the terminal device 400, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided in the terminal device 400. Further, the memory 402 may also include both an internal storage unit and an external storage device of the terminal device 400. The memory 402 is used for storing the computer program and other programs and data required for the terminal device. The memory 402 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units described above is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow of each of the above-described method embodiments, or may be implemented by a computer program to instruct related hardware, where the above-described computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of each of the above-described method embodiments. The computer program comprises computer program code, and the computer program code can be in a source code form, an object code form, an executable file or some intermediate form and the like. The computer readable medium may include: any entity or device capable of carrying the computer program code described above, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium described above can be appropriately increased or decreased according to the requirements of the jurisdiction's legislation and the patent practice, for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals according to the legislation and the patent practice.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (8)

1. An image recognition method, comprising:
acquiring an image to be identified, and determining global depth characteristics of the image to be identified based on a first deep learning model;
determining position indication information based on the image to be identified, wherein the position indication information is used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified;
determining depth features of the image area indicated by the position indication information in the image to be identified based on a second deep learning model so as to obtain local depth features of the image to be identified;
determining whether the category of the image to be identified is a target category based on the global depth feature and the local depth feature, wherein the target category is the category of an image that contains the target object and whose scene is a preset scene;
the determining the position indication information based on the image to be identified comprises the following steps:
inputting the image to be identified into a trained suggested network P-Net, and outputting a candidate window for indicating the position indication information by the P-Net;
correcting the candidate window of the P-Net output based on a boundary window regression algorithm Bounding box regression and a non-maximum suppression algorithm NMS;
inputting the image to be identified and the candidate window corrected by the boundary window-based regression algorithm Bounding box regression and the non-maximum suppression algorithm NMS algorithm into a trained improved network R-Net to obtain a candidate window for re-correction of the R-Net output;
and correcting the candidate window output by the R-Net again based on the boundary window-based regression algorithm Bounding box regression and a non-maximum suppression algorithm NMS algorithm to obtain a final candidate window for indicating the position indication information.
2. The image recognition method of claim 1, wherein the determining global depth features of the image to be recognized based on a first deep learning model comprises:
and downsampling the image to be identified, and inputting the downsampled image to the first deep learning model to obtain the global depth characteristics of the image to be identified, which are output by the first deep learning model.
3. The image recognition method according to any one of claims 1 to 2, characterized in that the image recognition method further comprises:
determining the artificial characteristics of the image to be identified;
accordingly, the determining, based on the global depth feature and the local depth feature, whether the category of the image to be identified is a target category includes:
and determining whether the category of the image to be identified is a target category based on the artificial feature, the global depth feature and the local depth feature.
4. The image recognition method of claim 3, wherein the determining whether the class of the image to be recognized is a target class based on the artificial feature, the global depth feature, and the local depth feature comprises:
splicing the artificial feature, the global depth feature and the local depth feature to obtain a feature vector;
and inputting the feature vector into a trained recognition model to obtain a recognition result which is output by the recognition model and is used for indicating the type of the image to be recognized.
5. The image recognition method of claim 3, wherein the determining the artificial feature of the image to be recognized comprises:
and determining the color histogram characteristics of the image to be identified.
6. An image recognition apparatus, comprising:
the global feature module is used for acquiring an image to be identified and determining global depth features of the image to be identified based on a first deep learning model;
the position determining module is used for determining position indication information based on the image to be identified, wherein the position indication information is used for indicating: if the image to be identified contains a target object, the position of the target object in the image to be identified;
the local feature module is used for determining depth features of the image area indicated by the position indication information in the image to be identified based on a second deep learning model so as to obtain local depth features of the image to be identified;
the identification module is used for determining whether the category of the image to be identified is a target category based on the global depth feature and the local depth feature, wherein the target category is the category of an image that contains the target object and whose scene is a preset scene;
the location determination module includes:
the P-Net unit is used for inputting the image to be identified into a trained suggested network P-Net, and the P-Net outputs a candidate window for indicating the position indication information;
a correction unit, configured to correct the candidate window output by the P-Net based on a boundary window regression algorithm Bounding box regression and a non-maximum suppression algorithm NMS;
the R-Net unit is used for inputting the image to be identified and the candidate window corrected by the boundary window regression algorithm Bounding box regression and the non-maximum suppression algorithm NMS algorithm into the trained improved network R-Net to obtain a candidate window for re-correction of the R-Net output;
and the re-correction unit is used for re-correcting the candidate window output by the R-Net based on the boundary window regression algorithm Bounding box regression and the non-maximum suppression algorithm NMS algorithm to obtain a final candidate window for indicating the position indication information.
7. Terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the image recognition method according to any one of claims 1 to 5 when the computer program is executed.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the image recognition method according to any one of claims 1 to 5.
CN201911219591.0A 2019-12-03 2019-12-03 Image recognition method, recognition device, terminal device and readable storage medium Active CN110991533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911219591.0A CN110991533B (en) 2019-12-03 2019-12-03 Image recognition method, recognition device, terminal device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911219591.0A CN110991533B (en) 2019-12-03 2019-12-03 Image recognition method, recognition device, terminal device and readable storage medium

Publications (2)

Publication Number Publication Date
CN110991533A CN110991533A (en) 2020-04-10
CN110991533B true CN110991533B (en) 2023-08-04

Family

ID=70089698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911219591.0A Active CN110991533B (en) 2019-12-03 2019-12-03 Image recognition method, recognition device, terminal device and readable storage medium

Country Status (1)

Country Link
CN (1) CN110991533B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814538B (en) * 2020-05-25 2024-03-05 北京达佳互联信息技术有限公司 Method and device for identifying category of target object, electronic equipment and storage medium
CN111783889B (en) * 2020-07-03 2022-03-01 北京字节跳动网络技术有限公司 Image recognition method and device, electronic equipment and computer readable medium
CN111666957B (en) * 2020-07-17 2023-04-25 湖南华威金安企业管理有限公司 Image authenticity identification method and device
CN112001152A (en) * 2020-08-25 2020-11-27 杭州大拿科技股份有限公司 Object recognition processing method, processing device, electronic device and storage medium
CN112241713B (en) * 2020-10-22 2023-12-29 江苏美克医学技术有限公司 Method and device for identifying vaginal microorganisms based on pattern recognition and deep learning
CN112541543B (en) * 2020-12-11 2023-11-24 深圳市优必选科技股份有限公司 Image recognition method, device, terminal equipment and storage medium
CN113362314B (en) * 2021-06-18 2022-10-18 北京百度网讯科技有限公司 Medical image recognition method, recognition model training method and device
CN113420696A (en) * 2021-07-01 2021-09-21 四川邮电职业技术学院 Odor generation control method and system and computer readable storage medium
CN114595352A (en) * 2022-02-25 2022-06-07 北京爱奇艺科技有限公司 Image identification method and device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229444A (en) * 2018-02-09 2018-06-29 天津师范大学 A kind of pedestrian's recognition methods again based on whole and local depth characteristic fusion
CN108288067A (en) * 2017-09-12 2018-07-17 腾讯科技(深圳)有限公司 Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN109492529A (en) * 2018-10-08 2019-03-19 中国矿业大学 A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion
CN110096933A (en) * 2018-01-30 2019-08-06 华为技术有限公司 The method, apparatus and system of target detection
CN110399822A (en) * 2019-07-17 2019-11-01 思百达物联网科技(北京)有限公司 Action identification method of raising one's hand, device and storage medium based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170000748A (en) * 2015-06-24 2017-01-03 삼성전자주식회사 Method and apparatus for face recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288067A (en) * 2017-09-12 2018-07-17 腾讯科技(深圳)有限公司 Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN110096933A (en) * 2018-01-30 2019-08-06 华为技术有限公司 The method, apparatus and system of target detection
CN108229444A (en) * 2018-02-09 2018-06-29 天津师范大学 A kind of pedestrian's recognition methods again based on whole and local depth characteristic fusion
CN109492529A (en) * 2018-10-08 2019-03-19 中国矿业大学 A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion
CN110399822A (en) * 2019-07-17 2019-11-01 思百达物联网科技(北京)有限公司 Action identification method of raising one's hand, device and storage medium based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PNET: a pixel-level TV-logo recognition network (PNET:像素级台标识别网络); 徐佳宇; 张冬明; 靳国庆; 包秀国; 袁庆升; 张勇东; Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), No. 10, pp. 97-108 *
Automatic identification of butterfly specimen images at the family level based on deep learning (基于深度学习的蝴蝶科级标本图像自动识别); 周爱明; 马鹏鹏; 席天宇; 王江宁; 冯晋; 邵泽中; 陶玉磊; 姚青; Acta Entomologica Sinica (昆虫学报), No. 11, pp. 107-116 *

Also Published As

Publication number Publication date
CN110991533A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110991533B (en) Image recognition method, recognition device, terminal device and readable storage medium
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
CN111310775A (en) Data training method and device, terminal equipment and computer readable storage medium
CN111080660A (en) Image segmentation method and device, terminal equipment and storage medium
CN112183212B (en) Weed identification method, device, terminal equipment and readable storage medium
CN110046622B (en) Targeted attack sample generation method, device, equipment and storage medium
CN109271842B (en) General object detection method, system, terminal and storage medium based on key point regression
CN113393487B (en) Moving object detection method, moving object detection device, electronic equipment and medium
US11094049B2 (en) Computing device and non-transitory storage medium implementing target object identification method
CN113128536A (en) Unsupervised learning method, system, computer device and readable storage medium
CN110738204A (en) Method and device for positioning certificate areas
CN111373393B (en) Image retrieval method and device and image library generation method and device
CN108960246B (en) Binarization processing device and method for image recognition
CN115223042A (en) Target identification method and device based on YOLOv5 network model
CN114359048A (en) Image data enhancement method and device, terminal equipment and storage medium
CN111199228B (en) License plate positioning method and device
CN114240935B (en) Space-frequency domain feature fusion medical image feature identification method and device
CN113034449B (en) Target detection model training method and device and communication equipment
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
US10832076B2 (en) Method and image processing entity for applying a convolutional neural network to an image
CN107704819B (en) Action identification method and system and terminal equipment
CN114004976A (en) LBP (local binary pattern) feature-based target identification method and system
US20240144633A1 (en) Image recognition method, electronic device and storage medium
CN112288748A (en) Semantic segmentation network training and image semantic segmentation method and device
CN116664990B (en) Camouflage target detection method, model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant