
CN115130543B - Image recognition method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115130543B
CN115130543B
Authority
CN
China
Prior art keywords
training
image
target
suspicious
visual field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210468883.3A
Other languages
Chinese (zh)
Other versions
CN115130543A (en)
Inventor
叶虎
韩骁
蔡德
肖凯文
马兆轩
周彦宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210468883.3A
Publication of CN115130543A
Application granted
Publication of CN115130543B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G06T 7/0012: Biomedical image inspection
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30004: Biomedical image processing
    • G06T 2207/30024: Cell structures in vitro; Tissue sections in vitro
    • G06T 2207/30096: Tumor; Lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image recognition method and device, a storage medium, and electronic equipment, which can be applied in the field of image processing. The method comprises the following steps: acquiring a target image; inputting the target image into a target neural network model; determining M suspicious field-of-view images from the target image through a field-of-view classification model in the target neural network model; and predicting from the M suspicious field-of-view images, through a whole-slide classification model in the target neural network model, a predicted image label for the target image. The method and device address the technical problem of low image recognition accuracy.

Description

Image recognition method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of computers, and in particular, to an image recognition method and apparatus, a storage medium, and an electronic device.
Background
In the related art, the quality of a product needs to be inspected before it leaves the factory, and flaw detection is an important part of that inspection, covering, for example, defects in the product's appearance (e.g., scratches and specks) or dead pixels in the display screen of an electronic product. Such defects are small, and the method commonly used at present is manual inspection, which is error-prone, of limited accuracy, and inefficient.
Also in the related art, cancer cells seriously threaten human health. In the field of cancer cell detection, a professional doctor performs pathological analysis on a full-field digital slice of a patient to determine whether the patient has cancer; this method depends on the availability of a specialist and suffers from low detection efficiency.
With the development of image recognition technology, image detection can be applied in many fields, but for small flaws on a product, or small cancer cells on a full-field digital slice, recognition accuracy remains low.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides an image recognition method and device, a storage medium and electronic equipment, so as to at least solve the technical problem of low image recognition accuracy.
According to an aspect of an embodiment of the present application, there is provided an image recognition method comprising: acquiring a target image, wherein the target image is an image to be recognized obtained by scanning a target object; and inputting the target image into a target neural network model, determining M suspicious field-of-view images from the target image through a target field-of-view classification model in the target neural network model, and predicting from the M suspicious field-of-view images, through a target whole-slide classification model in the target neural network model, a predicted image label for the target image, wherein M is greater than or equal to 1, and the M suspicious field-of-view images are the field-of-view images in the target image whose predicted suspicious probability is greater than or equal to a preset probability threshold. The target neural network model is obtained by training a neural network model to be trained with training sample images and training field-of-view images until the following convergence condition is met: a first loss condition is satisfied between the known image label of the training sample image and the predicted field-of-view label of a target suspicious field-of-view image, where the target suspicious field-of-view image is the field-of-view image with the largest predicted suspicious probability in the training sample image, as determined by the field-of-view classification model to be trained in the neural network model to be trained.
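By way of illustration only, the claimed inference flow can be sketched in a few lines of PyTorch. The names below (recognize, fov_model, slide_model) are not from the application; the sketch assumes a field-of-view model that outputs one suspicious logit per field-of-view image and a whole-slide model that outputs one logit for the whole image:

```python
import torch

def recognize(tiles, fov_model, slide_model, threshold=0.5):
    """Sketch of the claimed flow; tiles is an (N, C, H, W) tensor of
    field-of-view images cut from the target image."""
    with torch.no_grad():
        probs = torch.sigmoid(fov_model(tiles)).squeeze(-1)  # (N,) predicted suspicious probabilities
        suspicious = tiles[probs >= threshold]               # the M suspicious field-of-view images
        if suspicious.shape[0] == 0:
            return 0                                         # no suspicious field of view: label 0 (negative / no flaw)
        logit = slide_model(suspicious)                      # whole-slide prediction from the M images
        return int(torch.sigmoid(logit).item() >= 0.5)       # predicted image label (0 or 1)
```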
Optionally, the convergence condition further includes: a second loss condition is satisfied between the known image label of the training sample image and the predicted image label of the training sample image determined by the whole-slide classification model to be trained in the neural network model to be trained; and a third loss condition is satisfied between the known field-of-view label of the training field-of-view image and the predicted field-of-view label of the training field-of-view image determined by the field-of-view classification model to be trained.
Optionally, before inputting the target image into the target neural network model, the method further comprises: acquiring the training sample image and the training field-of-view image; and performing multiple rounds of joint training on the field-of-view classification model to be trained and the whole-slide classification model to be trained in the neural network model to be trained, using the training sample image and the training field-of-view image, to obtain the target neural network model. During training, if the neural network model to be trained does not meet the convergence condition, parameters of the field-of-view classification model to be trained and of the whole-slide classification model to be trained are adjusted; if it meets the convergence condition, training ends, the neural network model to be trained at the end of training is taken as the target neural network model, and the field-of-view classification model and whole-slide classification model at the end of training are taken as the target field-of-view classification model and target whole-slide classification model in the target neural network model, respectively.
Optionally, performing the multiple rounds of joint training on the field-of-view classification model to be trained and the whole-slide classification model to be trained, using the training sample image and the training field-of-view image, includes performing an i-th round of joint training, where i is a positive integer greater than or equal to 1 and the field-of-view classification model and whole-slide classification model obtained from round 0 are the untrained models in the neural network model to be trained, as follows: inputting the training field-of-view image into the field-of-view classification model obtained from round i-1, to obtain the predicted field-of-view label of the training field-of-view image determined in round i; inputting the training sample image into the field-of-view classification model obtained from round i-1 and the whole-slide classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and the predicted image label of the training sample image determined in round i; and, when the predicted field-of-view label of the training field-of-view image, the predicted field-of-view label of the target suspicious field-of-view image, and the predicted image label of the training sample image determined in round i satisfy the convergence condition, ending training and taking the field-of-view classification model and whole-slide classification model obtained from round i-1 as the target field-of-view classification model and target whole-slide classification model in the target neural network model, respectively.
Optionally, inputting the training sample image into the field-of-view classification model obtained from round i-1 and the whole-slide classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and the predicted image label of the training sample image determined in round i, includes: inputting the training sample image into the field-of-view classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and S suspicious training field-of-view images determined in round i, where S is greater than or equal to 1, the S suspicious training field-of-view images are the field-of-view images in the training sample image whose predicted suspicious probability is greater than or equal to the preset probability threshold, and the target suspicious field-of-view image is the training field-of-view image with the largest predicted suspicious probability in the training sample image; and inputting the S suspicious training field-of-view images into the whole-slide classification model obtained from round i-1, to obtain the predicted image label of the training sample image determined in round i.
Optionally, inputting the training sample image into the field-of-view classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and the S suspicious training field-of-view images determined in round i, includes: inputting the training sample image into the field-of-view classification model obtained from round i-1, and dividing the training sample image into N training field-of-view images through that model; determining, from the N training field-of-view images, those whose predicted suspicious probability is greater than or equal to the preset probability threshold, to obtain the S suspicious training field-of-view images determined in round i, where N is greater than or equal to S; and determining the training field-of-view image with the largest predicted suspicious probability among the N training field-of-view images as the target suspicious field-of-view image, and determining the label corresponding to its predicted suspicious probability as the predicted field-of-view label of the target suspicious field-of-view image.
Optionally, the i-th round of joint training further includes: inputting the predicted suspicious probability corresponding to the predicted field-of-view label of the target suspicious field-of-view image determined in round i and the known probability corresponding to the known image label of the training sample image into a first loss function, to obtain a first loss value; inputting the predicted suspicious probability corresponding to the predicted image label of the training sample image determined in round i and the known probability corresponding to the known image label of the training sample image into a second loss function, to obtain a second loss value; inputting the predicted suspicious probability corresponding to the predicted field-of-view label of the training field-of-view image determined in round i and the known probability corresponding to the known field-of-view label of the training field-of-view image into a third loss function, to obtain a third loss value; determining whether the first loss value satisfies the first loss condition, whether the second loss value satisfies the second loss condition, and whether the third loss value satisfies the third loss condition; and determining that the convergence condition is satisfied when the first, second, and third loss values satisfy the first, second, and third loss conditions, respectively.
Optionally, predicting the predicted image label of the target image from the M suspicious field-of-view images through the target whole-slide classification model includes: extracting features from the M suspicious field-of-view images through the target field-of-view classification model, to obtain M feature vectors; and inputting the average of the M feature vectors into a classifier in the target whole-slide classification model, which classifies the averaged feature vector to obtain the predicted image label of the target image.
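A minimal sketch of this whole-slide head is given below, assuming the field-of-view backbone yields 2048-dimensional features (as a ResNet50 would); the class name and dimensions are illustrative, not from the application:

```python
import torch
import torch.nn as nn

class WholeSlideHead(nn.Module):
    """Average the M per-tile feature vectors, then classify the average."""
    def __init__(self, feature_dim=2048, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, tile_features):            # tile_features: (M, feature_dim)
        mean_vector = tile_features.mean(dim=0)  # feature-average vector of the M feature vectors
        return self.classifier(mean_vector)      # logits for the predicted image label
```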
Optionally, determining the M suspicious field-of-view images from the target image through the target field-of-view classification model includes: dividing the target image into N field-of-view images through the target field-of-view classification model, where N is greater than or equal to M; and determining the M suspicious field-of-view images from the N field-of-view images through the target field-of-view classification model, where the M suspicious field-of-view images are the field-of-view images among the N whose predicted suspicious probability is greater than or equal to the preset probability threshold.
According to another aspect of the embodiments of the present application, there is also provided an image recognition apparatus, comprising: an acquisition module configured to acquire a target image, wherein the target image is an image to be recognized obtained by scanning a target object; and an input module configured to input the target image into a target neural network model, determine M suspicious field-of-view images from the target image through a target field-of-view classification model in the target neural network model, and predict from the M suspicious field-of-view images, through a target whole-slide classification model in the target neural network model, a predicted image label for the target image, wherein M is greater than or equal to 1, and the M suspicious field-of-view images are the field-of-view images in the target image whose predicted suspicious probability is greater than or equal to a preset probability threshold. The target neural network model is obtained by training a neural network model to be trained with training sample images and training field-of-view images until the following convergence condition is met: a first loss condition is satisfied between the known image label of the training sample image and the predicted field-of-view label of a target suspicious field-of-view image, where the target suspicious field-of-view image is the field-of-view image with the largest predicted suspicious probability in the training sample image, as determined by the field-of-view classification model to be trained.
According to yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described image recognition method when run.
According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the image recognition method as above.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the image recognition method described above by the computer program.
In the embodiments of the present application, the target neural network model comprises a target field-of-view classification model and a target whole-slide classification model. The target field-of-view classification model can determine, in the target image to be recognized, the M suspicious field-of-view images whose predicted suspicious probability is greater than or equal to a preset probability threshold, thereby locating within the target image the fields of view containing flaws or cancer cells. The target whole-slide classification model then predicts the image label of the target image from the M suspicious field-of-view images, which improves the accuracy of image recognition.
In addition, during training of the target neural network model, the first loss condition is enforced between the known label of the training sample image and the suspicious field-of-view image with the largest predicted suspicious probability determined by the field-of-view classification model to be trained. This adds a constraint loss to the training of that model, improves the recognition performance of the target neural network model, and thereby solves the technical problem of low image recognition accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative image recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative image recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative target neural network model structure, according to an embodiment of the present application;
FIG. 4 is a schematic illustration of an alternative target image according to an embodiment of the present application;
FIG. 5 is a training flow diagram of an alternative neural network model to be trained, according to an embodiment of the present application;
FIG. 6 is a schematic illustration of a positive field of view and a negative field of view with respect to cancer cells, according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of yet another alternative method according to an embodiment of the present application;
FIG. 8 is a schematic illustration of an alternative overall structure according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an alternative image recognition device according to an embodiment of the present application;
FIG. 10 is a block diagram of a computer system of an alternative electronic device according to an embodiment of the present application;
FIG. 11 is a schematic structural view of an alternative electronic device according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the present solution, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein, without inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the invention can be applied to various scenes such as cloud technology, artificial intelligence and the like.
First, some of the terms appearing in the description of the embodiments of the present application are explained below:
Full-field digital slice (whole slide image, abbreviated WSI): a pathological section that has been scanned and digitized to form a whole slide image.
Negative/positive: in a medical examination, a negative result generally represents a normal finding, while a positive result represents a problematic one. Although the terms originate in medicine, they are now generally used to indicate the presence or absence of a finding in a test result.
The present application is described below with reference to examples:
According to an aspect of an embodiment of the present invention, there is provided an image recognition method. As an optional implementation, the image recognition method may be applied, but is not limited, to the application environment shown in fig. 1, which may include: terminal device 102, network 110, and server 112.
Optionally, in this embodiment, the terminal device may include, but is not limited to, at least one of the following: a mobile phone (e.g., an Android phone or an iOS phone), a notebook computer, a tablet computer, a palm computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart television, a medical device, etc. The terminal device may be configured with a target client, which may be a game client, an instant messaging client, a browser client, a video client, a shopping client, etc. In this embodiment, the terminal device may include, but is not limited to: a memory 104, a processor 106, and a display 108. The memory 104 may be used to store data, for example the image to be recognized obtained by scanning the target object. The processor 106 may be configured to input the target image into a target neural network model. The display 108 may be used to display the target image as well as the predicted image label of the target image.
Alternatively, the network 110 may include, but is not limited to: a wired network, a wireless network, wherein the wired network comprises: local area networks, metropolitan area networks, and wide area networks, the wireless network comprising: bluetooth, WIFI, and other networks that enable wireless communications.
Alternatively, the server 112 may be a single server, a server cluster including a plurality of servers, or a cloud server. Server 112 may include, but is not limited to: a database 114 and a processing engine 116. The database 114 may be used to store data, such as the target images. The processing engine 116 may be configured to perform, among others, the following step:
Step S12: inputting the target image into a target neural network model, determining M suspicious field-of-view images from the target image through a target field-of-view classification model in the target neural network model, and predicting from the M suspicious field-of-view images, through a target whole-slide classification model in the target neural network model, a predicted image label for the target image. The above is merely an example and is not limiting in this embodiment.
Optionally, as shown in fig. 2, the image recognition method includes:
Step S202: acquiring a target image, wherein the target image is an image to be recognized obtained by scanning a target object;
The target object may be a product to be inspected, for example an electronic product or a household product, in which case the target image is an image of the product captured by an image acquisition device (for example, a camera).
The target object may also be a body part of an animal or a human, such as a diseased region of a patient, in which case the target image is a full-field digital slice obtained by scanning the diseased region with a medical device.
Step S204: inputting the target image into a target neural network model, determining M suspicious field-of-view images from the target image through a target field-of-view classification model in the target neural network model, and predicting from the M suspicious field-of-view images, through a target whole-slide classification model in the target neural network model, a predicted image label for the target image, wherein M is greater than or equal to 1, and the M suspicious field-of-view images are the field-of-view images in the target image whose predicted suspicious probability is greater than or equal to a preset probability threshold.
The target neural network model is obtained by training a neural network model to be trained with training sample images and training field-of-view images until the following convergence condition is met: a first loss condition is satisfied between the known image label of the training sample image and the predicted field-of-view label of a target suspicious field-of-view image, where the target suspicious field-of-view image is the field-of-view image with the largest predicted suspicious probability in the training sample image, as determined by the field-of-view classification model to be trained in the neural network model to be trained.
The target neural network model structure shown in fig. 3 includes a target field-of-view classification model and a target whole-slide classification model. The target image is input into the target field-of-view classification model, which outputs M suspicious field-of-view images; the M suspicious field-of-view images are then input into the target whole-slide classification model, which outputs the predicted image label of the target image. The predicted image label represents the classification of the target image. For example, in a product flaw detection scenario, the predicted image label may indicate whether a flaw exists in the target image and may be represented by 0 or 1, where 0 indicates no flaw and 1 indicates a flaw. In a cancer cell detection scenario, the predicted image label may indicate whether cancer cells exist in the target image, with 0 indicating no cancer cells (negative) and 1 indicating cancer cells (positive).
The target image may be divided into a plurality of field-of-view images by the target field-of-view classification model: as shown in fig. 4, the target image is divided into N field-of-view images, from which the M suspicious field-of-view images are determined.
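The division itself can be as simple as non-overlapping grid cropping; a sketch under that assumption (the application does not prescribe this exact routine):

```python
import torch

def split_into_fields(image, tile_size=256):
    # Cut a (C, H, W) image tensor into non-overlapping tile_size x tile_size
    # field-of-view images, as illustrated in fig. 4; edge remainders are dropped.
    c, h, w = image.shape
    tiles = [image[:, top:top + tile_size, left:left + tile_size]
             for top in range(0, h - tile_size + 1, tile_size)
             for left in range(0, w - tile_size + 1, tile_size)]
    return torch.stack(tiles)  # (N, C, tile_size, tile_size)
```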
In the product flaw detection scenario, the M suspicious field-of-view images indicate fields of view in which a flaw may exist. The likelihood of a flaw is expressed by the predicted suspicious probability, a value between 0 and 1; the larger the value, the higher the probability that a flaw is present.
In the cancer cell detection scenario, the M suspicious field-of-view images indicate fields of view in which cancer cells may exist. The likelihood of cancer cells is likewise expressed by the predicted suspicious probability, a value between 0 and 1; the larger the value, the higher the probability that cancer cells are present.
Optionally, the convergence condition further includes: a second loss condition is satisfied between the known image label of the training sample image and the predicted image label of the training sample image determined by the whole-slide classification model to be trained in the neural network model to be trained; and a third loss condition is satisfied between the known field-of-view label of the training field-of-view image and the predicted field-of-view label of the training field-of-view image determined by the field-of-view classification model to be trained.
As an alternative embodiment, the neural network model to be trained is the untrained neural network model or the neural network model during training. During training, the neural network model to be trained can be constrained by the following three loss conditions.
The first loss condition requires that the predicted field-of-view label of the suspicious field-of-view image with the largest predicted suspicious probability in the training sample image be consistent with the known image label of the training sample image. The field-of-view image with the largest predicted suspicious probability in the training sample image, as determined by the field-of-view classification model to be trained, is called the target suspicious field-of-view image. If the predicted suspicious probability of the target suspicious field-of-view image is 0.8, then 0.8 is the largest predicted suspicious probability in the training sample image, and the predicted suspicious probabilities of all other field-of-view images are at most 0.8.
In the product flaw detection scenario, suppose a predicted suspicious probability greater than or equal to 0.5 indicates that a flaw exists in a field-of-view image, and hence in the training sample image containing it. A target suspicious field-of-view image with a predicted suspicious probability of 0.8 (0.8 is only illustrative; the actual value depends on the data) therefore indicates a flaw in the target suspicious field-of-view image, and thus in the training sample image in which it lies. Because the known image label of the training sample image indicates whether a flaw actually exists (the label is 1 if a flaw exists and 0 otherwise), the first loss condition adds a constraint loss to the training of the field-of-view classification model to be trained, improving the recognition accuracy of the target field-of-view classification model in the target neural network model.
In the cancer cell detection scenario, suppose a predicted suspicious probability greater than or equal to 0.5 indicates that cancer cells exist in a field-of-view image, and hence in the training sample image containing it. A target suspicious field-of-view image with a predicted suspicious probability of 0.8 indicates cancer cells in the target suspicious field-of-view image, and thus in the training sample image in which it lies. Because the known image label of the training sample image indicates whether cancer cells actually exist (the label is 1 if they exist and 0 otherwise), the first loss condition likewise adds a constraint loss to the training of the field-of-view classification model to be trained.
The second loss condition requires that the known image label of the training sample image be consistent with the predicted image label of the training sample image determined by the whole-slide classification model to be trained.
In the product flaw detection scenario, suppose a predicted suspicious probability greater than or equal to 0.5 for the training sample image indicates a flaw in the training sample image. Because the known image label of the training sample image indicates whether a flaw actually exists (1 if a flaw exists, 0 otherwise), the second loss condition adds a constraint loss to the training of the whole-slide classification model to be trained, improving the recognition accuracy of the target whole-slide classification model in the target neural network model.
In the cancer cell detection scenario, suppose a predicted suspicious probability greater than or equal to 0.5 for the training sample image indicates cancer cells in the training sample image. Because the known image label of the training sample image indicates whether cancer cells actually exist (1 if they exist, 0 otherwise), the second loss condition likewise adds a constraint loss to the training of the whole-slide classification model to be trained.
The third loss condition requires that the known field-of-view label of the training field-of-view image be consistent with the predicted field-of-view label of the training field-of-view image determined by the field-of-view classification model to be trained.
In the product flaw detection scenario, suppose a predicted suspicious probability greater than or equal to 0.5 for a training field-of-view image indicates a flaw in that image. Because the known field-of-view label of the training field-of-view image indicates whether a flaw actually exists (1 if a flaw exists, 0 otherwise), the third loss condition adds a constraint loss to the training of the field-of-view classification model to be trained, improving the recognition accuracy of the target field-of-view classification model in the target neural network model.
In the cancer cell detection scenario, suppose a predicted suspicious probability greater than or equal to 0.5 for a training field-of-view image indicates cancer cells in that image. Because the known field-of-view label of the training field-of-view image indicates whether cancer cells actually exist (1 if they exist, 0 otherwise), the third loss condition likewise adds a constraint loss to the training of the field-of-view classification model to be trained.
Optionally, before inputting the target image into the target neural network model, the method further comprises: acquiring the training sample image and the training field-of-view image; and performing multiple rounds of joint training on the field-of-view classification model to be trained and the whole-slide classification model to be trained in the neural network model to be trained, using the training sample image and the training field-of-view image, to obtain the target neural network model. During training, if the neural network model to be trained does not meet the convergence condition, parameters of the field-of-view classification model to be trained and of the whole-slide classification model to be trained are adjusted; if it meets the convergence condition, training ends, the neural network model to be trained at the end of training is taken as the target neural network model, and the field-of-view classification model and whole-slide classification model at the end of training are taken as the target field-of-view classification model and target whole-slide classification model in the target neural network model, respectively.
As an alternative embodiment, the image label of each training sample image is known (1 if a flaw or cancer cell is present, 0 otherwise), and the field-of-view label of each training field-of-view image is likewise known (1 if a flaw or cancer cell is present, 0 otherwise).
The field-of-view classification model to be trained and the whole-slide classification model to be trained are jointly trained over multiple rounds using a small number of training field-of-view images with known field-of-view labels and a small number of training sample images with known image labels. The training flow of the neural network model to be trained, shown in fig. 5, comprises the following steps:
Step S500: acquire a training sample image and a training field-of-view image, where the image label of the training sample image and the field-of-view label of the training field-of-view image are both known (in each case, 1 if a flaw or cancer cell is present, 0 otherwise);
Step S501: input the training field-of-view image into the field-of-view classification model to be trained, which is a neural network model, to obtain the predicted field-of-view label of the training field-of-view image. In the product flaw detection scenario, the predicted field-of-view label indicates the predicted presence of a flaw in the training field-of-view image (1 if a flaw is predicted, 0 otherwise); in the cancer cell detection scenario, it indicates the predicted presence of cancer cells (1 if predicted, 0 otherwise);
Step S503: judge whether the predicted field-of-view label of the training field-of-view image and its known field-of-view label satisfy the third loss condition; if so, execute step S514, otherwise execute step S505;
Step S505: adjust the model parameters of the field-of-view classification model to be trained to obtain an updated model, and continue with steps S501 and S502;
Step S502: input the training sample image into the field-of-view classification model to be trained, to obtain S suspicious training field-of-view images, the field-of-view image with the largest predicted suspicious probability, and the predicted field-of-view label of that target suspicious field-of-view image. The field-of-view classification model to be trained divides the training sample image into a plurality of training field-of-view images and selects those whose predicted suspicious probability is greater than or equal to a preset probability threshold, yielding S suspicious training field-of-view images with S greater than or equal to 1; the threshold can be set according to the actual situation, for example 0.3, 0.4, or 0.5. The training field-of-view image with the largest predicted suspicious probability is taken as the target suspicious field-of-view image, and its predicted field-of-view label is determined from its predicted suspicious probability: in both the product flaw detection scenario and the cancer cell detection scenario, the label is 1 when the probability is greater than or equal to 0.5 (a flaw, or cancer cells, predicted in the training sample image) and 0 when it is less than 0.5 (no flaw, or no cancer cells, predicted);
Step S504: input the S suspicious training field-of-view images into the whole-slide classification model to be trained, to obtain the predicted image label of the training sample image;
Step S506: judge whether the predicted image label of the training sample image and its known image label satisfy the second loss condition; if so, execute step S514, otherwise execute step S508;
Step S508: adjust the model parameters of the whole-slide classification model to be trained to obtain an updated model, and continue with step S502;
Step S510: judge whether the predicted field-of-view label of the target suspicious field-of-view image and the known image label of the training sample image satisfy the first loss condition; if so, execute step S514, otherwise execute step S512;
Step S512: adjust the model parameters of the field-of-view classification model to be trained to obtain an updated model, and continue with steps S501 and S502;
Step S514: end training, obtaining the target neural network model, which comprises the trained target field-of-view classification model and target whole-slide classification model.
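One iteration of this flow can be condensed into the following sketch. All names are ours, the field-of-view model is assumed to output one suspicious logit per image, sample_label is a float tensor (0. or 1.), and binary cross-entropy is assumed for the three losses, which the application does not fix:

```python
import torch
import torch.nn.functional as F

def joint_training_step(fov_model, slide_model, sample_tiles, sample_label,
                        field_images, field_labels, threshold=0.5, top_s=32):
    # Step S501: predicted field-of-view labels for the annotated field images
    field_logits = fov_model(field_images).squeeze(-1)
    field_loss = F.binary_cross_entropy_with_logits(field_logits, field_labels)

    # Step S502: suspicious field-of-view images of the training sample image
    tile_probs = torch.sigmoid(fov_model(sample_tiles)).squeeze(-1)
    top_probs, top_idx = tile_probs.topk(min(top_s, tile_probs.numel()))
    keep = top_idx[top_probs >= threshold]   # the S suspicious training field-of-view images
    suspicious = sample_tiles[keep]          # the sketch assumes at least one tile qualifies

    # First loss condition: the most suspicious tile should agree with the
    # known image label of the training sample image
    constraint_loss = F.binary_cross_entropy(top_probs[0], sample_label)

    # Step S504: whole-slide prediction from the suspicious tiles
    slide_logit = slide_model(suspicious).squeeze()
    slide_loss = F.binary_cross_entropy_with_logits(slide_logit, sample_label)

    # Steps S503 / S506 / S510 compare these three losses against their conditions
    return field_loss + slide_loss + constraint_loss
```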
Optionally, performing the multiple rounds of joint training on the field-of-view classification model to be trained and the whole-slide classification model to be trained, using the training sample image and the training field-of-view image, includes performing an i-th round of joint training, where i is a positive integer greater than or equal to 1 and the field-of-view classification model and whole-slide classification model obtained from round 0 are the untrained models in the neural network model to be trained, as follows: inputting the training field-of-view image into the field-of-view classification model obtained from round i-1, to obtain the predicted field-of-view label of the training field-of-view image determined in round i; inputting the training sample image into the field-of-view classification model obtained from round i-1 and the whole-slide classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and the predicted image label of the training sample image determined in round i; and, when the predicted field-of-view label of the training field-of-view image, the predicted field-of-view label of the target suspicious field-of-view image, and the predicted image label of the training sample image determined in round i satisfy the convergence condition, ending training and taking the field-of-view classification model and whole-slide classification model obtained from round i-1 as the target field-of-view classification model and target whole-slide classification model, respectively.
Optionally, inputting the training sample image into the field-of-view classification model obtained from round i-1 and the whole-slide classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and the predicted image label of the training sample image determined in round i, includes: inputting the training sample image into the field-of-view classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and S suspicious training field-of-view images determined in round i, where S is greater than or equal to 1, the S suspicious training field-of-view images are the field-of-view images in the training sample image whose predicted suspicious probability is greater than or equal to the preset probability threshold, and the target suspicious field-of-view image is the training field-of-view image with the largest predicted suspicious probability in the training sample image; and inputting the S suspicious training field-of-view images into the whole-slide classification model obtained from round i-1, to obtain the predicted image label of the training sample image determined in round i.
Optionally, inputting the training sample image into the field-of-view classification model obtained from round i-1, to obtain the predicted field-of-view label of the target suspicious field-of-view image determined in round i and the S suspicious training field-of-view images determined in round i, includes: inputting the training sample image into the field-of-view classification model obtained from round i-1, and dividing the training sample image into N training field-of-view images through that model; determining, from the N training field-of-view images, those whose predicted suspicious probability is greater than or equal to the preset probability threshold, to obtain the S suspicious training field-of-view images determined in round i, where N is greater than or equal to S; and determining the training field-of-view image with the largest predicted suspicious probability among the N training field-of-view images as the target suspicious field-of-view image, and determining the label corresponding to its predicted suspicious probability as the predicted field-of-view label of the target suspicious field-of-view image.
As an optional implementation, the i-th round is any round of the multi-round training of the neural network model to be trained, and round i-1 is the round preceding it.
Before the field-of-view classification model to be trained is trained, the training field-of-view images must be annotated. Their size can be set according to the actual situation, for example 256x256. The annotated known field-of-view label can be represented by 0 or 1: in the product flaw detection scenario, 0 denotes a flawless field of view and 1 a flawed one; in the cancer cell detection scenario, 0 denotes a negative field of view and 1 a positive one, i.e., the image contains abnormalities. The abnormalities cover 11 lesion types, including the following cell lesions: ASCUS (atypical squamous cells of undetermined significance), LSIL (low-grade squamous intraepithelial lesion), ASC-H (atypical squamous cells, cannot exclude HSIL), HSIL (high-grade squamous intraepithelial lesion), SCC (squamous cell carcinoma), AdC (adenocarcinoma), and AGC (atypical glandular cells). A positive field-of-view image (containing LSIL lesion cells) and a negative field-of-view image are shown in fig. 6; the positive field-of-view image contains cancer cells, while the cells in the negative field-of-view image are normal.
The field-of-view classification model to be trained may adopt an image classification model, for example ResNet50, or another classification model such as EfficientNet. The final classifier of the model can be configured for 2-class output: in the product flaw detection scenario, the two classes indicate whether a flaw exists in the field-of-view image; in the cancer cell detection scenario, they indicate whether the field-of-view image is positive or negative. The number of training field-of-view images may be small, and the model may be initialized with weights pre-trained on the ImageNet dataset.
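For example, with a recent torchvision this initialization can be written as follows (a sketch: the application names ResNet50 and ImageNet initialization but no specific API):

```python
import torch.nn as nn
from torchvision import models

def build_fov_classifier():
    # ResNet50 backbone with ImageNet pre-trained weights and a 2-class head
    # (flaw / no flaw, or positive / negative field of view).
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 2)
    return model
```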
For the training sample images, as shown in the flowchart of fig. 7, a conventional image segmentation method may first extract the foreground region (for example, the cell region), which is then divided, on a grid, into field-of-view images of a fixed size chosen according to the actual situation (for example, 256x256). The field-of-view classification model obtained from round i-1 processes the training sample image to produce N training field-of-view images and their predicted suspicious probabilities. In the product flaw detection scenario, these probabilities represent the likelihood of a flaw in each training field-of-view image; in the cancer cell detection scenario, they represent the likelihood (0-1) of cancer cells. The N training field-of-view images are sorted in descending order of predicted suspicious probability, and the top S fields are selected as the suspicious positive region of the training sample image, where S can be chosen according to the actual situation, for example 32 (i.e., the 32 most suspicious training field-of-view images are selected). The field-of-view classification model obtained from round i-1 also yields the predicted field-of-view label of the target suspicious field-of-view image, the one with the largest predicted suspicious probability in the training sample image. The S suspicious training field-of-view images are then input into the whole-slide classification model obtained from round i-1, which produces the predicted image label of the training sample image (representable by 0 or 1).
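The sort-and-select step reads, under the same naming assumptions as the earlier sketches:

```python
import torch

def select_suspicious_fields(fov_model, tiles, s=32):
    # Score all N training field-of-view images, sort by predicted suspicious
    # probability in descending order, and keep the top S as the suspicious
    # positive region of the training sample image.
    with torch.no_grad():
        probs = torch.sigmoid(fov_model(tiles)).squeeze(-1)  # (N,) probabilities
    order = probs.argsort(descending=True)
    top = order[:s]
    # order[0] indexes the target suspicious field-of-view image
    return tiles[top], probs[top], order[0]
```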
Training ends and the target neural network model is obtained when the predicted visual field label of the training visual field image, the predicted visual field label of the target suspicious visual field image, and the predicted image label of the training sample image satisfy the convergence condition (the first loss condition, the second loss condition and the third loss condition).
Optionally, in the ith round of joint training, the method further includes: inputting the predicted suspicious probability corresponding to the predicted visual field label of the target suspicious visual field image determined by the ith training and the known probability corresponding to the known image label of the training sample image into a first loss function to obtain a first loss value; inputting the predicted suspicious probability corresponding to the predicted image label of the training sample image and the known probability corresponding to the known image label of the training sample image determined by the ith training into a second loss function to obtain a second loss value; inputting the predicted suspicious probability corresponding to the predicted visual field label of the training visual field image and the known probability corresponding to the known visual field label of the training visual field image determined by the ith training into a third loss function to obtain a third loss value; determining whether the first loss value satisfies the first loss condition, whether the second loss value satisfies the second loss condition, and whether the third loss value satisfies the third loss condition; and determining that the predicted view label of the training view image determined by the ith training, the predicted view label of the target suspicious view image determined by the ith training and the predicted image label of the training sample image determined by the ith training meet the convergence condition when the first loss value meets the first loss condition, the second loss value meets the second loss condition and the third loss value meets the third loss condition.
For training the full-scale classification model, a joint training method may be adopted, with the overall structure shown in FIG. 8. Each training batch contains, besides the S suspicious visual field images extracted from the training sample image, the already-annotated training visual field images; likewise, the visual field classification loss needs to be computed through the visual field classification model. By default, each training batch contains one training sample image and two annotated training visual field images, so a joint training is formed whose loss comprises: the classification loss of the training sample image and the classification loss of the training visual field images. The classification loss of the training visual field images plays the role of a regularization method and improves the model effect. Further, a constraint is added here: the largest predicted suspicious probability among the visual field images of a training sample image should be consistent with the known image label of that training sample image. Specifically, in each training iteration, the visual field classification model first computes the predicted suspicious probabilities of the suspicious visual field images in the training sample image; the visual field with the largest predicted suspicious probability is then selected, and the classification loss between that probability and the known image label of the training sample image is computed; this loss is called the visual field constraint loss for short. The joint training loss therefore contains three parts in total: the classification loss of the training sample image (second loss condition), the classification loss of the training visual field image (third loss condition), and the visual field constraint loss of the training sample image (first loss condition).
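A minimal sketch of combining these three loss parts (the tensor names and shapes are assumptions for illustration, not taken from the patent):

import torch
import torch.nn.functional as F

def joint_loss(slide_logits, slide_label,          # (1, 2) and (1,) long
               field_logits, field_labels,         # labeled training fields
               suspicious_probs, slide_label_f):   # probs and 0-dim float label
    # Classification loss of the training sample image (second loss condition).
    slide_loss = F.cross_entropy(slide_logits, slide_label)
    # Classification loss of the labeled training field images (third loss
    # condition); acts as a regularizer for the field classification model.
    field_loss = F.cross_entropy(field_logits, field_labels)
    # Field constraint loss (first loss condition): the largest predicted
    # suspicious probability should agree with the known slide-level label.
    constraint_loss = F.binary_cross_entropy(
        suspicious_probs.max(), slide_label_f)
    return slide_loss + field_loss + constraint_loss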
As an alternative embodiment, the first, second and third loss functions may be existing loss functions, such as the cross entropy loss function:

L = -[y*log(p) + (1-y)*log(1-p)]

where L represents the loss value, y represents the known label, and p is the predicted probability that the label is y.
The cross entropy loss function of the first loss function is:
L1 = -[y1*log(p1) + (1-y1)*log(1-p1)]

where L1 represents the first loss value, y1 represents the known probability corresponding to the known image label of the training sample image (1 or 0; 1 indicates the presence of a flaw or cancer cells, 0 indicates none), and p1 is the predicted suspicious probability that the predicted visual field label of the target suspicious visual field image is y1.
The cross entropy loss function of the second loss function is:
L2 = -[y2*log(p2) + (1-y2)*log(1-p2)]

where L2 represents the second loss value, y2 represents the known probability corresponding to the known image label of the training sample image (1 or 0; 1 indicates the presence of a flaw or cancer cells, 0 indicates none), and p2 is the predicted suspicious probability that the predicted image label of the training sample image is y2.
The cross entropy loss function of the third loss function is:
L3 = -[y3*log(p3) + (1-y3)*log(1-p3)]

where L3 represents the third loss value, y3 represents the known probability corresponding to the known visual field label of the training visual field image (1 or 0; 1 indicates the presence of a flaw or cancer cells, 0 indicates none), and p3 is the predicted suspicious probability that the predicted visual field label of the training visual field image is y3.
The first loss condition may be that the first loss value is less than or equal to a first loss threshold, which may be set according to the actual situation, for example 0.05, 0.1 or 0.15. Likewise, the second loss condition may be that the second loss value is less than or equal to a second loss threshold, and the third loss condition that the third loss value is less than or equal to a third loss threshold, both thresholds also being settable according to the actual situation (for example 0.05, 0.1 or 0.15). When the first loss value satisfies the first loss condition, the second loss value satisfies the second loss condition, and the third loss value satisfies the third loss condition, training stops and the target neural network model is obtained.
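A small worked sketch (the probability values and the 0.1 thresholds are illustrative) of computing the three cross entropy loss values above and testing the convergence condition:

import math

def bce(y: float, p: float, eps: float = 1e-7) -> float:
    # L = -[y*log(p) + (1-y)*log(1-p)], with p clipped to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

# Example: a positive slide (y1 = y2 = 1) and a positive labeled field (y3 = 1).
l1 = bce(1.0, 0.93)  # field constraint loss of the training sample image
l2 = bce(1.0, 0.95)  # classification loss of the training sample image
l3 = bce(1.0, 0.91)  # classification loss of the training field image

converged = l1 <= 0.1 and l2 <= 0.1 and l3 <= 0.1
print(l1, l2, l3, converged)  # ~0.073, ~0.051, ~0.094, True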
Optionally, the predicting the M suspicious field-of-view images through a target full-film classification model in the target neural network model to obtain a predicted image tag of the target image includes: extracting features of the M suspicious view images through the target view classification model in the target neural network model to obtain M feature vectors; and inputting the feature average vectors of the M feature vectors into a classifier in the target full-film classification model, and classifying the feature average vectors through the classifier to obtain a predicted image label of the target image.
As an optional implementation, the input of the target full-scale classification model is the M suspicious visual field images extracted from the target image, i.e., the visual field images whose predicted suspicious probability in the target image is greater than or equal to the preset probability threshold. For each suspicious visual field image, the backbone of the target visual field classification model extracts a feature vector; the extracted feature vectors are then fed into a transformer block, which comprises multi-head self-attention and an MLP (multi-layer perceptron), to model the features of each suspicious visual field image and obtain new features. The feature average vector over all suspicious visual field images is taken and fed into a classifier to obtain the predicted image label of the target image. The transformer block described above may be replaced by other models such as a GNN.
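A minimal sketch of such a head, using PyTorch's built-in transformer encoder layer to stand in for the transformer block (the class name and dimensions are illustrative; 2048 matches ResNet50 features):

import torch
import torch.nn as nn

class FullSlideHead(nn.Module):
    def __init__(self, dim: int = 2048, heads: int = 8, classes: int = 2):
        super().__init__()
        # One transformer encoder layer = multi-head self-attention + MLP.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.classifier = nn.Linear(dim, classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (1, M, dim) feature vectors of the M suspicious fields.
        feats = self.block(feats)          # model interactions between fields
        mean_feat = feats.mean(dim=1)      # feature average vector
        return self.classifier(mean_feat)  # predicted image label logits

head = FullSlideHead()
logits = head(torch.randn(1, 32, 2048))  # M = 32 suspicious fields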
Optionally, the determining M suspicious field images from the target images through a target field classification model in the target neural network model includes: dividing the target image into N view images by the target view classification model in the target neural network model, wherein N is greater than or equal to M; and determining the M suspicious field images from the N field images through the target field classification model, wherein the M suspicious field images are field images with the predicted suspicious probability larger than or equal to the preset probability threshold value in the N field images.
As an alternative embodiment, the present application may be applied to a variety of fields such as product flaw detection, cancer cell detection, and the like.
In product flaw detection, the product to be detected is first photographed to obtain a target image, and the target image is then predicted by the trained target neural network model, so that flawless visual field images are excluded and flawed visual field images are screened out. The method can improve the efficiency of product flaw detection.
In a cancer cell detection scene, firstly, diseased part cells are scanned into WSI, then, a trained target neural network model is used for predicting the whole WSI slice, so that negative slices are eliminated, positive slices are screened out, and positive cell areas are provided for a pathologist to make final diagnosis. The method can assist a pathologist in cancer cytological diagnosis, reduce the workload of the doctor and improve the efficiency.
In summary, only a small number of training visual field images are needed to train a weak visual field classification model, which reduces the dependence on finely annotated data; when the full-scale classification model is trained, a joint training method is provided to compensate for the deficiencies of the visual field classification model; the full-scale classification model considers the interaction between visual fields; and when the full-scale classification model is trained, the model performance is further improved by the constraint on the unannotated visual fields in the training sample image.
It will be appreciated that in the specific embodiments of the present application, related data such as user information is referred to, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
According to another aspect of the embodiments of the present application, there is also provided an image recognition apparatus for implementing the above image recognition method. As shown in fig. 9, the apparatus includes: the acquiring module 92 is configured to acquire a target image, where the target image is an image to be identified obtained by scanning a target object; the input module 94 is configured to input the target image into a target neural network model, determine M suspicious field images from the target image through a target field classification model in the target neural network model, and predict the M suspicious field images through a target full-scale classification model in the target neural network model to obtain a predicted image tag of the target image, where M is greater than or equal to 1, and the M suspicious field images are field images in the target image, where the predicted suspicious probability is greater than or equal to a preset probability threshold; the target neural network model is a model obtained by training a neural network model to be trained by using a training sample image and a training visual field image until the following convergence condition is met: and a first loss condition is met between a known image label of the training sample image and a predicted view label of a target suspicious view image, wherein the target suspicious view image is a suspicious view image with the maximum predicted suspicious probability in the training sample image determined by a to-be-trained view classification model in the to-be-trained neural network model.
Optionally, the above convergence condition further includes: a second loss condition is met between the known image label of the training sample image and the predicted image label of the training sample image determined by the to-be-trained full-scale classification model in the to-be-trained neural network model, and a third loss condition is met between the known view label of the training view image and the predicted view label of the training view image determined by the to-be-trained view classification model in the to-be-trained neural network model.
Optionally, the apparatus is further configured to acquire the training sample image and the training field of view image before the inputting the target image into the target neural network model; and carrying out multi-round combined training on the to-be-trained visual field classification model and the to-be-trained full-scale classification model in the to-be-trained neural network model through the training sample image and the training visual field image to obtain the target neural network model, wherein in the training process, if the to-be-trained neural network model does not meet the convergence condition, parameters in the to-be-trained visual field classification model and the to-be-trained full-scale classification model are adjusted, if the to-be-trained neural network model meets the convergence condition, training is finished, the to-be-trained neural network model when training is finished is determined to be the target neural network model, and the to-be-trained visual field classification model and the to-be-trained full-scale classification model when training is finished are respectively determined to be the target visual field classification model and the target full-scale classification model in the target neural network model.
Optionally, the device is further configured to perform an i-th round of joint training on the to-be-trained visual field classification model and the to-be-trained full-scale classification model in the to-be-trained neural network model, where i is a positive integer greater than or equal to 1, and the visual field classification model and the full-scale classification model obtained by the 0-th round of training are the untrained to-be-trained visual field classification model and the to-be-trained full-scale classification model in the to-be-trained neural network model, and the device includes: inputting the training visual field image into a visual field classification model obtained by the i-1 th round of training to obtain a predicted visual field label of the training visual field image determined by the i-th round of training; inputting the training sample image into a visual field classification model obtained by the i-1 th round of training and a full-scale classification model obtained by the i-1 th round of training to obtain a predicted visual field label of the target suspicious visual field image determined by the i-th round of training and a predicted image label of the training sample image determined by the i-th round of training; and ending training when the predicted view label of the training view image determined by the ith training, the predicted view label of the target suspicious view image determined by the ith training and the predicted image label of the training sample image determined by the ith training meet the convergence condition, and determining the view classification model and the full-scale classification model obtained by the i-1 training as the target view classification model and the target full-scale classification model in the target neural network model respectively.
Optionally, the device is further configured to input the training sample image into a field classification model obtained by the i-1 th round of training, obtain a predicted field label of the target suspicious field image determined by the i-1 th round of training, and S suspicious training field images determined by the i-th round of training, where S is greater than or equal to 1, the S suspicious training field images are field images in which a predicted suspicious probability in the training sample image is greater than or equal to the preset probability threshold, and the target suspicious field image is a training field image in which a predicted suspicious probability in the training sample image is the largest; and inputting the S suspicious training visual field images into a full-film classification model obtained by the i-1 th round training to obtain a predicted image label of the training sample image determined by the i-1 th round training.
Optionally, the device is further configured to input the training sample image into a field-of-view classification model obtained by the i-1 th round of training, and divide the training sample image into N training field-of-view images through the field-of-view classification model obtained by the i-1 th round of training; determining training visual field images with predicted suspicious probability larger than or equal to the preset probability threshold value from the N training visual field images through the visual field classification model obtained by the i-1 th round of training, and obtaining S suspicious training visual field images determined by the i-1 th round of training, wherein N is larger than or equal to S; and determining a training visual field image with the largest predicted suspicious probability in the N training visual field images as the target suspicious visual field image through a visual field classification model obtained through the i-1 th round training, and determining a label corresponding to the predicted suspicious probability of the target suspicious visual field image as a predicted visual field label of the target suspicious visual field image.
Optionally, in the ith round of joint training, the device is further configured to input the predicted suspicious probability corresponding to the predicted field label of the target suspicious field image determined by the ith training and the known probability corresponding to the known image label of the training sample image into a first loss function, so as to obtain a first loss value; inputting the predicted suspicious probability corresponding to the predicted image label of the training sample image and the known probability corresponding to the known image label of the training sample image determined by the ith training into a second loss function to obtain a second loss value; inputting the predicted suspicious probability corresponding to the predicted visual field label of the training visual field image and the known probability corresponding to the known visual field label of the training visual field image determined by the ith training into a third loss function to obtain a third loss value; determining whether the first loss value satisfies the first loss condition, whether the second loss value satisfies the second loss condition, and whether the third loss value satisfies the third loss condition; and determining that the predicted view label of the training view image determined by the ith training, the predicted view label of the target suspicious view image determined by the ith training and the predicted image label of the training sample image determined by the ith training meet the convergence condition when the first loss value meets the first loss condition, the second loss value meets the second loss condition and the third loss value meets the third loss condition.
Optionally, the device is further configured to perform feature extraction on the M suspicious field images through the target field classification model in the target neural network model, so as to obtain M feature vectors; and inputting the feature average vectors of the M feature vectors into a classifier in the target full-film classification model, and classifying the feature average vectors through the classifier to obtain a predicted image label of the target image.
Optionally, the device is further configured to segment the target image into N view images by using the target view classification model in the target neural network model, where N is greater than or equal to M; and determining the M suspicious field images from the N field images through the target field classification model, wherein the M suspicious field images are field images with the predicted suspicious probability larger than or equal to the preset probability threshold value in the N field images.
According to one aspect of the present application, a computer program product is provided, comprising a computer program/instructions containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit 1001, it performs the various functions provided by the embodiments of the present application.
The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.
Fig. 10 schematically shows a block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
It should be noted that, the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a central processing unit 1001 (Central Processing Unit, CPU), which can execute various appropriate actions and processes according to a program stored in a read-only memory 1002 (Read-Only Memory, ROM) or a program loaded from a storage section 1008 into a random access memory 1003 (Random Access Memory, RAM). In the random access memory 1003, various programs and data necessary for system operation are also stored. The CPU 1001, the ROM 1002 and the RAM 1003 are connected to each other via a bus 1004. An input/output interface 1005 (i.e., an I/O interface) is also connected to the bus 1004.
The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a local area network card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the input/output interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. The computer programs, when executed by the central processor 1001, perform the various functions defined in the system of the present application.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device for implementing the above image recognition method, where the electronic device may be a terminal device or a server as shown in fig. 1. The present embodiment is described taking the electronic device as a terminal device as an example. As shown in fig. 11, the electronic device comprises a memory 1102 and a processor 1104, the memory 1102 having stored therein a computer program, the processor 1104 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, acquiring a target image, wherein the target image is an image to be identified obtained by scanning a target object;
s2, inputting the target image into a target neural network model, determining M suspicious view images from the target image through a target view classification model in the target neural network model, and predicting the M suspicious view images through a target full-scale classification model in the target neural network model to obtain a predicted image label of the target image, wherein M is greater than or equal to 1, and the M suspicious view images are view images with predicted suspicious probability greater than or equal to a preset probability threshold value in the target image;
the target neural network model is a model obtained by training a neural network model to be trained by using a training sample image and a training visual field image until the following convergence condition is met: and a first loss condition is met between a known image label of the training sample image and a predicted view label of a target suspicious view image, wherein the target suspicious view image is a suspicious view image with the largest predicted suspicious probability in the training sample image determined by the to-be-trained view classification model.
Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 11 is only schematic, and the electronic device may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, and the like. Fig. 11 does not limit the structure of the above electronic device. For example, the electronic device may also include more or fewer components (such as a network interface) than shown in fig. 11, or have a configuration different from that shown in fig. 11.
The memory 1102 may be used to store software programs and modules, such as the program instructions/modules corresponding to the image recognition method and apparatus in the embodiments of the present application; the processor 1104 executes the software programs and modules stored in the memory 1102 to perform various functional applications and data processing, that is, to implement the image recognition method described above. The memory 1102 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1102 may further include memory located remotely from the processor 1104, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1102 may be used to store information such as the target image, but is not limited thereto. As an example, as shown in fig. 11, the memory 1102 may include, but is not limited to, the acquisition module 92 and the input module 94 of the above image recognition apparatus. In addition, other module units of the above image recognition apparatus may also be included, but are not limited to, and are not described in detail in this example.
Optionally, the transmission device 1106 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 1106 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1106 is a Radio Frequency (RF) module for communicating wirelessly with the internet.
In addition, the electronic device further includes: a display 1108 for displaying the target image; and a connection bus 1110 for connecting the respective module parts in the above-described electronic apparatus.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. Among them, the nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the Peer-To-Peer network.
According to one aspect of the present application, there is provided a computer-readable storage medium, from which a processor of a computer device reads the computer instructions, the processor executing the computer instructions, so that the computer device performs the image recognition method provided in the above-mentioned various alternative implementations.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for performing the steps of:
s1, acquiring a target image, wherein the target image is an image to be identified obtained by scanning a target object;
s2, inputting the target image into a target neural network model, determining M suspicious view images from the target image through a target view classification model in the target neural network model, and predicting the M suspicious view images through a target full-scale classification model in the target neural network model to obtain a predicted image label of the target image, wherein M is greater than or equal to 1, and the M suspicious view images are view images with predicted suspicious probability greater than or equal to a preset probability threshold value in the target image;
The target neural network model is a model obtained by training a neural network model to be trained by using a training sample image and a training visual field image until the following convergence condition is met: and a first loss condition is met between a known image label of the training sample image and a predicted view label of a target suspicious view image, wherein the target suspicious view image is a suspicious view image with the largest predicted suspicious probability in the training sample image determined by the to-be-trained view classification model.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and there may be another division manner in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application and are intended to be comprehended within the scope of the present application.

Claims (12)

1. An image recognition method, comprising:
acquiring a target image, wherein the target image is an image to be identified obtained by scanning a target object;
inputting the target image into a target neural network model, and determining M suspicious view images from the target image through a target view classification model in the target neural network model; extracting features of the M suspicious view images through the target view classification model in the target neural network model to obtain M feature vectors; inputting the feature average vectors of the M feature vectors into a classifier in a target full-film classification model, classifying the feature average vectors through the classifier to obtain a predicted image label of the target image, wherein M is greater than or equal to 1, and the M suspicious field images are field images with predicted suspicious probabilities greater than or equal to a preset probability threshold in the target image;
The target neural network model is a model obtained by training a neural network model to be trained by using a training sample image and a training visual field image until the following convergence condition is met: and a first loss condition is met between a known image label of the training sample image and a predicted view label of a target suspicious view image, wherein the target suspicious view image is a suspicious view image with the maximum predicted suspicious probability in the training sample image determined by a to-be-trained view classification model in the to-be-trained neural network model.
2. The method of claim 1, wherein the convergence condition further comprises:
a second loss condition is satisfied between a known image tag of the training sample image and a predicted image tag of the training sample image determined by a full-scale classification model to be trained in the neural network model to be trained, and a third loss condition is satisfied between a known view tag of the training view image and a predicted view tag of the training view image determined by the visual field classification model to be trained in the neural network model to be trained.
3. The method of claim 2, wherein prior to said inputting the target image into a target neural network model, the method further comprises:
Acquiring the training sample image and the training visual field image;
and carrying out multi-round combined training on the to-be-trained visual field classification model and the to-be-trained full-scale classification model in the to-be-trained neural network model through the training sample image and the training visual field image to obtain the target neural network model, wherein in the training process, if the to-be-trained neural network model does not meet the convergence condition, parameters in the to-be-trained visual field classification model and the to-be-trained full-scale classification model are adjusted, if the to-be-trained neural network model meets the convergence condition, training is finished, the to-be-trained neural network model when training is finished is determined to be the target neural network model, and the to-be-trained visual field classification model and the to-be-trained full-scale classification model when training is finished are respectively determined to be the target visual field classification model and the target full-scale classification model in the target neural network model.
4. A method according to claim 3, wherein the multi-round joint training of the vision classification model to be trained and the full-scale classification model to be trained in the neural network model to be trained by the training sample image and the training vision image comprises:
Performing an i-th round of combined training on the to-be-trained visual field classification model and the to-be-trained full-sheet classification model in the to-be-trained neural network model, wherein i is a positive integer greater than or equal to 1, and the visual field classification model and the full-sheet classification model obtained by the 0-th round training are the untrained to-be-trained visual field classification model and the to-be-trained full-sheet classification model in the to-be-trained neural network model, and the method comprises the following steps:
inputting the training visual field image into a visual field classification model obtained by the i-1 th round of training to obtain a predicted visual field label of the training visual field image determined by the i-th round of training;
inputting the training sample image into a visual field classification model obtained by the i-1 th round training and a full-scale classification model obtained by the i-1 th round training to obtain a predicted visual field label of the target suspicious visual field image determined by the i-th round training and a predicted image label of the training sample image determined by the i-th round training;
and ending training when the predicted view label of the training view image determined by the ith training, the predicted view label of the target suspicious view image determined by the ith training and the predicted image label of the training sample image determined by the ith training meet the convergence condition, and determining the view classification model and the full-scale classification model obtained by the i-1 training as the target view classification model and the target full-scale classification model in the target neural network model respectively.
5. The method according to claim 4, wherein the inputting the training sample image into the field-of-view classification model obtained by the i-1 th round training and the full-segment classification model obtained by the i-1 th round training, obtaining a predicted field-of-view label of the target suspicious field-of-view image determined by the i-th round training, and a predicted image label of the training sample image determined by the i-th round training, includes:
inputting the training sample image into a visual field classification model obtained by the i-1 th round training to obtain a predicted visual field label of the target suspicious visual field image determined by the i-1 th round training and S suspicious training visual field images determined by the i-th round training, wherein S is greater than or equal to 1, the S suspicious training visual field images are visual field images with the predicted suspicious probability greater than or equal to the preset probability threshold in the training sample image, and the target suspicious visual field images are training visual field images with the maximum predicted suspicious probability in the training sample image;
and inputting the S suspicious training visual field images into a full-film classification model obtained by the i-1 th round training to obtain a predicted image label of the training sample image determined by the i-1 th round training.
6. The method of claim 5, wherein inputting the training sample image into the field of view classification model obtained by the i-1 th training to obtain a predicted field of view tag of the target suspicious field of view image determined by the i-th training, and the S suspicious training field of view images determined by the i-th training, comprises:
inputting the training sample image into a visual field classification model obtained by the i-1 th round training, and dividing the training sample image into N training visual field images through the visual field classification model obtained by the i-1 th round training;
determining training visual field images with predicted suspicious probability larger than or equal to the preset probability threshold value from the N training visual field images through the visual field classification model obtained by the i-1 th round of training, and obtaining S suspicious training visual field images determined by the i-1 th round of training, wherein N is larger than or equal to S;
and determining a training visual field image with the largest predicted suspicious probability in the N training visual field images as the target suspicious visual field image through a visual field classification model obtained through the i-1 th round training, and determining a label corresponding to the predicted suspicious probability of the target suspicious visual field image as a predicted visual field label of the target suspicious visual field image.
7. The method according to any one of claims 4 to 6, wherein in the ith round of joint training, the method further comprises:
inputting the predicted suspicious probability corresponding to the predicted visual field label of the target suspicious visual field image determined by the ith training and the known probability corresponding to the known image label of the training sample image into a first loss function to obtain a first loss value;
inputting the predicted suspicious probability corresponding to the predicted image label of the training sample image and the known probability corresponding to the known image label of the training sample image determined by the ith training into a second loss function to obtain a second loss value;
inputting the predicted suspicious probability corresponding to the predicted visual field label of the training visual field image and the known probability corresponding to the known visual field label of the training visual field image determined by the ith training into a third loss function to obtain a third loss value;
determining whether the first loss value satisfies the first loss condition, whether the second loss value satisfies the second loss condition, and whether the third loss value satisfies the third loss condition;
and determining that the predicted view label of the training view image determined by the ith training, the predicted view label of the target suspicious view image determined by the ith training and the predicted image label of the training sample image determined by the ith training meet the convergence condition when the first loss value meets the first loss condition, the second loss value meets the second loss condition and the third loss value meets the third loss condition.
8. The method of claim 1, wherein the determining M suspicious field of view images from the target image by a target field of view classification model in the target neural network model comprises:
dividing the target image into N view images by the target view classification model in the target neural network model, wherein N is greater than or equal to M;
and determining the M suspicious field images from the N field images through the target field classification model, wherein the M suspicious field images are field images with the predicted suspicious probability larger than or equal to the preset probability threshold value in the N field images.
9. An image recognition apparatus, comprising:
the acquisition module is used for acquiring a target image, wherein the target image is an image to be identified obtained by scanning a target object;
the input module is used for inputting the target image into a target neural network model, and determining M suspicious view images from the target image through a target view classification model in the target neural network model; extracting features of the M suspicious view images through the target view classification model in the target neural network model to obtain M feature vectors; inputting the feature average vectors of the M feature vectors into a classifier in a target full-film classification model, classifying the feature average vectors through the classifier to obtain a predicted image label of the target image, wherein M is greater than or equal to 1, and the M suspicious field images are field images with predicted suspicious probabilities greater than or equal to a preset probability threshold in the target image;
The target neural network model is a model obtained by training a neural network model to be trained by using a training sample image and a training visual field image until the following convergence condition is met: and a first loss condition is met between a known image label of the training sample image and a predicted view label of a target suspicious view image, wherein the target suspicious view image is a suspicious view image with the maximum predicted suspicious probability in the training sample image determined by a to-be-trained view classification model in the to-be-trained neural network model.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program is executable by a terminal device or a computer to perform the method of any one of claims 1 to 8.
11. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
12. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 8 by means of the computer program.
CN202210468883.3A 2022-04-29 2022-04-29 Image recognition method and device, storage medium and electronic equipment Active CN115130543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210468883.3A CN115130543B (en) 2022-04-29 2022-04-29 Image recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210468883.3A CN115130543B (en) 2022-04-29 2022-04-29 Image recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN115130543A CN115130543A (en) 2022-09-30
CN115130543B true CN115130543B (en) 2024-04-12

Family

ID=83375854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210468883.3A Active CN115130543B (en) 2022-04-29 2022-04-29 Image recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115130543B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385443B (en) * 2023-06-06 2023-08-11 珠海横琴圣澳云智科技有限公司 Image-based sample quality determination method and device
CN118365990B (en) * 2024-06-19 2024-08-30 浙江啄云智能科技有限公司 Model training method and device applied to contraband detection and electronic equipment
CN118657773B (en) * 2024-08-20 2024-10-25 浙江啄云智能科技有限公司 Training method, training device, training equipment, training medium and training product for contraband detection and training model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796199A (en) * 2019-10-30 2020-02-14 腾讯科技(深圳)有限公司 Image processing method and device and electronic medical equipment
CN110853022A (en) * 2019-11-14 2020-02-28 腾讯科技(深圳)有限公司 Pathological section image processing method, device and system and storage medium


Also Published As

Publication number Publication date
CN115130543A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
CN115130543B (en) Image recognition method and device, storage medium and electronic equipment
CN108230296B (en) Image feature recognition method and device, storage medium and electronic device
CN109670532B (en) Method, device and system for identifying abnormality of biological organ tissue image
Raj et al. Fundus image quality assessment: survey, challenges, and future scope
CN110689025B (en) Image recognition method, device and system and endoscope image recognition method and device
WO2021196632A1 (en) Intelligent analysis system and method for panoramic digital pathological image
CN109360028B (en) Method and device for pushing information
JP2022537866A (en) Image classification method, image classification device, image processing method, medical electronic device, image classification device, and computer program
Jia et al. A study on automated segmentation of blood regions in wireless capsule endoscopy images using fully convolutional networks
CN105469376B (en) The method and apparatus for determining picture similarity
CN108231190B (en) Method of processing image, neural network system, device, and medium
US11501431B2 (en) Image processing method and apparatus and neural network model training method
CN113261012B (en) Method, device and system for processing image
CN110458829A (en) Image quality control method, device, equipment and storage medium based on artificial intelligence
CA3072108A1 (en) Few-shot learning based image recognition of whole slide image at tissue level
US20240112329A1 (en) Distinguishing a Disease State from a Non-Disease State in an Image
US20240249515A1 (en) Method and apparatus for training image recognition model, device, and medium
CN117408946A (en) Training method of image processing model and image processing method
Cai et al. Identifying architectural distortion in mammogram images via a se-densenet model and twice transfer learning
CN113344028A (en) Breast ultrasound sequence image classification method and device
Vasudeva et al. Classifying Skin Cancer and Acne using CNN
CN104766068A (en) Random walk tongue image extraction method based on multi-rule fusion
Bhuvaneswari et al. Contrast enhancement of retinal images using green plan masking and whale optimization algorithm
CN117058467B (en) Gastrointestinal tract lesion type identification method and system
CN116993680A (en) Image processing method and training method of image processing model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant