
CN115713769A - Training method and device of text detection model, computer equipment and storage medium - Google Patents

Training method and device of text detection model, computer equipment and storage medium

Info

Publication number
CN115713769A
CN115713769A
Authority
CN
China
Prior art keywords
image sample
loss
pixel
model
prediction result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211423572.1A
Other languages
Chinese (zh)
Inventor
陈鹏宇
张斌
赵逸如
张玉琦
李捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pudong Development Bank Co Ltd
Priority to CN202211423572.1A
Publication of CN115713769A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a training method and apparatus for a text detection model, and to a computer device, a storage medium, and a computer program product. The method comprises the following steps: acquiring an image sample set, wherein the image sample set comprises at least one image sample; acquiring a first feature map and a first prediction result of an image sample through a reference model; acquiring a second feature map and a second prediction result of the image sample through a model to be trained; determining a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining a truth loss of the image sample according to the second prediction result and the annotation result of the image sample; and training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain a text detection model. The method can improve the precision of text detection.

Description

Training method and device of text detection model, computer equipment and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method and an apparatus for training a text detection model, a computer device, a storage medium, and a computer program product.
Background
With the development of computer vision technology, financial scenarios place high demands on the precision of text detection. Existing text detection methods usually obtain a text detection model by training a model to be trained on image samples. However, such training methods suffer from low detection precision when detecting long text in images.
Disclosure of Invention
Based on this, in view of the problem that conventional text detection models have low detection precision, it is necessary to provide a training method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product for a text detection model that can improve its detection precision.
In a first aspect, the present application provides a method for training a text detection model. The method comprises the following steps:
acquiring an image sample set, wherein the image sample set comprises at least one image sample;
acquiring a first feature map and a first prediction result of the image sample through a reference model, wherein the first prediction result represents the prediction, by the reference model, of the category to which each pixel of the image sample belongs;
acquiring a second feature map and a second prediction result of the image sample through a model to be trained, wherein the second prediction result represents the prediction, by the model to be trained, of the category to which each pixel of the image sample belongs;
determining similarity loss of the image sample based on the first feature map and the second feature map; determining pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample;
and training the model to be trained based on the similarity loss, the pixel level loss and the truth value loss to obtain a text detection model.
In one embodiment, determining the similarity loss of the image sample based on the first feature map and the second feature map comprises:
obtaining the correlations among the pixels in the first feature map and the correlations among the pixels in the second feature map;
and determining the similarity loss of the image sample based on the correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map.
In one embodiment, the first feature map and the second feature map are equal in size;
determining the similarity loss of the image sample based on the correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map, including:
acquiring the width and the height of the first feature map;
determining a first similarity map according to the correlations among the pixels in the first feature map;
determining a second similarity map according to the correlations among the pixels in the second feature map;
for each pixel in the first similarity map, obtaining the square of the difference between the current pixel and the corresponding pixel in the second similarity map;
summing the squares of the differences corresponding to each pixel in the first similarity map to obtain a summation result;
based on the summation result, the width and height of the first feature map, a similarity loss of the image sample is determined.
In one embodiment, the first prediction result further comprises a first probability map and a first approximate binary map, wherein the first probability map represents the probability, predicted by the reference model, that each pixel of the image sample belongs to text, and the first approximate binary map represents the binary classification result, predicted by the reference model, of whether each pixel of the image sample is text or non-text. The second prediction result further comprises a second probability map and a second approximate binary map, wherein the second probability map represents the probability, predicted by the model to be trained, that each pixel of the image sample belongs to text, and the second approximate binary map represents the binary classification result, predicted by the model to be trained, of whether each pixel of the image sample is text or non-text. Determining the pixel-level loss of the image sample based on the first prediction result and the second prediction result comprises:
dilating the first probability map with a dilation function to obtain a first dilation result;
dilating the first approximate binary map with the dilation function to obtain a second dilation result;
and determining the pixel-level loss of the image sample according to the first dilation result, the second dilation result, the second probability map, and the second approximate binary map.
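As a hedged illustration of these steps, the sketch below implements the "expansion" of the teacher's maps as a morphological (max) dilation and then computes a simple mean-squared pixel-level loss against the student's maps. The 3×3 window, edge padding, and MSE pairing are assumptions for illustration, not details specified in the patent.

```python
import numpy as np

def dilate(prob_map: np.ndarray, k: int = 3) -> np.ndarray:
    """Morphological (max) dilation with a k x k window, edge-padded.
    The window size k = 3 is an illustrative assumption."""
    pad = k // 2
    padded = np.pad(prob_map, pad, mode="edge")
    h, w = prob_map.shape
    out = np.empty_like(prob_map, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def pixel_level_loss(teacher_prob, teacher_binary, student_prob, student_binary):
    """Mean squared error between the student's maps and the dilated
    teacher maps (MSE pairing is an assumption, not the patent's formula)."""
    d_prob = dilate(teacher_prob)    # first dilation ("expansion") result
    d_bin = dilate(teacher_binary)   # second dilation ("expansion") result
    return float(np.mean((student_prob - d_prob) ** 2) +
                 np.mean((student_binary - d_bin) ** 2))
```

The loss is zero exactly when the student's probability and binary maps match the dilated teacher maps.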
In one embodiment, the first prediction result comprises a first threshold map, which represents the probability, predicted by the reference model, that each pixel of the image sample belongs to a text boundary; determining the truth loss of the image sample according to the second prediction result and the annotation result of the image sample includes:
determining a first truth loss based on the first probability map and the annotation result of the image sample;
determining a second truth loss based on the first threshold map and the annotation result of the image sample;
determining a third truth loss based on the first approximate binary map and the annotation result of the image sample;
and determining the truth loss of the image sample according to the first, second, and third truth losses.
In one embodiment, training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain a text detection model includes:
determining a target loss based on the similarity loss, the pixel-level loss, and the truth loss;
adjusting the parameters of the model to be trained according to the target loss;
and continuing training with the adjusted model until a training stop condition is reached, taking the model obtained when training stops as the text detection model.
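The steps above can be sketched as follows. The weighted-sum form of the target loss, the equal default weights, and the loss-threshold stop condition are illustrative assumptions; the patent does not specify how the three losses are combined.

```python
def target_loss(similarity_loss, pixel_loss, truth_loss,
                w_sim=1.0, w_pix=1.0, w_truth=1.0):
    """Combine the three losses into a single target loss.
    The weighted sum and equal default weights are assumptions."""
    return w_sim * similarity_loss + w_pix * pixel_loss + w_truth * truth_loss

def train_until_stop(step_fn, max_iters=1000, tol=1e-3):
    """Minimal training skeleton: step_fn performs one parameter update and
    returns (similarity_loss, pixel_loss, truth_loss); training stops when
    the target loss falls below tol or max_iters is reached."""
    loss = float("inf")
    for _ in range(max_iters):
        loss = target_loss(*step_fn())
        if loss < tol:
            break
    return loss
```

In practice `step_fn` would run one optimizer step over the model to be trained; here it is left abstract.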
In one embodiment, the training method of the text detection model further includes:
acquiring an image to be detected;
inputting the image to be detected into a text detection model to obtain a prediction result of the category to which each pixel on the image to be detected belongs;
and determining and marking a text region in the image to be detected according to the prediction result of the category to which each pixel on the image to be detected belongs.
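A minimal sketch of this inference step, assuming the prediction result is a per-pixel text-probability map. The 0.3 binarization threshold and the single bounding box (rather than per-region polygons) are simplifying assumptions.

```python
import numpy as np

def mark_text_region(prob_map, threshold=0.3):
    """Binarize the per-pixel text probabilities and return the bounding box
    (x_min, y_min, x_max, y_max) of the detected text pixels, or None if no
    pixel exceeds the threshold. The threshold value is illustrative."""
    ys, xs = np.nonzero(prob_map >= threshold)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```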
In a second aspect, the application further provides a training device for the text detection model. The device comprises:
the system comprises a sample acquisition module, a data processing module and a data processing module, wherein the sample acquisition module is used for acquiring an image sample set, and the image sample set comprises at least one image sample;
the first result obtaining module is used for obtaining a first feature map and a first prediction result of the image sample through the reference model, and the first prediction result represents the prediction result of the reference model on the category of each pixel on the image sample;
the second result obtaining module is used for obtaining a second characteristic diagram and a second prediction result of the image sample through the model to be trained, and the second prediction result represents the prediction result of the model to be trained on the category of each pixel on the image sample;
the loss determining module is used for determining the similarity loss of the image sample based on the first feature map and the second feature map; determining pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample;
and the model training module is used for training the model to be trained based on the similarity loss, the pixel level loss and the truth value loss to obtain the text detection model.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring an image sample set, wherein the image sample set comprises at least one image sample;
acquiring a first feature map and a first prediction result of an image sample through a reference model;
acquiring a second feature map and a second prediction result of the image sample through the model to be trained;
determining similarity loss of the image sample based on the first feature map and the second feature map;
determining pixel-level loss of the image sample according to the first prediction result and the second prediction result;
determining the truth loss of the image sample according to the second prediction result and the annotation result of the image sample;
and training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain a text detection model.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring an image sample set, wherein the image sample set comprises at least one image sample;
acquiring a first feature map and a first prediction result of the image sample through the reference model, wherein the first prediction result represents the prediction, by the reference model, of the category to which each pixel of the image sample belongs;
acquiring a second feature map and a second prediction result of the image sample through the model to be trained, wherein the second prediction result represents the prediction, by the model to be trained, of the category to which each pixel of the image sample belongs;
determining a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining a truth loss of the image sample according to the second prediction result and the annotation result of the image sample;
and training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain a text detection model.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring an image sample set, wherein the image sample set comprises at least one image sample;
acquiring a first feature map and a first prediction result of the image sample through the reference model, wherein the first prediction result represents the prediction, by the reference model, of the category to which each pixel of the image sample belongs;
acquiring a second feature map and a second prediction result of the image sample through the model to be trained, wherein the second prediction result represents the prediction, by the model to be trained, of the category to which each pixel of the image sample belongs;
determining a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining a truth loss of the image sample according to the second prediction result and the annotation result of the image sample;
and training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain a text detection model.
According to the above training method, apparatus, computer device, storage medium, and computer program product for a text detection model, a first feature map and a first prediction result of each image sample in the image sample set are obtained through the reference model, and a second feature map and a second prediction result are obtained through the model to be trained. The similarity loss of the image sample is determined based on the first and second feature maps; the pixel-level loss is determined according to the first and second prediction results; and the truth loss is determined according to the second prediction result and the annotation result of the image sample. The model to be trained is then trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain the text detection model. By combining the feature maps and prediction results of the reference model and the model to be trained to obtain the similarity loss, the pixel-level loss, and the truth loss, and training on these multiple losses, the method can improve the detection precision of the text detection model compared with direct training.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a method for training a text detection model may be implemented;
FIG. 2 is a flowchart illustrating a method for training a text detection model in one embodiment;
FIG. 3 is a flowchart illustrating a method for training a text detection model in another embodiment;
FIG. 4 is a schematic sub-flow chart of S204 in one embodiment;
FIG. 5 is a schematic view of a sub-flow of S204 in another embodiment;
FIG. 6 is a flowchart illustrating a general method for training a text detection model in one embodiment;
FIG. 7 is a block diagram of an apparatus for training a text detection model in accordance with an embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The training method of the text detection model provided by the embodiments of the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on the cloud or another network server. The training method may be executed by the terminal 102 or the server 104 alone, or by the terminal 102 and the server 104 in cooperation. Taking independent execution by the terminal 102 as an example: the terminal 102 acquires an image sample set comprising at least one image sample; acquires a first feature map and a first prediction result of an image sample through the reference model, the first prediction result representing the prediction, by the reference model, of the category of each pixel of the image sample; acquires a second feature map and a second prediction result through the model to be trained, the second prediction result representing the prediction, by the model to be trained, of the category of each pixel of the image sample; determines the similarity loss of the image sample based on the first and second feature maps; determines the pixel-level loss according to the first and second prediction results; determines the truth loss according to the second prediction result and the annotation result of the image sample; and trains the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain the text detection model.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In one embodiment, as shown in fig. 2, a method for training a text detection model is provided, which is described by taking the method as an example applied to a computer device (the computer device may be the terminal or the server in fig. 1), and includes the following steps:
s201, an image sample set is obtained, wherein the image sample set comprises at least one image sample.
Wherein the computer device obtains a sample set of images. The image sample is an image containing textual information. The set of image samples includes at least one image sample. Each sample in the image sample set is used for training of the text detection model.
S202, a first feature map and a first prediction result of the image sample are obtained through the reference model, and the first prediction result represents the prediction result of the reference model on the category of each pixel on the image sample.
Wherein the reference model is a machine learning model. For example, the reference model may be a DBNet (Differentiable Binarization network) model whose backbone network is ResNet50. The DBNet model performs adaptive binarization on each pixel of the image sample: the binarization threshold is learned by the network, so the binarization step is fully incorporated into network training, making the final output more robust to the choice of threshold.
The computer device inputs the image sample into the reference model to obtain a first feature map and a first prediction result of the image sample. The reference model includes a backbone module, a feature fusion and enhancement module (neck), and a prediction module (head). Specifically, the image sample is input into the backbone module for feature extraction to obtain a feature map for each feature layer, and the feature map of the last feature layer is taken as the first feature map. In some embodiments, the feature layers are denoted C2, C3, C4, and C5, and the input is successively down-sampled through them: the C2 layer down-samples 4 times, the C3 layer 8 times, the C4 layer 16 times, and the C5 layer 32 times; the feature map of the C5 layer is taken as the first feature map. The feature maps of all feature layers are concatenated (concat) in the feature fusion and enhancement module to obtain an intermediate feature map, which then enters the prediction module to obtain the first prediction result of the image sample. The first prediction result represents the prediction, by the reference model, of the category of each pixel of the image sample.
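The down-sampling factors described above determine the spatial size of each feature layer; a small helper, assuming exact integer division, illustrates this.

```python
def feature_map_sizes(height, width):
    """Spatial sizes of the C2-C5 feature maps under the 4x/8x/16x/32x
    down-sampling described above (integer division as a simplification)."""
    return {f"C{i}": (height // s, width // s)
            for i, s in zip((2, 3, 4, 5), (4, 8, 16, 32))}
```

For a 640×960 input, for example, the C5 feature map taken as the first feature map would be 20×30.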
S203, a second feature map and a second prediction result of the image sample are obtained through the model to be trained, and the second prediction result represents the prediction result of the model to be trained on the category of each pixel on the image sample.
Wherein the model to be trained is a machine learning model. In some embodiments, the model to be trained may also be a DBNet model whose backbone network is ResNet50, i.e., the model to be trained and the reference model have the same network structure; with the same structure, the precision and recall of the trained text detection model can exceed those of the reference model. In other embodiments, the model to be trained may instead be a DBNet model whose backbone is the lightweight network MobileNetV3 (an efficient model proposed for mobile and embedded devices). With a lightweight backbone, the computation cost and inference speed of the trained text detection model are better than those of the reference model; its precision suffers some loss relative to the reference model, but this loss is smaller than that of directly training a model with the same lightweight structure.
The computer device acquires a second feature map and a second prediction result of the image sample through the model to be trained. Specifically, the computer device inputs the image sample into the backbone module of the model to be trained for feature extraction to obtain a feature map for each feature layer, and takes the feature map of the last feature layer as the second feature map. The feature maps of the feature layers are concatenated in the feature fusion and enhancement module to obtain a second intermediate feature map, which enters the prediction module to obtain the second prediction result of the image sample. The second prediction result represents the prediction, by the model to be trained, of the category of each pixel of the image sample.
S204, determining the similarity loss of the image sample based on the first feature map and the second feature map; determining pixel-level loss of the image sample according to the first prediction result and the second prediction result; and determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample.
Wherein the computer device determines the similarity loss of the image sample based on the first feature map and the second feature map. Specifically, the similarity loss can be calculated from the correlations among the pixels in the first feature map and the correlations among the pixels in the second feature map; the similarity loss characterizes a pairwise loss between pixels of the image sample. The computer device determines the pixel-level loss of the image sample based on the first prediction result and the second prediction result; the pixel-level loss characterizes the loss of each individual pixel. The computer device determines the truth loss of the image sample according to the second prediction result and the annotation result; the truth loss is the original loss used when training the model to be trained on its own. The annotation result is obtained by annotating the text in the image sample with annotation software; it can be an image containing annotation boxes around the text, and its size is equal to that of the image sample.
And S205, training the model to be trained based on the similarity loss, the pixel level loss and the truth value loss to obtain a text detection model.
And in the process of training the model to be trained, adjusting parameters of the model to be trained based on the similarity loss, the pixel level loss and the truth value loss, wherein the model after training is the text detection model.
In the training method of the text detection model, a first feature map and a first prediction result of the image samples in the image sample set are obtained through the reference model, and a second feature map and a second prediction result of the image samples in the image sample set are obtained through the model to be trained. Determining similarity loss of the image sample based on the first feature map and the second feature map; determining pixel-level loss of the image sample according to the first prediction result and the second prediction result; and determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample. And training the model to be trained based on the similarity loss, the pixel level loss and the truth value loss to obtain a text detection model. The training method of the text detection model combines the feature map and the prediction result which respectively correspond to the reference model and the model to be trained to obtain the similarity loss, the pixel level loss and the truth value loss of the image sample, carries out model training based on various losses, and can improve the detection precision of the text detection model compared with direct training.
In one embodiment, determining a similarity loss of the image sample based on the first feature map and the second feature map comprises: acquiring the correlation among all pixels in the first characteristic diagram and the correlation among all pixels in the second characteristic diagram; and determining the similarity loss of the image sample based on the correlation among the pixels in the first feature map and the correlation among the pixels in the second feature map.
Wherein the correlation among pixels refers to the correlation between two adjacent rows of pixels in the feature map (the feature map being a three-dimensional tensor). Exemplarily, a_ij denotes the correlation between two adjacent rows, f_i and f_j denote the pixel vectors of two adjacent rows in the feature map, and ||f_i||_2 and ||f_j||_2 denote the moduli of those two row vectors. The pixel correlation calculation formula is: a_ij = f_i^T · f_j / (||f_i||_2 · ||f_j||_2).
The computer device substitutes the pixel values of the pixels in the first feature map into the pixel correlation formula to obtain the correlations among the pixels in the first feature map, and likewise substitutes the pixel values of the pixels in the second feature map to obtain the correlations among the pixels in the second feature map. The computer device then determines the similarity loss of the image sample based on these two sets of correlations.
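A minimal NumPy sketch of the pixel correlation formula above, computed between each pair of adjacent rows of a feature map with rows flattened to vectors; treating a whole row as the vector f_i is one reading of the formula and is an assumption here.

```python
import numpy as np

def adjacent_row_correlations(fmap):
    """Correlation a_ij = f_i^T . f_j / (||f_i||_2 * ||f_j||_2) between each
    pair of adjacent rows of a feature map (rows flattened to vectors)."""
    rows = fmap.reshape(fmap.shape[0], -1).astype(float)
    out = []
    for i in range(rows.shape[0] - 1):
        f_i, f_j = rows[i], rows[i + 1]
        out.append(float(f_i @ f_j /
                         (np.linalg.norm(f_i) * np.linalg.norm(f_j))))
    return np.array(out)
```

Identical adjacent rows yield a correlation of 1; orthogonal rows yield 0.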
In this embodiment, the similarity loss of the image sample is determined by the correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map, and the detection accuracy of the text detection model is improved by combining the similarity loss in the model training.
In one embodiment, as shown in FIG. 3, the first feature map and the second feature map are equal in size; determining the similarity loss of the image sample based on the correlations between the pixels in the first feature map and the correlations between the pixels in the second feature map includes:
s301, acquiring the width and the height of the first characteristic diagram.
Wherein the computer device obtains the width and height of the first signature. The width refers to the width of the first feature map, and the height refers to the height of the first feature map.
S302, determining a first similarity map according to the correlations among the pixels in the first feature map, and a second similarity map according to the correlations among the pixels in the second feature map.
The computer device takes the obtained correlations among the pixels in the first feature map as image pixel values; the resulting new image is the first similarity map. Likewise, the correlations among the pixels in the second feature map are taken as image pixel values to obtain the second similarity map.
S303, for each pixel in the first similarity map, obtaining the square of the difference between the current pixel and the corresponding pixel in the second similarity map.
Since the first feature map and the second feature map are equal in size, the first similarity map and the second similarity map are also equal in size. For each pixel in the first similarity map, the computer device calculates the square of the difference between the pixel value of the current pixel and the pixel value of the corresponding pixel in the second similarity map, obtaining the squared difference corresponding to the current pixel.
And S304, summing the squares of the differences corresponding to the pixels in the first similarity graph to obtain a summation result.
And the computer equipment sums the squares of the differences corresponding to the pixels in the first similarity graph to obtain a summation result.
And S305, determining the similarity loss of the image sample based on the summation result and the width and the height of the first feature map.
The computer device multiplies the width and the height of the first feature map and squares the product to obtain a first squared result. The summation result is divided by the first squared result, and the quotient is used as the similarity loss of the image sample. Illustratively, the similarity loss calculation formula is as follows:
l_pa(S) = (1 / (W × H)²) · Σ_{i∈R} (a_i^T − a_i^S)²
wherein l_pa(S) represents the similarity loss of the image sample, W represents the width of the first feature map, H represents the height of the first feature map, a_i^T represents the pixel value of the first similarity map at position i, a_i^S represents the pixel value of the second similarity map at position i, and R represents the region of the first similarity map.
The computer device substitutes the width and height of the first feature map, the pixel value of each pixel in the first similarity map and the pixel value of each pixel in the second similarity map into the similarity loss calculation formula to obtain the similarity loss of the image sample.
In this embodiment, the similarity loss of the image sample is determined through the width and the height of the first feature map, the first similarity map and the second similarity map, and the detection accuracy of the text detection model is improved by combining the similarity loss in the model training.
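Steps S301-S305 can be sketched as follows. This is a minimal NumPy illustration, assuming cosine similarity between per-pixel feature vectors as the pixel correlation (the text does not fix a particular correlation formula, so that choice is an assumption):

```python
import numpy as np

def similarity_loss(feat_t: np.ndarray, feat_s: np.ndarray) -> float:
    """Pairwise-similarity loss, a sketch of steps S301-S305.

    feat_t / feat_s: (C, H, W) feature maps from the reference model and the
    model to be trained; they are assumed to be equal in size.
    """
    c, h, w = feat_t.shape
    # Flatten spatial positions: each row is one pixel's C-dim feature vector.
    ft = feat_t.reshape(c, h * w).T
    fs = feat_s.reshape(c, h * w).T
    # Pixel correlation (cosine similarity here, one possible choice): the
    # resulting (H*W, H*W) matrices are the first and second similarity maps.
    ft = ft / (np.linalg.norm(ft, axis=1, keepdims=True) + 1e-8)
    fs = fs / (np.linalg.norm(fs, axis=1, keepdims=True) + 1e-8)
    sim_t = ft @ ft.T   # first similarity map
    sim_s = fs @ fs.T   # second similarity map
    # Sum of squared per-pixel differences, divided by (W*H)^2.
    return float(np.sum((sim_t - sim_s) ** 2) / (w * h) ** 2)
```

With identical feature maps the loss is zero; it grows as the pairwise pixel relations of the model to be trained drift away from those of the reference model.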
In one embodiment, as shown in fig. 4, the first prediction result further includes a first probability map and a first approximate binary map: the first probability map represents the probability that each pixel on the image sample predicted by the reference model belongs to text, and the first approximate binary map represents the binary classification result, predicted by the reference model, of whether each pixel on the image sample belongs to text or non-text. The second prediction result further includes a second probability map and a second approximate binary map: the second probability map represents the probability that each pixel on the image sample predicted by the model to be trained belongs to text, and the second approximate binary map represents the binary classification result, predicted by the model to be trained, of whether each pixel on the image sample belongs to text or non-text. Determining the pixel-level loss of the image sample based on the first prediction result and the second prediction result includes:
S402, expanding the first probability map with an expansion function to obtain a first expansion result; and expanding the first approximate binary map with the expansion function to obtain a second expansion result.
The expansion function is a function for expanding the highlight area in an image; the highlight area of an image processed by the expansion function is larger than that of the original image. The computer device expands the first probability map with the expansion function to obtain the first expansion result, and expands the first approximate binary map with the expansion function to obtain the second expansion result. The first probability map represents the probability that each pixel on the image sample predicted by the reference model belongs to text, and the first approximate binary map represents the classification result, predicted by the reference model, of whether each pixel belongs to text or non-text. The text regions represented in the first probability map and the first approximate binary map are shrunk relative to the annotation result of the image sample. Therefore, performing expansion processing on the first probability map and the first approximate binary map with the expansion function ensures that the first expansion result and the second expansion result are closer in size to the annotation result of the image sample.
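A minimal sketch of one possible expansion function, assuming the 2×2 all-ones kernel mentioned later in the text; each output pixel takes the maximum over its 2×2 neighbourhood, so highlight (high-probability) regions grow by one pixel:

```python
import numpy as np

def expand_2x2(img: np.ndarray) -> np.ndarray:
    """Grey-scale expansion (morphological dilation) with a 2x2 kernel of ones.

    Each output pixel is the maximum over the 2x2 neighbourhood anchored at
    that pixel, so the highlight area of the output is larger than that of
    the input. The kernel choice follows the 2x2 all-ones kernel described
    for f_dila; a real pipeline might use OpenCV's dilate instead.
    """
    padded = np.pad(img, ((0, 1), (0, 1)), mode="edge")
    return np.maximum.reduce([
        padded[:-1, :-1], padded[:-1, 1:],
        padded[1:, :-1],  padded[1:, 1:],
    ])
```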
S404, determining the pixel-level loss of the image sample according to the first expansion result, the second expansion result, the second probability map and the second approximate binary map.
The computer device calculates the first loss through a cross-entropy loss calculation formula according to the first expansion result and the second probability map, and determines the second loss through a dice loss calculation formula according to the second expansion result and the second approximate binary map. The computer device then determines the pixel-level loss of the image sample based on the first loss and the second loss. In some embodiments, the sum of the first loss and the second loss may be taken as the pixel-level loss of the image sample. In other embodiments, the sum of the first loss multiplied by a first preset multiple and the second loss multiplied by a second preset multiple may be taken as the pixel-level loss of the image sample. In some embodiments, the pixel-level loss calculation formula is:
loss_distill = γ·l_p(S_out, f_dila(T_out)) + l_b(S_out, f_dila(T_out))
wherein loss_distill represents the pixel-level loss of the image sample, and γ represents a hyperparameter; in some embodiments, γ may take 5. f_dila represents the expansion function; in some embodiments, the kernel of f_dila may be a 2×2 matrix whose elements are all 1. l_p(S_out, f_dila(T_out)) represents the first loss, which is a cross-entropy loss. l_b(S_out, f_dila(T_out)) represents the second loss, which is a dice loss.
The computer device substitutes the first probability map, the second probability map, the first approximate binary map, the second approximate binary map and the expansion function into the pixel-level loss calculation formula to obtain the pixel-level loss of the image sample.
In this embodiment, the first probability map and the first approximate binary map are expanded through an expansion function, the pixel-level loss of the image sample is determined based on the expanded result, the second probability map and the second approximate binary map, and the pixel-level loss of the image sample is combined in model training, which is beneficial to improving the detection accuracy of the text detection model.
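The pixel-level loss above can be sketched as follows. The concrete cross-entropy and dice implementations are illustrative assumptions, and the teacher (reference model) maps are assumed to be already expanded by f_dila:

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-7):
    # l_p: per-pixel binary cross-entropy against (soft) teacher targets.
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def dice_loss(pred, target, eps=1e-7):
    # l_b: dice loss, 1 - 2|X∩Y| / (|X| + |Y|).
    return float(1 - 2 * np.sum(pred * target)
                 / (np.sum(pred) + np.sum(target) + eps))

def pixel_level_loss(prob_s, binary_s, prob_t_dilated, binary_t_dilated,
                     gamma=5.0):
    """loss_distill = γ·l_p(S_out, f_dila(T_out)) + l_b(S_out, f_dila(T_out)).

    prob_s / binary_s: second probability map and second approximate binary
    map (model to be trained); the *_dilated arguments are the reference
    model's maps after the expansion function. γ = 5 as suggested in the text.
    """
    return (gamma * cross_entropy(prob_s, prob_t_dilated)
            + dice_loss(binary_s, binary_t_dilated))
```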
In one embodiment, as shown in fig. 5, the first prediction result includes a first threshold map characterizing the probability that each pixel on the image sample predicted by the reference model belongs to a text boundary; determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample, including:
S502, determining a first true value loss based on the first probability map and the annotation result of the image sample; determining a second true value loss based on the first threshold map and the annotation result of the image sample; and determining a third true value loss based on the first approximate binary map and the annotation result of the image sample.
The computer device determines the first true value loss through a cross-entropy loss calculation formula according to the first probability map and the annotation result of the image sample. The computer device determines the second true value loss through a threshold map loss calculation formula according to the first threshold map and the annotation result of the image sample. The computer device determines the third true value loss through a dice loss calculation formula according to the first approximate binary map and the annotation result of the image sample.
S504, the true loss of the image sample is determined according to the first true loss, the second true loss and the third true loss.
The computer device adds the first true value loss multiplied by a third preset multiple, the second true value loss multiplied by a fourth preset multiple and the third true value loss multiplied by a fifth preset multiple, and uses the obtained sum as the true value loss of the image sample. In some embodiments, the true value loss is calculated by the formula:
loss_gt(S_out, gt) = l_p(S_out, gt) + α·l_b(S_out, gt) + β·l_t(S_out, gt)
wherein loss_gt(S_out, gt) represents the truth loss of the image sample. l_p(S_out, gt) represents the first truth loss, which is a cross-entropy loss. l_t(S_out, gt) represents the second truth loss, which is a threshold map loss. l_b(S_out, gt) represents the third truth loss, which is a dice loss. α and β are preset coefficients.
In this embodiment, a first truth loss, a second truth loss, and a third truth loss are obtained through the first probability map, the first threshold map, the first approximate binary map, and the labeling result, respectively, so as to obtain a truth loss of the image sample. The real value loss of the image samples is combined in the model training, so that the detection precision of the text detection model is improved.
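A sketch of steps S502-S504. The L1 form of the threshold map loss and the α, β values are assumptions; the text only calls them a threshold map loss calculation formula and preset coefficients:

```python
import numpy as np

def cross_entropy(pred, gt, eps=1e-7):
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(gt * np.log(p) + (1 - gt) * np.log(1 - p)))

def dice_loss(pred, gt, eps=1e-7):
    return float(1 - 2 * np.sum(pred * gt) / (np.sum(pred) + np.sum(gt) + eps))

def threshold_map_loss(pred_thresh, gt_thresh):
    # L1 distance, one common choice for a threshold map loss (an assumption).
    return float(np.mean(np.abs(pred_thresh - gt_thresh)))

def truth_loss(prob_map, thresh_map, binary_map, gt, gt_thresh,
               alpha=1.0, beta=10.0):
    """loss_gt = l_p + α·l_b + β·l_t, a sketch of S502-S504.

    prob_map / thresh_map / binary_map: predicted maps; gt / gt_thresh: the
    annotation result. The alpha and beta defaults are illustrative.
    """
    l_p = cross_entropy(prob_map, gt)                # first truth loss
    l_t = threshold_map_loss(thresh_map, gt_thresh)  # second truth loss
    l_b = dice_loss(binary_map, gt)                  # third truth loss
    return l_p + alpha * l_b + beta * l_t
```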
In one embodiment, training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain a text detection model, includes: determining a target loss based on the similarity loss, the pixel level loss and the truth value loss; adjusting parameters in the model to be trained according to the target loss; and continuing training based on the adjusted model until the training stop condition is met, and taking the model obtained after the training is stopped as a text detection model.
The computer device sums the similarity loss, the pixel-level loss and the truth value loss of the image sample to obtain the total loss of the image sample. The target loss is the average of the total losses of the individual image samples in the image sample set. The computer device adjusts the parameters in the model to be trained according to the target loss to obtain an adjusted model, and continues training with the adjusted model until the training stop condition is reached; the model obtained after training stops is taken as the text detection model.
In this embodiment, a target loss is determined through the similarity loss, the pixel level loss and the true value loss, the target loss is used to adjust the model to be trained, and the adjusted model is continuously trained to obtain the text detection model. The model training based on various losses can improve the detection precision of the text detection model compared with direct training.
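The target-loss aggregation described above reduces to averaging per-sample totals; a minimal sketch:

```python
def target_loss(per_sample_losses):
    """Target loss: the mean over image samples of each sample's total loss,
    where the total is similarity loss + pixel-level loss + truth value loss.
    per_sample_losses: iterable of (similarity, pixel_level, truth) triples,
    one per image sample in the set.
    """
    triples = list(per_sample_losses)
    return sum(s + p + g for s, p, g in triples) / len(triples)
```

Each training iteration would evaluate this value and use it to adjust the parameters of the model to be trained.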
In one embodiment, the method for training the text detection model further comprises: acquiring an image to be detected; inputting the image to be detected into a text detection model to obtain a prediction result of the category to which each pixel on the image to be detected belongs; and determining and marking a text region in the image to be detected according to the prediction result of the category to which each pixel on the image to be detected belongs.
The computer device obtains the image to be detected, which contains text. The computer device inputs the image to be detected into the text detection model to obtain a prediction result of the category to which each pixel on the image to be detected belongs. The prediction result indicates, for each pixel, whether its category is text or non-text. The computer device then determines and marks the text region in the image to be detected according to the prediction result of the category to which each pixel belongs. The marked text region is output in the form of a labeling box.
In the embodiment, the image to be detected is input into the text detection model, the text region is determined and marked on the image to be detected based on the prediction result, and text detection is performed based on the text detection model, so that the text detection precision is improved.
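The post-processing step — turning per-pixel class predictions into a marked text region — can be sketched as follows. A real pipeline would use connected-component analysis (e.g. via OpenCV) to emit one labeling box per text instance; this illustration emits a single bounding box over all text pixels:

```python
import numpy as np

def text_boxes(pred_mask: np.ndarray):
    """Turn a per-pixel text/non-text prediction into a labeling box.

    pred_mask: (H, W) array of 0/1 class predictions from the text detection
    model. Returns a list of (x_min, y_min, x_max, y_max) boxes; here a
    single box over all text pixels, as a simplification.
    """
    ys, xs = np.nonzero(pred_mask)
    if len(ys) == 0:
        return []   # no text detected
    return [(int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))]
```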
In order to explain the training method and effect of the text detection model in the present solution in detail, a detailed embodiment is described below:
The application scenario is text detection for form images commonly used in the financial field. Fig. 6 is a schematic general flow chart of the training and application method of the text detection model. The computer device obtains an image sample set, the image sample set including at least one image sample. A first feature map and a first prediction result of the image sample are obtained through the reference model, the first prediction result representing the prediction result of the reference model for the category to which each pixel on the image sample belongs. The first prediction result further includes a first probability map, a first approximate binary map and a first threshold map: the first probability map represents the probability that each pixel on the image sample predicted by the reference model belongs to text, the first approximate binary map represents the binary classification result, predicted by the reference model, of whether each pixel belongs to text or non-text, and the first threshold map represents the probability that each pixel on the image sample predicted by the reference model belongs to a text boundary.
A second feature map and a second prediction result of the image sample are obtained through the model to be trained, the second prediction result representing the prediction result of the model to be trained for the category to which each pixel on the image sample belongs. The second prediction result further includes a second probability map and a second approximate binary map: the second probability map represents the probability that each pixel on the image sample predicted by the model to be trained belongs to text, and the second approximate binary map represents the binary classification result, predicted by the model to be trained, of whether each pixel belongs to text or non-text.
The correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map are obtained; the first feature map and the second feature map are equal in size. The width and height of the first feature map are obtained. A first similarity map is determined according to the correlation between the pixels in the first feature map, and a second similarity map is determined according to the correlation between the pixels in the second feature map. For each pixel in the first similarity map, the square of the difference between the current pixel and the corresponding pixel in the second similarity map is obtained. The squares of the differences corresponding to the pixels in the first similarity map are summed to obtain a summation result. Based on the summation result and the width and height of the first feature map, the similarity loss of the image sample is determined. Illustratively, the similarity loss calculation formula is as follows:
l_pa(S) = (1 / (W × H)²) · Σ_{i∈R} (a_i^T − a_i^S)²
wherein l_pa(S) represents the similarity loss of the image sample, W represents the width of the first feature map, H represents the height of the first feature map, a_i^T represents the pixel value of the first similarity map at position i, a_i^S represents the pixel value of the second similarity map at position i, and R represents the region of the first similarity map.
The first probability map is expanded with the expansion function to obtain a first expansion result, and the first approximate binary map is expanded with the expansion function to obtain a second expansion result. The pixel-level loss of the image sample is determined according to the first expansion result, the second expansion result, the second probability map and the second approximate binary map. The pixel-level loss calculation formula is:
loss_distill = γ·l_p(S_out, f_dila(T_out)) + l_b(S_out, f_dila(T_out))
wherein loss_distill represents the pixel-level loss of the image sample, and γ represents a hyperparameter; in some embodiments, γ may take 5. f_dila represents the expansion function; in some embodiments, the kernel of f_dila may be a 2×2 matrix whose elements are all 1. l_p(S_out, f_dila(T_out)) represents the first loss, which is a cross-entropy loss. l_b(S_out, f_dila(T_out)) represents the second loss, which is a dice loss.
Determining a first truth loss based on the first probability map and the annotation result of the image sample; determining a second true value loss based on the first threshold value graph and the labeling result of the image sample; determining a third true value loss based on the first approximate binary image and the annotation result of the image sample; and determining the true loss of the image sample according to the first true loss, the second true loss and the third true loss. The formula for calculating the true loss is:
loss_gt(S_out, gt) = l_p(S_out, gt) + α·l_b(S_out, gt) + β·l_t(S_out, gt)
wherein loss_gt(S_out, gt) represents the truth loss of the image sample. l_p(S_out, gt) represents the first truth loss, which is a cross-entropy loss. l_t(S_out, gt) represents the second truth loss, which is a threshold map loss. l_b(S_out, gt) represents the third truth loss, which is a dice loss. α and β are preset coefficients.
A target loss is determined based on the similarity loss, the pixel-level loss and the truth value loss. Specifically, the computer device sums the similarity loss, the pixel-level loss and the truth value loss of the image sample to obtain the total loss of the image sample. The target loss is the average of the total losses of the individual image samples in the image sample set. The parameters in the model to be trained are adjusted according to the target loss, and training continues based on the adjusted model until the training stop condition is reached; the model obtained after training stops is taken as the text detection model.
The training of the text detection model further comprises the following steps: and acquiring an image to be detected. And inputting the image to be detected into the text detection model to obtain a prediction result of the category to which each pixel on the image to be detected belongs. And determining and marking a text region in the image to be detected according to the prediction result of the category to which each pixel on the image to be detected belongs.
According to the above training method of the text detection model, the first feature map and the first prediction result of the image samples in the image sample set are obtained through the reference model, and the second feature map and the second prediction result of the image samples in the image sample set are obtained through the model to be trained. The similarity loss of the image sample is determined based on the first feature map and the second feature map; the pixel-level loss of the image sample is determined according to the first prediction result and the second prediction result; and the truth value loss of the image sample is determined according to the second prediction result and the annotation result of the image sample. The model to be trained is trained based on the similarity loss, the pixel-level loss and the truth value loss to obtain the text detection model. The training method combines the feature maps and prediction results of the reference model and the model to be trained to obtain the similarity loss, the pixel-level loss and the truth value loss of the image sample, and performs model training based on these multiple losses, which can improve the detection precision of the text detection model compared with direct training.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the embodiments described above may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a training device of the text detection model for implementing the above-mentioned training method of the text detection model. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme described in the method, so that the specific limitations in the following embodiment of the training device for one or more text detection models may refer to the limitations on the training method for the text detection model in the above description, and are not described herein again.
In one embodiment, as shown in fig. 7, there is provided a training apparatus 100 for a text detection model, comprising: a sample acquisition module 110, a first result acquisition module 120, a second result acquisition module 130, a loss determination module 140, and a model training module 150, wherein:
a sample acquiring module 110, configured to acquire an image sample set, where the image sample set includes at least one image sample;
a first result obtaining module 120, configured to obtain a first feature map and a first prediction result of the image sample through the reference model, where the first prediction result represents a prediction result of the reference model for a category to which each pixel on the image sample belongs;
a second result obtaining module 130, configured to obtain a second feature map and a second prediction result of the image sample through the model to be trained, where the second prediction result represents a prediction result of the model to be trained on a category to which each pixel on the image sample belongs;
a loss determining module 140, configured to determine a similarity loss of the image sample based on the first feature map and the second feature map; determining pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample;
and the model training module 150 is configured to train the model to be trained based on the similarity loss, the pixel level loss and the truth loss to obtain a text detection model.
According to the training device of the text detection model, the first feature map and the first prediction result of the image samples in the image sample set are obtained through the reference model, and the second feature map and the second prediction result of the image samples in the image sample set are obtained through the model to be trained. Determining similarity loss of the image sample based on the first feature map and the second feature map; determining pixel-level loss of the image sample according to the first prediction result and the second prediction result; and determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample. And training the model to be trained based on the similarity loss, the pixel level loss and the truth value loss to obtain a text detection model. The training method of the text detection model combines the feature map and the prediction result which respectively correspond to the reference model and the model to be trained to obtain the similarity loss, the pixel level loss and the truth value loss of the image sample, carries out model training based on various losses, and can improve the detection precision of the text detection model compared with direct training.
In one embodiment, in determining the loss of similarity for the image sample based on the first feature map and the second feature map, the loss determination module 140 is further configured to: obtaining the correlation among all pixels in the first characteristic diagram and the correlation among all pixels in the second characteristic diagram; and determining the similarity loss of the image sample based on the correlation among the pixels in the first feature map and the correlation among the pixels in the second feature map.
In one embodiment, the first profile and the second profile are equal in size; determining a similarity loss aspect of the image sample based on the correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map, the loss determination module 140 further configured to: acquiring the width and the height of the first characteristic diagram; determining a first similarity graph according to the correlation among the pixels in the first feature graph; determining a second similarity graph according to the correlation among the pixels in the second feature graph; for each pixel in the pixels in the first similarity graph, obtaining the square of the difference between the current pixel and the corresponding pixel in the second similarity graph; summing the squares of the differences corresponding to each pixel in the first similarity graph to obtain a summation result; based on the summation result, the width and height of the first feature map, a similarity loss of the image sample is determined.
In one embodiment, the first prediction result further includes a first probability map and a first approximate binary map: the first probability map represents the probability that each pixel on the image sample predicted by the reference model belongs to text, and the first approximate binary map represents the classification result, predicted by the reference model, of whether each pixel belongs to text or non-text. The second prediction result further includes a second probability map and a second approximate binary map: the second probability map represents the probability that each pixel on the image sample predicted by the model to be trained belongs to text, and the second approximate binary map represents the binary classification result, predicted by the model to be trained, of whether each pixel belongs to text or non-text. In determining the pixel-level loss of the image sample based on the first prediction result and the second prediction result, the loss determination module 140 is further configured to: expand the first probability map with an expansion function to obtain a first expansion result; expand the first approximate binary map with the expansion function to obtain a second expansion result; and determine the pixel-level loss of the image sample according to the first expansion result, the second expansion result, the second probability map and the second approximate binary map.
In one embodiment, the first prediction result comprises a first threshold map, and the first threshold map represents the probability that each pixel on the image sample predicted by the reference model belongs to the text boundary; determining a true loss aspect of the image sample according to the second prediction result and the annotation result of the image sample, wherein the loss determining module 140 is further configured to: determining a first true value loss based on the first probability map and the annotation result of the image sample; determining a second true value loss based on the first threshold value graph and the labeling result of the image sample; determining a third true value loss based on the first approximate binary image and the annotation result of the image sample; and determining the true loss of the image sample according to the first true loss, the second true loss and the third true loss.
In one embodiment, in training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain the text detection model, the model training module 150 is further configured to: determining a target loss based on the similarity loss, the pixel level loss and the truth value loss; adjusting parameters in the model to be trained according to the target loss; and continuing training based on the adjusted model until the training stop condition is reached, and taking the model obtained after the training stop as a text detection model.
In one embodiment, the training apparatus 100 for text detection model is further configured to: acquiring an image to be detected; inputting the image to be detected into a text detection model to obtain a prediction result of the category of each pixel on the image to be detected; and determining and marking a text region in the image to be detected according to the prediction result of the category to which each pixel on the image to be detected belongs.
The modules in the training device of the text detection model can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, an Input/Output interface (I/O for short), and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing an image sample set, a first feature map, a first prediction result, a second feature map, a second prediction result, similarity loss, pixel level loss, truth loss, a reference model, a model to be trained and a text detection model. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for communicating with an external computer device through a network connection. The computer program is executed by a processor to implement a method of training a text detection model.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is a block diagram of only a portion of the structure associated with the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than those shown in the figure, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, implementing the following steps:
acquiring an image sample set, wherein the image sample set comprises at least one image sample; acquiring a first feature map and a first prediction result of the image sample through the reference model, wherein the first prediction result represents the prediction result of the reference model for the category to which each pixel on the image sample belongs; acquiring a second feature map and a second prediction result of the image sample through the model to be trained, wherein the second prediction result represents the prediction result of the model to be trained for the category to which each pixel on the image sample belongs; determining a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining a true value loss of the image sample according to the second prediction result and the annotation result of the image sample; and training the model to be trained based on the similarity loss, the pixel-level loss and the true value loss to obtain a text detection model.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring the correlation among all pixels in the first characteristic diagram and the correlation among all pixels in the second characteristic diagram; and determining the similarity loss of the image sample based on the correlation among the pixels in the first feature map and the correlation among the pixels in the second feature map.
In one embodiment, the processor when executing the computer program further performs the steps of:
the first feature map and the second feature map are equal in size; acquiring the width and the height of the first feature map; determining a first similarity map according to the correlation among the pixels in the first feature map; determining a second similarity map according to the correlation among the pixels in the second feature map; for each pixel in the first similarity map, obtaining the square of the difference between the current pixel and the corresponding pixel in the second similarity map; summing the squares of the differences corresponding to each pixel in the first similarity map to obtain a summation result; and determining the similarity loss of the image sample based on the summation result and the width and the height of the first feature map.
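As a concrete illustration, the similarity-loss computation described above can be sketched in Python/NumPy. Using cosine similarity as the pixel-to-pixel correlation and normalising the summation result by the squared number of pixels are assumptions; the embodiment fixes neither choice.

```python
import numpy as np

def similarity_map(feat):
    """Pairwise correlation between all pixels of an H x W x C feature map:
    each pixel's channel vector is L2-normalised, so entry (i, j) is the
    cosine similarity between pixels i and j ((H*W) x (H*W) matrix)."""
    h, w, c = feat.shape
    flat = feat.reshape(h * w, c)
    flat = flat / np.maximum(np.linalg.norm(flat, axis=1, keepdims=True), 1e-12)
    return flat @ flat.T

def similarity_loss(teacher_feat, student_feat):
    # The method requires the two feature maps to be equal in size.
    assert teacher_feat.shape == student_feat.shape
    h, w, _ = teacher_feat.shape
    diff = similarity_map(teacher_feat) - similarity_map(student_feat)
    # Squares of the per-entry differences, summed, then normalised using
    # the width and height (the exact normalisation is an assumption).
    return np.sum(diff ** 2) / (h * w) ** 2

rng = np.random.default_rng(0)
teacher_feat = rng.standard_normal((4, 4, 8))
student_feat = rng.standard_normal((4, 4, 8))
loss = similarity_loss(teacher_feat, student_feat)
```

The loss is zero exactly when the two similarity maps coincide, so the student is rewarded for reproducing the teacher's pixel-to-pixel relationships rather than its raw feature values.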
In one embodiment, the processor, when executing the computer program, further performs the steps of:
the first prediction result further comprises a first probability map and a first approximate binary map, the first probability map represents the probability that each pixel on the image sample predicted by the reference model belongs to text, and the first approximate binary map represents the binary classification result of whether each pixel on the image sample predicted by the reference model belongs to text or non-text; the second prediction result further comprises a second probability map and a second approximate binary map, the second probability map represents the probability that each pixel on the image sample predicted by the model to be trained belongs to text, and the second approximate binary map represents the binary classification result of whether each pixel on the image sample predicted by the model to be trained belongs to text or non-text; expanding the first probability map by adopting an expansion function to obtain a first expansion result; expanding the first approximate binary map by adopting the expansion function to obtain a second expansion result; and determining the pixel-level loss of the image sample according to the first expansion result, the second probability map and the second approximate binary map.
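A minimal sketch of the pixel-level loss follows. Two loud assumptions: the unspecified "expansion function" is modelled as a morphological dilation (max-filter), and the student's maps are matched to the teacher's expanded maps by per-pixel binary cross-entropy; neither choice is taken from the text.

```python
import numpy as np

def expand(m, k=3):
    """Stand-in for the unspecified 'expansion function': a k x k
    morphological max-filter (dilation) over a 2-D map."""
    pad = k // 2
    p = np.pad(m, pad, mode="edge")
    h, w = m.shape
    return np.array([[p[i:i + k, j:j + k].max() for j in range(w)]
                     for i in range(h)])

def pixel_level_loss(t_prob, t_binary, s_prob, s_binary, eps=1e-6):
    """Pull the student's probability and approximate binary maps toward
    the teacher's expanded maps via per-pixel binary cross-entropy
    (one plausible reading of the method, not the patent's exact form)."""
    first_expansion = expand(t_prob)     # expanded first probability map
    second_expansion = expand(t_binary)  # expanded first approximate binary map

    def bce(target, pred):
        pred = np.clip(pred, eps, 1 - eps)
        return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

    return bce(first_expansion, s_prob) + bce(second_expansion, s_binary)

rng = np.random.default_rng(1)
t_prob = rng.random((8, 8))
t_binary = (t_prob > 0.5).astype(float)
s_prob = rng.random((8, 8))
s_binary = rng.random((8, 8))
loss = pixel_level_loss(t_prob, t_binary, s_prob, s_binary)
```

Expanding the teacher's maps before the comparison makes the distillation tolerant of small boundary misalignments between the two models' predictions, which is one plausible motivation for the expansion step.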
In one embodiment, the processor when executing the computer program further performs the steps of:
the first prediction result comprises a first threshold map, and the first threshold map represents the probability that each pixel on the image sample predicted by the reference model belongs to a text boundary; determining a first true value loss based on the first probability map and the annotation result of the image sample; determining a second true value loss based on the first threshold map and the annotation result of the image sample; determining a third true value loss based on the first approximate binary map and the annotation result of the image sample; and determining the true value loss of the image sample according to the first true value loss, the second true value loss and the third true value loss.
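The three true value losses can be sketched as follows. The specific loss forms (binary cross-entropy, L1, Dice) and the equal weighting are assumptions in the spirit of DBNet-style detectors; the text only names the inputs to each loss.

```python
import numpy as np

def true_value_loss(prob_map, thresh_map, binary_map, gt_text, gt_thresh, eps=1e-6):
    """Three supervised terms against the annotations: BCE on the
    probability map, L1 on the threshold map, Dice on the approximate
    binary map. The forms and weights are illustrative assumptions."""
    p = np.clip(prob_map, eps, 1 - eps)
    first = -np.mean(gt_text * np.log(p) + (1 - gt_text) * np.log(1 - p))
    second = np.mean(np.abs(thresh_map - gt_thresh))
    dice = 2 * np.sum(binary_map * gt_text) / (np.sum(binary_map) + np.sum(gt_text) + eps)
    third = 1 - dice
    # How the three terms are weighted is not specified; equal weights here.
    return first + second + third

gt_text = np.zeros((8, 8))
gt_text[2:5, 2:6] = 1.0    # annotated text region
gt_thresh = gt_text * 0.7  # illustrative threshold-map target
loss = true_value_loss(np.full((8, 8), 0.5), np.zeros((8, 8)), gt_text,
                       gt_text, gt_thresh)
```

A perfect prediction drives all three terms toward zero, while a maximally uncertain probability map keeps the loss well above zero.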
In one embodiment, the processor when executing the computer program further performs the steps of:
determining a target loss based on the similarity loss, the pixel-level loss and the true value loss; adjusting parameters in the model to be trained according to the target loss; and continuing training based on the adjusted model until a training stop condition is met, and taking the model obtained after training stops as the text detection model.
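The combination and update step can be sketched as follows. The loss weights and the gradient update are toy stand-ins: the patent does not specify how the target loss weighs its three terms or how parameters are adjusted, so the "model" here is a single parameter with a quadratic loss.

```python
import numpy as np

def target_loss(similarity_loss, pixel_level_loss, true_value_loss,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three losses; the weights alpha, beta, gamma
    are illustrative, not taken from the patent."""
    return alpha * similarity_loss + beta * pixel_level_loss + gamma * true_value_loss

# Toy training loop: compute the target loss, adjust the parameters of
# the model to be trained, and continue until a stop condition is met.
params = np.array([2.0])
for step in range(100):
    loss = target_loss(0.0, 0.0, params[0] ** 2)  # toy quadratic loss
    grad = 2 * params                  # d(loss)/d(params)
    params -= 0.1 * grad               # parameter adjustment
    if loss < 1e-6:                    # training stop condition
        break
text_detection_model = params          # model obtained after training stops
```

In a real implementation the gradient step would be back-propagation through the student network, with the reference model frozen.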
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring an image to be detected; inputting the image to be detected into a text detection model to obtain a prediction result of the category of each pixel on the image to be detected; and determining and marking a text region in the image to be detected according to the prediction result of the category to which each pixel on the image to be detected belongs.
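The detection step above (classify each pixel, then determine and mark text regions) might look like the following minimal sketch. The 0.5 threshold and the connected-component bounding boxes are illustrative choices; production systems typically use contour extraction instead.

```python
import numpy as np

def detect_text_regions(prob_map, threshold=0.5):
    """Binarise the per-pixel predictions, then mark each text region as
    the (x0, y0, x1, y1) bounding box of a 4-connected component."""
    mask = prob_map > threshold
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                stack, ys, xs = [(i, j)], [], []
                seen[i, j] = True
                while stack:            # flood-fill one component
                    y, x = stack.pop()
                    ys.append(y)
                    xs.append(x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

prob = np.zeros((6, 6))
prob[1:3, 1:4] = 0.9    # one predicted text blob
boxes = detect_text_regions(prob)
```

Running this on the toy probability map yields a single box covering the blob.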
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the following steps:
acquiring an image sample set, wherein the image sample set comprises at least one image sample; acquiring a first feature map and a first prediction result of the image sample through the reference model, wherein the first prediction result represents the prediction result of the reference model for the category to which each pixel on the image sample belongs; acquiring a second feature map and a second prediction result of the image sample through the model to be trained, wherein the second prediction result represents the prediction result of the model to be trained for the category to which each pixel on the image sample belongs; determining a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining a true value loss of the image sample according to the second prediction result and the annotation result of the image sample; and training the model to be trained based on the similarity loss, the pixel-level loss and the true value loss to obtain a text detection model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring the correlation among all pixels in the first characteristic diagram and the correlation among all pixels in the second characteristic diagram; and determining the similarity loss of the image sample based on the correlation among the pixels in the first feature map and the correlation among the pixels in the second feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the first feature map and the second feature map are equal in size; acquiring the width and the height of the first feature map; determining a first similarity map according to the correlation among the pixels in the first feature map; determining a second similarity map according to the correlation among the pixels in the second feature map; for each pixel in the first similarity map, obtaining the square of the difference between the current pixel and the corresponding pixel in the second similarity map; summing the squares of the differences corresponding to each pixel in the first similarity map to obtain a summation result; and determining the similarity loss of the image sample based on the summation result and the width and the height of the first feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the first prediction result further comprises a first probability map and a first approximate binary map, the first probability map represents the probability that each pixel on the image sample predicted by the reference model belongs to text, and the first approximate binary map represents the binary classification result of whether each pixel on the image sample predicted by the reference model belongs to text or non-text; the second prediction result further comprises a second probability map and a second approximate binary map, the second probability map represents the probability that each pixel on the image sample predicted by the model to be trained belongs to text, and the second approximate binary map represents the binary classification result of whether each pixel on the image sample predicted by the model to be trained belongs to text or non-text; expanding the first probability map by adopting an expansion function to obtain a first expansion result; expanding the first approximate binary map by adopting the expansion function to obtain a second expansion result; and determining the pixel-level loss of the image sample according to the first expansion result, the second probability map and the second approximate binary map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the first prediction result comprises a first threshold map, and the first threshold map represents the probability that each pixel on the image sample predicted by the reference model belongs to a text boundary; determining a first true value loss based on the first probability map and the annotation result of the image sample; determining a second true value loss based on the first threshold map and the annotation result of the image sample; determining a third true value loss based on the first approximate binary map and the annotation result of the image sample; and determining the true value loss of the image sample according to the first true value loss, the second true value loss and the third true value loss.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a target loss based on the similarity loss, the pixel-level loss and the true value loss; adjusting parameters in the model to be trained according to the target loss; and continuing training based on the adjusted model until a training stop condition is met, and taking the model obtained after training stops as the text detection model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring an image to be detected; inputting the image to be detected into a text detection model to obtain a prediction result of the category of each pixel on the image to be detected; and determining and marking a text region in the image to be detected according to the prediction result of the category to which each pixel on the image to be detected belongs.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
acquiring an image sample set, wherein the image sample set comprises at least one image sample; acquiring a first feature map and a first prediction result of the image sample through the reference model, wherein the first prediction result represents the prediction result of the reference model for the category to which each pixel on the image sample belongs; acquiring a second feature map and a second prediction result of the image sample through the model to be trained, wherein the second prediction result represents the prediction result of the model to be trained for the category to which each pixel on the image sample belongs; determining a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel-level loss of the image sample according to the first prediction result and the second prediction result; determining a true value loss of the image sample according to the second prediction result and the annotation result of the image sample; and training the model to be trained based on the similarity loss, the pixel-level loss and the true value loss to obtain a text detection model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the correlation among all pixels in the first characteristic diagram and the correlation among all pixels in the second characteristic diagram; and determining the similarity loss of the image sample based on the correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the first feature map and the second feature map are equal in size; acquiring the width and the height of the first feature map; determining a first similarity map according to the correlation among the pixels in the first feature map; determining a second similarity map according to the correlation among the pixels in the second feature map; for each pixel in the first similarity map, obtaining the square of the difference between the current pixel and the corresponding pixel in the second similarity map; summing the squares of the differences corresponding to each pixel in the first similarity map to obtain a summation result; and determining the similarity loss of the image sample based on the summation result and the width and the height of the first feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the first prediction result further comprises a first probability map and a first approximate binary map, the first probability map represents the probability that each pixel on the image sample predicted by the reference model belongs to text, and the first approximate binary map represents the binary classification result of whether each pixel on the image sample predicted by the reference model belongs to text or non-text; the second prediction result further comprises a second probability map and a second approximate binary map, the second probability map represents the probability that each pixel on the image sample predicted by the model to be trained belongs to text, and the second approximate binary map represents the binary classification result of whether each pixel on the image sample predicted by the model to be trained belongs to text or non-text; expanding the first probability map by adopting an expansion function to obtain a first expansion result; expanding the first approximate binary map by adopting the expansion function to obtain a second expansion result; and determining the pixel-level loss of the image sample according to the first expansion result, the second probability map and the second approximate binary map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the first prediction result comprises a first threshold map, and the first threshold map represents the probability that each pixel on the image sample predicted by the reference model belongs to a text boundary; determining a first true value loss based on the first probability map and the annotation result of the image sample; determining a second true value loss based on the first threshold map and the annotation result of the image sample; determining a third true value loss based on the first approximate binary map and the annotation result of the image sample; and determining the true value loss of the image sample according to the first true value loss, the second true value loss and the third true value loss.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a target loss based on the similarity loss, the pixel-level loss and the true value loss; adjusting parameters in the model to be trained according to the target loss; and continuing training based on the adjusted model until a training stop condition is met, and taking the model obtained after training stops as the text detection model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring an image to be detected; inputting the image to be detected into a text detection model to obtain a prediction result of the category to which each pixel on the image to be detected belongs; and determining and marking a text region in the image to be detected according to the prediction result of the category to which each pixel on the image to be detected belongs.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, databases, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (11)

1. A method for training a text detection model, the method comprising:
acquiring an image sample set, wherein the image sample set comprises at least one image sample;
acquiring a first feature map and a first prediction result of the image sample through a reference model, wherein the first prediction result represents the prediction result of the reference model on the category of each pixel on the image sample;
acquiring a second feature map and a second prediction result of the image sample through a model to be trained, wherein the second prediction result represents the prediction result of the model to be trained on the category of each pixel on the image sample;
determining a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel level loss of the image sample according to the first prediction result and the second prediction result; determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample;
and training the model to be trained based on the similarity loss, the pixel level loss and the truth value loss to obtain a text detection model.
2. The method of claim 1, wherein determining the loss of similarity for the image sample based on the first feature map and the second feature map comprises:
obtaining the correlation among all pixels in the first feature map and the correlation among all pixels in the second feature map;
and determining the similarity loss of the image sample based on the correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map.
3. The method of claim 2, wherein the first feature map and the second feature map are equal in size;
the determining the similarity loss of the image sample based on the correlation between the pixels in the first feature map and the correlation between the pixels in the second feature map comprises:
acquiring the width and the height of the first characteristic diagram;
determining a first similarity graph according to the correlation among the pixels in the first feature graph;
determining a second similarity graph according to the correlation among all pixels in the second feature graph;
for each pixel in the pixels in the first similarity map, obtaining the square of the difference between the current pixel and the corresponding pixel in the second similarity map;
summing the squares of the differences corresponding to each pixel in the first similarity graph to obtain a summation result;
determining a similarity loss of the image sample based on the summation result, the width and the height of the first feature map.
4. The method of claim 1, wherein the first prediction result further comprises a first probability map and a first approximate binary map, the first probability map representing the probability that each pixel on the image sample predicted by the reference model belongs to text, and the first approximate binary map representing the binary classification result of whether each pixel on the image sample predicted by the reference model belongs to text or non-text; the second prediction result further comprises a second probability map and a second approximate binary map, the second probability map representing the probability that each pixel on the image sample predicted by the model to be trained belongs to text, and the second approximate binary map representing the binary classification result of whether each pixel on the image sample predicted by the model to be trained belongs to text or non-text;
said determining a pixel-level loss for the image sample based on the first prediction and the second prediction comprises:
expanding the first probability map by adopting an expansion function to obtain a first expansion result;
expanding the first approximate binary image by adopting the expansion function to obtain a second expansion result;
determining a pixel-level loss of the image sample according to the first expansion result, the second probability map, and the second approximate binary map.
5. The method of claim 4, wherein the first prediction result comprises a first threshold map, and the first threshold map characterizes a probability that each pixel on the image sample predicted by the reference model belongs to a text boundary;
the determining a true value loss of the image sample according to the second prediction result and the annotation result of the image sample includes:
determining a first true value loss based on the first probability map and the annotation result of the image sample;
determining a second true value loss based on the first threshold map and the labeling result of the image sample;
determining a third truth loss based on the first approximate binary image and an annotation result of the image sample;
determining a true loss for the image sample based on the first true loss, the second true loss, and the third true loss.
6. The method according to claim 1, wherein training the model to be trained based on the similarity loss, the pixel-level loss, and the truth loss to obtain a text detection model comprises:
determining a target penalty based on the similarity penalty, the pixel level penalty, and the truth penalty;
adjusting parameters in the model to be trained according to the target loss;
and continuing training based on the adjusted model until the training stop condition is reached, and taking the model obtained after the training stop as a text detection model.
7. The method of any one of claims 1 to 6, further comprising:
acquiring an image to be detected;
inputting the image to be detected into the text detection model to obtain a prediction result of the category to which each pixel on the image to be detected belongs;
and determining and marking a text area in the image to be detected according to the prediction result of the category of each pixel on the image to be detected.
8. An apparatus for training a text detection model, the apparatus comprising:
a sample acquisition module for acquiring a set of image samples, the set of image samples comprising at least one image sample;
a first result obtaining module, configured to obtain a first feature map and a first prediction result of the image sample through a reference model, where the first prediction result represents a prediction result of the reference model for a category to which each pixel on the image sample belongs;
a second result obtaining module, configured to obtain a second feature map and a second prediction result of the image sample through the model to be trained, wherein the second prediction result represents a prediction result of the model to be trained for the category to which each pixel on the image sample belongs;
a loss determination module, configured to determine a similarity loss of the image sample based on the first feature map and the second feature map; determining a pixel level loss of the image sample according to the first prediction result and the second prediction result; determining the true value loss of the image sample according to the second prediction result and the annotation result of the image sample;
and the model training module is used for training the model to be trained based on the similarity loss, the pixel level loss and the truth value loss to obtain a text detection model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
11. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202211423572.1A 2022-11-15 2022-11-15 Training method and device of text detection model, computer equipment and storage medium Pending CN115713769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211423572.1A CN115713769A (en) 2022-11-15 2022-11-15 Training method and device of text detection model, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211423572.1A CN115713769A (en) 2022-11-15 2022-11-15 Training method and device of text detection model, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115713769A true CN115713769A (en) 2023-02-24

Family

ID=85233241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211423572.1A Pending CN115713769A (en) 2022-11-15 2022-11-15 Training method and device of text detection model, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115713769A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453105A (en) * 2023-06-20 2023-07-18 青岛国实科技集团有限公司 Ship license plate identification method and system based on knowledge distillation deep neural network
CN116453105B (en) * 2023-06-20 2023-08-18 青岛国实科技集团有限公司 Ship license plate identification method and system based on knowledge distillation deep neural network

Similar Documents

Publication Publication Date Title
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110930417B (en) Training method and device for image segmentation model, and image segmentation method and device
CN110489951B (en) Risk identification method and device, computer equipment and storage medium
US20200097742A1 (en) Training neural networks for vehicle re-identification
CN110414344B (en) Character classification method based on video, intelligent terminal and storage medium
CN111814794B (en) Text detection method and device, electronic equipment and storage medium
CN112598643B (en) Depth fake image detection and model training method, device, equipment and medium
US20180025249A1 (en) Object Detection System and Object Detection Method
CN109886330B (en) Text detection method and device, computer readable storage medium and computer equipment
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN106599900A (en) Method and device for recognizing character string in image
CN114897779A (en) Cervical cytology image abnormal area positioning method and device based on fusion attention
CN110569738A (en) natural scene text detection method, equipment and medium based on dense connection network
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN107886082B (en) Method and device for detecting mathematical formulas in images, computer equipment and storage medium
CN114676777B (en) Self-supervision learning fine-granularity image classification method based on twin network
CN111523463B (en) Target tracking method and training method based on matching-regression network
WO2021164280A1 (en) Three-dimensional edge detection method and apparatus, storage medium and computer device
US20190266443A1 (en) Text image processing using stroke-aware max-min pooling for ocr system employing artificial neural network
CN111223128A (en) Target tracking method, device, equipment and storage medium
CN113837151A (en) Table image processing method and device, computer equipment and readable storage medium
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium
CN115690672A (en) Abnormal image recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination