CN112818975B - Text detection model training method and device, text detection method and device
- Publication number: CN112818975B (application number CN202110109985A)
- Authority: CN (China)
- Prior art keywords: feature, training, text detection, detection model, text
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/253: Fusion techniques of extracted features
- G06V2201/07: Target detection
Abstract
The present application provides a text detection model training method and device and a text detection method and device, wherein the text detection model training method includes the following steps: inputting a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box; extracting, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image; pooling the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales; fusing the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes; and determining a target prediction box among the plurality of prediction boxes, determining a loss value based on the target prediction box and the annotation box corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text detection model training method and apparatus, a text detection method and apparatus, a computing device, and a computer readable storage medium.
Background
With the rapid development of computer technology, the field of image processing has advanced quickly, and text detection is a very important branch of it.

Existing text detection mostly relies on manually annotated text pictures as model training data. Annotating training pictures consumes a great deal of manpower and material resources, and purchasing annotation data is expensive. Moreover, most existing text detection models do not consider the relations between image channels, so text regions with complex backgrounds (e.g., complex colors or textures) are often missed, and the finally determined text positions are often inaccurate or misjudged.

Therefore, how to solve the above problems is an urgent issue for practitioners.
Disclosure of Invention
In view of this, embodiments of the present application provide a text detection model training method and apparatus, a text detection method and apparatus, a computing device, and a computer readable storage medium, so as to solve the technical defects existing in the prior art.
According to a first aspect of the embodiments of the present application, there is provided a text detection model training method, including:

inputting a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, and the text detection model includes a feature extraction layer, a feature pooling layer, and a feature fusion layer;

extracting, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image;

pooling the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales;

fusing the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes;

and determining a target prediction box among the plurality of prediction boxes, determining a loss value based on the target prediction box and the annotation box corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached.
According to a second aspect of the embodiments of the present application, there is provided a text detection method, including:

acquiring an image to be detected, wherein the image to be detected includes text to be detected;

inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained through the above text detection model training method;

and the text detection model generating, in response to the input image to be detected, a predicted text box corresponding to the text to be detected.
According to a third aspect of the embodiments of the present application, there is provided a text detection model training device, including:

an obtaining module, configured to input a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, and the text detection model includes a feature extraction layer, a feature pooling layer, and a feature fusion layer;

an extraction module, configured to extract, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image;

a pooling module, configured to pool the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales;

a fusion module, configured to fuse the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes;

and a training module, configured to determine a target prediction box among the plurality of prediction boxes, determine a loss value based on the target prediction box and the annotation box corresponding to the target training image, and train the text detection model according to the loss value until a training stop condition is reached.
According to a fourth aspect of the embodiments of the present application, there is provided a text detection device, including:

an acquisition module, configured to acquire an image to be detected, wherein the image to be detected includes text to be detected;

an input module, configured to input the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained through the above text detection model training method;

and a generation module, configured to cause the text detection model to generate, in response to the input image to be detected, a predicted text box corresponding to the text to be detected.
According to a fifth aspect of embodiments of the present application, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the text detection model training method or text detection method when executing the instructions.
According to a sixth aspect of embodiments of the present application, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the text detection model training method or text detection method.
According to a seventh aspect of embodiments of the present application, there is provided a chip storing computer instructions which, when executed by the chip, implement the steps of the text detection model training method or the text detection method.
The text detection model training method provided by the embodiments of the present application includes: inputting a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, and the text detection model includes a feature extraction layer, a feature pooling layer, and a feature fusion layer; extracting, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image; pooling the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales; fusing the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes; and determining a target prediction box among the plurality of prediction boxes, determining a loss value based on the target prediction box and the annotation box corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached. In the text detection model provided by this method, the feature extraction layer effectively strengthens the connections between features and improves accuracy on text in complex background regions; adding the feature pooling layer's network structure effectively enlarges the receptive field over the target region and reduces missed detections of small targets, enhancing the recognition accuracy and efficiency of the text detection model as a whole.

In addition, a novel form of data augmentation is adopted, which alleviates the problems of insufficient manually annotated data and inaccurate recognition caused by target occlusion, while strengthening the generalization of the text detection model.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;

FIG. 2 is a flowchart of a text detection model training method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a text detection model training method according to another embodiment of the present application;

FIG. 4 is a schematic flowchart of a text detection method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a text detection model training device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a text detection device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present application may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present application is not limited to the specific embodiments disclosed below.
The terminology used in the one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the application. As used in one or more embodiments of the application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the application. The word "if" as used herein may be interpreted as "responsive to a determination" depending on the context.
First, the terms involved in one or more embodiments of the present application are explained.

Text detection: given a text image, automatically locating the position of the text.

K-means clustering: an iterative clustering analysis algorithm. The data are divided into K groups in advance, K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to its nearest cluster center. After each assignment, the cluster center is recalculated from the objects currently in the cluster.

YOLOv3: a target detection network built on the Darknet-53 structure; Darknet-53 is a feature extraction network based on residual structures.

FPN: Feature Pyramid Network, a multi-scale object detection method.

Attention mechanism: a mechanism for allocating resources, which can be understood as redistributing resources that would otherwise be allocated equally, according to the importance of the attended objects.

ASPP: atrous spatial pyramid pooling, a method that samples a given input in parallel with atrous convolutions of different sampling rates.

Logical layer: a network structure used to classify the detection boxes.

IoU: intersection over union, a metric for calculating and evaluating the degree of overlap between detection boxes.
In the present application, a text detection model training method and apparatus, a text detection method and apparatus, a computing device, and a computer-readable storage medium are provided, and detailed descriptions are given one by one in the following embodiments.
FIG. 1 illustrates a block diagram of a computing device 100, according to an embodiment of the application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. Processor 120 is coupled to memory 110 via bus 130 and database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of computing device 100, as well as other components not shown in FIG. 1, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 1 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the text detection model training method shown in fig. 2. FIG. 2 shows a flow chart of a text detection model training method according to an embodiment of the application, including steps 202 to 210.
Step 202: inputting a target training image into a text detection model, wherein the target training image is marked with a corresponding marking frame, and the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer.
The target training image is a text image for training a text detection model, and the text image is marked with a corresponding marking box which is used for marking a text area needing to be identified in the text image.
The text detection module at least comprises a feature extraction layer, a feature pooling layer and a feature fusion layer, wherein the feature extraction layer is preferably a feature extraction layer for fusing an attention mechanism.
In practical applications, before inputting the target training image into the text detection model, the method further comprises:
acquiring a target training image from a preset training set.

The preset training set is a training set of text images, in which a large number of target training images are stored.
In practical applications, the target training images in the preset training set are largely target detection images annotated manually, and annotating images by hand consumes a great deal of labor and time. The training data in the training set can therefore be expanded through data augmentation. Specifically, acquiring the target training image in the preset training set includes:

acquiring an initial training set, wherein the initial training set includes a plurality of training images;

and performing data augmentation on the plurality of training images to generate an augmented training set.

The data augmentation of the training images includes: applying any one or a combination of random cropping, random translation, contrast change, brightness change, transparency change, random occlusion, and random filling to the training images.

After the initial training set is obtained, data augmentation is performed on the sample images in the training set. Besides random cropping, stretching, and changes of contrast, brightness, and transparency, the augmentation methods include the Cutout algorithm (random occlusion) and the FMix algorithm (random filling). The Cutout algorithm randomly selects a square region of fixed size and fills it entirely with zeros; the FMix algorithm binarizes the image according to its high- and low-frequency regions and then weights the pixels with the resulting mask. These two augmentation algorithms are introduced to address insufficient data and occluded targets. Data augmentation enlarges the training set, effectively alleviates model overfitting, and gives the model stronger generalization ability; a sketch of the occlusion-style augmentation follows below.
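To make the occlusion-style augmentation concrete, the following is a minimal Cutout-style sketch; the square patch and the all-zero filling follow the description above, while the function and parameter names are illustrative assumptions:

```python
import numpy as np

def cutout(image: np.ndarray, size: int = 32, rng=np.random) -> np.ndarray:
    """Randomly select a fixed-size square region and fill it entirely with zeros."""
    h, w = image.shape[:2]
    cy, cx = rng.randint(h), rng.randint(w)          # random patch center
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y1:y2, x1:x2] = 0                            # all-zero occlusion patch
    return out
```

FMix would follow the same interface but build its mask by thresholding a low-frequency noise image rather than using a square patch.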
In target detection there is generally the concept of prior boxes: widths and heights of common targets are preset and used to assist prediction. Prior box sizes are generally obtained through K-means clustering; for example, in YOLOv3, 9 prior boxes are obtained through the K-means clustering algorithm and assigned 3 each to the large, medium, and small scales. In practical applications, the prior box sizes of each scale are generated by clustering the training data. In the present application, the prior box sizes may be 116×90, 156×198, 373×326, 30×61, 62×45, 59×119, 10×13, 16×30, and 33×23; or 5×24, 5×36, 6×25, 9×65, 9×48, 9×70, 14×155, 15×178, 16×180, etc. Larger prior boxes (e.g., 14×155, 15×178, 16×180) are applied on the smaller-scale feature maps and are suitable for detecting larger objects; medium prior boxes (e.g., 9×65, 9×48, 9×70) are applied on the medium-scale feature maps and are suitable for detecting medium-sized objects; and smaller prior boxes (e.g., 5×24, 5×36, 6×25) are applied on the larger-scale feature maps and are suitable for detecting smaller objects. A sketch of the prior-box clustering follows below.
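The following is a minimal sketch of obtaining prior box sizes by K-means clustering over the annotated box dimensions; the 1 - IoU distance (computed as if boxes shared a corner) is the common YOLO-style choice and, like the helper names, is an assumption rather than something specified by the patent:

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster (width, height) pairs of annotation boxes into k prior box sizes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # IoU of every box with every center, treating boxes as corner-aligned
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = wh[:, 0:1] * wh[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
        assign = (1.0 - inter / union).argmin(axis=1)     # nearest center by 1 - IoU
        new_centers = np.array([wh[assign == i].mean(axis=0) if (assign == i).any()
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]      # sorted by area, small to large
```

The 3 smallest clusters would then be assigned to the largest-scale feature map and the 3 largest to the smallest-scale one, matching the assignment described above.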
The training images may include various scenes, such as live-streaming scenes, game scenes, and outdoor scenes, and the text they contain may vary in font, shape, and language; at least one of image or text can be identified from each training image. Each training image contains a manually annotated annotation box: the position of the box is the position to be recognized, and its content is the content to be recognized. The annotation box is usually rectangular, though other polygonal boxes are possible, and the present application is not limited in this respect. In practical applications, the preset training set is divided into two parts, a training subset and a test subset: during model training, target training images are obtained from the training subset, and after training is completed, target detection images are obtained from the test subset to evaluate the model's performance.
The text detection model trained in the present application detects the positions of text regions in text images. It can locate text in an image quickly and accurately, which saves time in subsequent text recognition and improves efficiency. The text detection model provided by the present application includes a feature extraction layer that incorporates an attention mechanism, a feature pooling layer, and a feature fusion layer.
In a specific embodiment provided by the present application, the target training image is a photo of resume A, with annotation boxes marked at the name, age, and telephone fields of resume A. The target training image, resume A, is input into the text detection model for training, where the text detection model includes a feature extraction layer incorporating an attention mechanism, a feature pooling layer, and a feature fusion layer.
Step 204: and extracting a plurality of initial feature graphs with different scales corresponding to the target training image through the feature extraction layer.
The feature extraction layer preferably merges the feature extraction layer of the attention mechanism, the feature extraction layer of the attention mechanism includes a plurality of channels, the attention mechanism merges among the channels, and correspondingly, the feature extraction layer extracts a plurality of initial feature graphs with different scales corresponding to the target training image, including:
extracting a plurality of initial feature images with different scales corresponding to the target training image through the multiple channels and the attention mechanism fused among the multiple channels.
The feature extraction layer incorporating the attention mechanism preferably uses the Darknet-53 structure in the modified Yolov, i.e., increases the inter-channel attention mechanism based on Darknet-53 in Yolov 3. Darknet-53 in Yolov is a full convolution network, and is used for extracting a plurality of initial feature graphs with different scales corresponding to a target training image, specifically, feature extraction is performed on the target training image through different feature channels, feature remembering is weighted in the channel dimension through an attention mechanism, detection performance is improved, connection between channel features is enhanced, and a good effect is achieved on text regions with complex detection features.
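The patent does not spell out the exact attention structure; as one plausible reading, the following squeeze-and-excitation-style block weights the feature maps along the channel dimension (PyTorch is assumed purely for illustration):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Reweights each channel of a feature map by a learned importance score."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool to (b, c)
        return x * w.view(b, c, 1, 1)     # excite: rescale each channel
```

Inserting such a block after the residual stages of Darknet-53 is one way to realize the inter-channel attention described here.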
The feature extraction layer extracts initial feature maps of the target training image at different scales and outputs 3 feature maps X1, X2, and X3 of different scales, each with a depth of 255 and side lengths in the ratio 13:26:52; 3 prediction boxes are output for each feature map, 9 in total.

In a specific embodiment provided by the present application, following the above example, the photo of resume A is input into Darknet-53-attention (the feature extraction layer incorporating the attention mechanism) for feature extraction, yielding a feature map X1 at the large-target scale, a feature map X2 at the medium-target scale, and a feature map X3 at the small-target scale.
Step 206: and pooling the plurality of initial feature graphs with different scales through the feature pooling layer to obtain a plurality of enhanced feature graphs with different scales.
In practical application, in order to ensure that the characteristics of the picture have a larger receptive field, and hope that the resolution of the characteristic picture does not drop too much (the resolution drops too much and the detail information of the image boundary is lost), the problem can be solved by a hole convolution method, and preferably, in the text detection model provided by the application, the characteristic pooling layer comprises a hole space convolution pooling pyramid;
correspondingly, the feature pooling layer pools the initial feature graphs with the multiple different scales, including:
and pooling the initial feature graphs with the different scales through the cavity space convolution pooling pyramid.
The hole space convolves the pooling pyramid (atrous SPATIAL PYRAMID pooling, ASPP), and the ASPP layer functions to similarly increase the receptive field without using pooling and downsampling operations. Each output of the convolution has a larger range of information, the perception field of view of a target area is increased, the cavity convolution with different sampling rates can effectively capture more scale information, and the missing detection phenomenon of a small target object is reduced.
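The patent names ASPP but not its configuration; the following minimal sketch shows the parallel-sampling idea, with the dilation rates and channel counts as assumptions:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel atrous convolutions with different rates, concatenated and fused."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))  # fuse the multi-rate branches
```

Because the padding equals the dilation rate, every branch keeps the input resolution while its effective receptive field grows with the rate.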
In a specific embodiment of the present application, following the above example, the feature maps X1, X2, and X3 of different scales are input into the feature pooling layer for processing, yielding a plurality of feature-enhanced feature maps Y1, Y2, and Y3 of different scales.
Step 208: and fusing the plurality of enhancement feature graphs with different scales through the feature fusion layer to obtain a plurality of prediction frames.
Preferably, the feature fusion layer comprises a feature map pyramid network;
Fusing the enhanced feature graphs with different scales through the feature fusion layer to obtain a plurality of prediction frames, wherein the method comprises the following steps:
And fusing the enhancement feature graphs with different scales through the feature graph pyramid network to obtain a plurality of prediction frames and scores corresponding to each prediction frame.
The feature map pyramid network (Feature Pyramin Networks, FPN), FPN solves the multi-scale problem in object detection, can be changed through network connection, greatly improves the performance of small object detection on the premise of basically not increasing the calculation amount of an original model, has less feature semantic information of a bottom layer in feature maps of different scales, but accurate target positions, rich feature semantic information of a high layer, but rough target positions, performs prediction of the FPN independently based on the feature maps of different scales, performs up-sampling through the high layer features and performs top-down connection of the features of the bottom layer, performs corresponding prediction on each layer, outputs a plurality of different prediction results, finally generates a plurality of prediction frames, and also generates scores corresponding to each prediction frame.
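The following is a minimal sketch of the top-down connection described above (the lateral channel counts are assumptions; the prediction heads at each level are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Top-down fusion: upsample each higher level and merge it into the level below."""
    def __init__(self, channels=(256, 512, 1024), out_ch: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1) for c in channels)

    def forward(self, feats):
        # feats: feature maps ordered from large scale (shallow) to small scale (deep)
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):
            up = F.interpolate(laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
            laterals[i - 1] = laterals[i - 1] + up    # top-down connection
        return laterals  # one fused map per scale; each would feed a prediction head
```

Each fused map combines the accurate positions of the low levels with the rich semantics of the high levels, which is why predictions are made independently at every scale.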
In a specific embodiment of the present application, following the above example, the enhanced feature maps Y1, Y2, and Y3 of different scales are input into the feature fusion layer for processing, generating a plurality of prediction boxes, each with a corresponding score.
Step 210: and determining a target prediction frame in the plurality of prediction frames, determining a loss value based on the target prediction frame and a labeling frame corresponding to the target training image, and training the text detection model according to the loss value until a training stopping condition is reached.
In practical application, after obtaining the prediction frames in the above steps, the score corresponding to each prediction frame may also be obtained, and accordingly, determining the target prediction frame in the multiple prediction frames includes: and determining the prediction frame with the highest score as a target prediction frame.
In practical application, determining the loss value based on the target prediction frame and the label frame corresponding to the target training image includes:
and determining a loss value based on the position information of the target prediction frame and the position information of the annotation frame corresponding to the target training image.
After the predicted target prediction frame is obtained, the position information of the target prediction frame can be determined according to the coordinate of a certain vertex of the target prediction frame and the length and width of the target prediction frame, meanwhile, the position information of the labeling frame can be determined according to the coordinate of a certain vertex of the labeling frame corresponding to the target training object and the length and width of the labeling frame, the loss value can be determined based on the position information of the target prediction frame and the position information of the labeling frame corresponding to the target training image, and a plurality of methods for determining the loss value, such as a cross entropy loss function, a maximum loss function, an average loss function and the like, are not limited in the specific mode for calculating the loss value, and the method is based on practical application.
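The patent deliberately leaves the loss open; as one concrete possibility, the following sketch computes a simple mean squared error over the (x, y, w, h) position information of the two boxes (this choice of loss is an assumption, not the patent's):

```python
def box_position_loss(pred, target):
    """Mean squared error over the (x, y, w, h) position information of two boxes."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / 4.0

# e.g. box_position_loss((10, 20, 50, 30), (12, 18, 48, 33)) -> 5.25
```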
Optionally, training the text detection model according to the loss value includes:

adjusting the model parameters of the feature extraction layer, the feature pooling layer, and the feature fusion layer in the text detection model according to the loss value.

That is, the text detection model is trained by adjusting, according to the loss value, the model parameters of its feature extraction layer, feature pooling layer, and feature fusion layer.

Specifically, among the prediction boxes of different scales, the one with the highest score is selected as the target prediction box according to the scores, the loss value is determined against the annotation box in the target training image, and the parameters of the text detection model are adjusted by backpropagating the loss value until a training stop condition is reached. The training stop condition may be a preset number of training rounds, the loss value falling below a preset threshold, or, in a test using the target detection images in the test subset, the overlap between the obtained target prediction boxes and the annotation boxes exceeding a preset threshold; the present application does not specifically limit the training stop condition. A sketch of such a training loop follows below.
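Putting the pieces together, the following is a minimal training-loop sketch under assumed names: the model is taken to return prediction boxes with scores, the data loader to yield images with their annotation boxes, and `box_position_loss` is the sketch above (PyTorch assumed for illustration):

```python
import torch

def train(model, loader, epochs: int = 50, loss_threshold: float = 0.05):
    """Train until a preset number of rounds or until the loss falls below a threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        for images, annotation_boxes in loader:
            pred_boxes, scores = model(images)        # boxes from the feature fusion layer
            target_box = pred_boxes[scores.argmax()]  # highest-scoring prediction box
            loss = box_position_loss(target_box, annotation_boxes[0])
            optimizer.zero_grad()
            loss.backward()                           # backpropagate the loss value
            optimizer.step()
        if loss.item() < loss_threshold:              # one possible stop condition
            break
```

An equally valid stop condition, per the description above, is evaluating IoU against the annotation boxes on the test subset and stopping once it exceeds a preset threshold.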
In a specific embodiment provided by the present application, the target prediction box 1 corresponding to the name in the resume, the target prediction box 2 corresponding to the age, and the target prediction box 3 corresponding to the telephone are determined respectively; the loss value is calculated from target prediction boxes 1, 2, and 3 and the annotation boxes marked in the resume, and the parameters of the text detection model are then adjusted by backpropagating the loss value. After a preset number of rounds, the text detection model is evaluated on the target detection images in the test subset; when the overlap between the prediction boxes output by the text detection model and the annotation boxes in the target test images reaches 95% or more, i.e., the IoU value exceeds 0.95, the text detection model has been trained successfully.
The text detection model training method provided by this embodiment of the present application includes: inputting a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, and the text detection model includes a feature extraction layer, a feature pooling layer, and a feature fusion layer; extracting, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image; pooling the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales; fusing the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes; and determining a target prediction box among the plurality of prediction boxes, determining a loss value based on the target prediction box and the annotation box corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached. In the text detection model provided by this method, the feature extraction layer effectively strengthens the connections between features and improves accuracy on text in complex background regions; adding the feature pooling layer's network structure effectively enlarges the receptive field over the target region and reduces missed detections of small targets, enhancing the recognition accuracy and efficiency of the text detection model as a whole.

In addition, a novel form of data augmentation is adopted, which alleviates the problems of insufficient manually annotated data and inaccurate recognition caused by target occlusion, while strengthening the generalization of the text detection model.
Fig. 3 is a schematic diagram of a text detection model training method according to an embodiment of the present application, and as shown in fig. 3, the method includes steps 302 to 312.
Step 302: an initial training set is obtained.
Step 304: and carrying out data amplification processing on the plurality of training images to generate a training set after data amplification.
Step 306: and determining a priori frame of the training image through K-means clustering, and inputting the training image into a text detection model.
The text detection model comprises a feature extraction layer integrating an attention mechanism, a cavity space convolution pooling pyramid and a feature map pyramid network, wherein training images are input into the feature extraction layer integrating the attention mechanism to perform feature extraction, and a plurality of initial feature maps with different scales are obtained; inputting a plurality of initial feature images with different scales into a space cavity convolution pooling pyramid network for feature enhancement to obtain a plurality of enhancement feature images with different scales; and inputting the initial feature images with different scales and the enhanced feature images with different scales into a feature image pyramid network to perform feature fusion, and outputting a plurality of prediction frames.
Step 308: and acquiring a plurality of prediction frames output by the text detection model, and determining a target prediction frame in the plurality of prediction frames.
Step 310: and determining a loss value based on the target prediction frame and a label frame corresponding to the target training image.
Step 312: and training the text detection model according to the loss value until a training stopping condition is reached.
According to the text detection model training method provided by this embodiment of the present application, the feature extraction layer of the text detection model effectively strengthens the connections between features and improves accuracy on text in complex background regions; meanwhile, adding the feature pooling layer's network structure effectively enlarges the receptive field over the target region and reduces missed detections of small targets, enhancing the recognition accuracy and efficiency of the text detection model as a whole.

In addition, a novel form of data augmentation is adopted, which alleviates the problems of insufficient manually annotated data and inaccurate recognition caused by target occlusion, while strengthening the generalization of the text detection model.
Fig. 4 is a flow chart of a text detection method according to an embodiment of the present application, which is described by taking text detection of a resume as an example, and includes steps 402 to 406.
Step 402: and obtaining an image to be detected, wherein the image to be detected comprises a text to be detected.
In a specific embodiment provided by the application, the obtained resume picture is an image to be detected, and the contents of names, sexes, birth months, native places, contact ways, working experiences and the like in the resume are texts to be detected.
Step 404: and inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained by training through the text detection model training method.
In one embodiment of the present application, a resume picture is input to a pre-trained text detection model.
Step 406: the text detection model generates a predictive text box corresponding to the text to be detected in response to the image to be detected as input.
In a specific embodiment of the present application, the text detection model responds to the resume picture as input, and generates a predicted text box on the resume picture, where the predicted text box corresponds to the name, gender, birth month, native place, contact way, work experience, etc. in the resume picture.
Optionally, the method further comprises:
performing text recognition on the content in the predicted text box based on the predicted text box;

and obtaining text content information corresponding to the text to be detected.

In a specific embodiment provided by the present application, text recognition is performed on the content in the resume picture based on the predicted text boxes corresponding to the name, gender, date of birth, native place, contact information, work experience, and other content, obtaining the text content in the corresponding predicted text boxes, e.g., name: Zhang San; gender: male; date of birth: **** (masked in the example); native place: somewhere; and so on. The obtained text content is then filled into a preset structured table, realizing the conversion of the resume picture into a text resume.
The text detection method includes: acquiring an image to be detected, wherein the image to be detected includes text to be detected; inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained through the above text detection model training method; and the text detection model generating, in response to the input image to be detected, a predicted text box corresponding to the text to be detected. With this text detection method, text images with complex backgrounds can be recognized better, text content can be extracted more accurately, and the recognition effect on text pictures is improved.
Corresponding to the above embodiment of the text detection model training method, the present application further provides an embodiment of the text detection model training device, and fig. 5 shows a schematic structural diagram of the text detection model training device according to one embodiment of the present application. As shown in fig. 5, the apparatus includes:
an obtaining module 502, configured to input a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, and the text detection model includes a feature extraction layer, a feature pooling layer, and a feature fusion layer;

an extraction module 504, configured to extract, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image;

a pooling module 506, configured to pool the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales;

a fusion module 508, configured to fuse the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes;

and a training module 510, configured to determine a target prediction box among the plurality of prediction boxes, determine a loss value based on the target prediction box and the annotation box corresponding to the target training image, and train the text detection model according to the loss value until a training stop condition is reached.
Optionally, the obtaining module 502 is further configured to acquire a target training image from a preset training set.
Optionally, the obtaining module 502 is further configured to:
acquire an initial training set, wherein the initial training set includes a plurality of training images;

and perform data augmentation on the plurality of training images to generate an augmented training set.
Optionally, the obtaining module 502 is further configured to:
perform data augmentation on the plurality of training images in any one or a combination of random cropping, random translation, contrast change, brightness change, transparency change, random occlusion, and random filling.
Optionally, the feature extraction layer incorporating the attention mechanism includes a plurality of channels, with the attention mechanism fused among the channels;

the extraction module 504 is further configured to:

extract the plurality of initial feature maps of different scales corresponding to the target training image through the plurality of channels and the attention mechanism fused among them.
Optionally, the feature pooling layer includes an atrous spatial pyramid pooling module;

the pooling module 506 is further configured to:

pool the plurality of initial feature maps of different scales through the atrous spatial pyramid pooling module.
Optionally, the feature fusion layer includes a feature pyramid network;

the fusion module 508 is further configured to:

fuse the plurality of enhanced feature maps of different scales through the feature pyramid network to obtain a plurality of prediction boxes and a score corresponding to each prediction box.
Optionally, the training module 510 is further configured to:
determine the prediction box with the highest score as the target prediction box.
Optionally, the training module 510 is further configured to:
determine the loss value based on the position information of the target prediction box and the position information of the annotation box corresponding to the target training image.
Optionally, the training module 510 is further configured to:
adjust the model parameters of the feature extraction layer, the feature pooling layer, and the feature fusion layer in the text detection model according to the loss value.
The text detection model training device provided by this embodiment of the present application inputs a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, and the text detection model includes a feature extraction layer, a feature pooling layer, and a feature fusion layer; extracts, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image; pools the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales; fuses the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes; and determines a target prediction box among the plurality of prediction boxes, determines a loss value based on the target prediction box and the annotation box corresponding to the target training image, and trains the text detection model according to the loss value until a training stop condition is reached. In the text detection model provided by this device, the feature extraction layer effectively strengthens the connections between features and improves accuracy on text in complex background regions; adding the feature pooling layer's network structure effectively enlarges the receptive field over the target region and reduces missed detections of small targets, enhancing the recognition accuracy and efficiency of the text detection model as a whole. In addition, a novel form of data augmentation is adopted, which alleviates the problems of insufficient manually annotated data and inaccurate recognition caused by target occlusion, while strengthening the generalization of the text detection model.
The above is a schematic scheme of a text detection model training device of this embodiment. It should be noted that, the technical solution of the text detection model training device and the technical solution of the text detection model training method belong to the same concept, and details of the technical solution of the text detection model training device which are not described in detail can be referred to the description of the technical solution of the text detection model training method.
Corresponding to the above text detection method embodiment, the present application further provides a text detection device embodiment, and fig. 6 shows a schematic structural diagram of the text detection device according to one embodiment of the present application. As shown in fig. 6, the apparatus includes:
an obtaining module 602, configured to obtain an image to be detected, where the image to be detected includes text to be detected;
The input module 604 is configured to input the image to be detected into a pre-trained text detection model, where the text detection model is obtained through training by the training method of the text detection model;
The generation module 606 is configured to cause the text detection model to generate, in response to the input image to be detected, a predicted text box corresponding to the text to be detected.
Optionally, the apparatus further comprises:
A recognition module, configured to perform text recognition on the content in the predicted text box based on the predicted text box, and obtain text content information corresponding to the text to be detected.

The text detection device provided by the present application acquires an image to be detected, wherein the image to be detected includes text to be detected; inputs the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained through the above text detection model training method; and the text detection model generates, in response to the input image to be detected, a predicted text box corresponding to the text to be detected. With this text detection device, text images with complex backgrounds can be recognized better, text content can be extracted more accurately, and the recognition effect on text pictures is improved.
The above is an exemplary scheme of a text detection device of the present embodiment. It should be noted that, the technical solution of the text detection device and the technical solution of the text detection method belong to the same conception, and details of the technical solution of the text detection device, which are not described in detail, can be referred to the description of the technical solution of the text detection method.
It should be noted that the components in the device claims should be understood as functional modules necessary for implementing the steps of the program flow or the method steps, not as actual functional divisions or separate physical limitations. Device claims defined by such a set of functional modules should be understood as a functional-module architecture that implements the solution primarily through the computer program described in the specification, not as physical devices that implement the solution primarily through hardware.
In one embodiment, the application also provides a computing device, which includes a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the text detection model training method or the text detection method when executing the instructions.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the text detection model training method or the text detection method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the text detection model training method or the text detection method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the text detection model training method or the text detection method as described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the text detection model training method or the text detection method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the text detection model training method or the text detection method.
The embodiment of the application discloses a chip which stores computer instructions which, when executed by a processor, implement the steps of the text detection model training method or the text detection method as described above.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be added to or removed from as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer-readable medium excludes electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the order of the actions described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the present application.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the application disclosed above are intended only to help explain the application. They are not intended to be exhaustive or to limit the application to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and practical application of the present application, so that those skilled in the art can best understand and use it. The application is limited only by the claims and their full scope and equivalents.
Claims (16)
1. A text detection model training method, comprising:
inputting a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer, and the feature pooling layer comprises an atrous spatial pyramid pooling module for expanding the range of the output information;
extracting a plurality of initial feature maps of different scales corresponding to the target training image through the feature extraction layer;
pooling the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales;
fusing the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes; and
determining a target prediction box among the plurality of prediction boxes, determining a loss value based on the target prediction box and the annotation box corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached.
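As a rough illustration of the training step recited in claim 1, the sketch below assumes a PyTorch model whose forward pass returns candidate boxes with scores for a single image, and uses a smooth-L1 position loss; both choices are assumptions, since the claim leaves the loss form and the batching open.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs: int, lr: float = 1e-3):
    """Sketch of claim 1's loop: predict boxes, pick the target prediction
    box, compute a position loss against the annotation box, update."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, gt_box in loader:           # gt_box: (4,) annotation box
            pred_boxes, scores = model(image)  # (N, 4) boxes, (N,) scores
            target = pred_boxes[scores.argmax()]      # target prediction box
            loss = F.smooth_l1_loss(target, gt_box)   # assumed loss form
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```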
2. The text detection model training method of claim 1, further comprising, before inputting the target training image into the text detection model:
acquiring the target training image from a preset training set.
3. The text detection model training method of claim 2, wherein acquiring the target training image from a preset training set comprises:
acquiring an initial training set, wherein the initial training set comprises a plurality of training images; and
performing data augmentation processing on the plurality of training images to generate a data-augmented training set.
4. The text detection model training method of claim 3, wherein performing data augmentation processing on the plurality of training images comprises:
performing any one of the following data augmentation operations on the plurality of training images: random cropping, random translation, contrast change, brightness change, transparency change, random occlusion, and random padding.
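A minimal sketch of the "apply one randomly chosen augmentation" idea, using Pillow; only three of the seven listed operations are shown, and the parameter ranges are assumptions.

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """Apply one randomly chosen augmentation to a training image."""
    op = random.choice(["crop", "brightness", "contrast"])
    if op == "crop":  # random cropping
        w, h = img.size
        x0, y0 = random.randint(0, w // 8), random.randint(0, h // 8)
        x1, y1 = w - random.randint(0, w // 8), h - random.randint(0, h // 8)
        return img.crop((x0, y0, x1, y1))
    if op == "brightness":  # brightness change
        return ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4))
    return ImageEnhance.Contrast(img).enhance(random.uniform(0.6, 1.4))
```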
5. The text detection model training method of claim 1, wherein the feature extraction layer comprises a plurality of channels, and a fused attention mechanism is arranged among the plurality of channels; and
wherein extracting, through the feature extraction layer, a plurality of initial feature maps of different scales corresponding to the target training image comprises:
extracting the plurality of initial feature maps of different scales corresponding to the target training image through the plurality of channels and the fused attention mechanism among the channels.
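The claim does not fix the form of the attention mechanism fused among channels; a squeeze-and-excitation style channel attention block, sketched below under that assumption, is one common realization.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style attention across channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # reweight channels before further extraction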
6. The text detection model training method of claim 1, wherein pooling the plurality of initial feature maps of different scales through the feature pooling layer comprises:
pooling the plurality of initial feature maps of different scales through the atrous spatial pyramid pooling module.
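An atrous spatial pyramid pooling module can be sketched as a set of parallel dilated convolutions whose outputs are concatenated and projected; the dilation rates below are conventional values and are assumptions, not part of the claim.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel dilated convolutions enlarge the receptive field while
    keeping the spatial size of the feature map."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]  # same spatial size
        return self.project(torch.cat(feats, dim=1))     # fuse the branches
```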
7. The text detection model training method of claim 1, wherein the feature fusion layer comprises a feature pyramid network; and
wherein fusing the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes comprises:
fusing the plurality of enhanced feature maps of different scales through the feature pyramid network to obtain a plurality of prediction boxes and a score corresponding to each prediction box.
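A minimal sketch of a feature pyramid network head that fuses multi-scale maps top-down and predicts boxes and scores per level; the lateral 1x1 projections, the shared heads, and the anchor count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNHead(nn.Module):
    def __init__(self, in_chs=(256, 512, 1024), out_ch=256, num_anchors=3):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_ch, kernel_size=1) for c in in_chs])
        self.box_head = nn.Conv2d(out_ch, num_anchors * 4, 3, padding=1)
        self.score_head = nn.Conv2d(out_ch, num_anchors, 3, padding=1)

    def forward(self, feats):  # feats: finest -> coarsest enhanced maps
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # top-down fusion
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        boxes = [self.box_head(p) for p in laterals]     # boxes per level
        scores = [self.score_head(p) for p in laterals]  # one score per box
        return boxes, scores
```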
8. The text detection model training method of claim 7, wherein determining a target prediction box among the plurality of prediction boxes comprises:
determining the prediction box with the highest score as the target prediction box.
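In code, the selection of claim 8 reduces to an argmax over the candidate scores, assuming flat tensors of N boxes and N scores:

```python
import torch

def select_target(boxes: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Return the (4,) box whose score is highest."""
    return boxes[scores.argmax()]
```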
9. The text detection model training method of claim 1, wherein determining a loss value based on the target prediction box and the annotation box corresponding to the target training image comprises:
determining the loss value based on position information of the target prediction box and position information of the annotation box corresponding to the target training image.
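The claim only requires that the loss be computed from the two boxes' position information; an IoU-based loss, shown below as one assumed choice, is a common instance.

```python
import torch

def iou_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (4,) boxes as (x1, y1, x2, y2); loss shrinks as overlap grows."""
    ix1, iy1 = torch.max(pred[0], gt[0]), torch.max(pred[1], gt[1])
    ix2, iy2 = torch.min(pred[2], gt[2]), torch.min(pred[3], gt[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return 1.0 - inter / (area_p + area_g - inter + 1e-6)
```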
10. The text detection model training method of claim 1, wherein training the text detection model according to the loss value comprises:
adjusting model parameters of the feature extraction layer, the feature pooling layer and the feature fusion layer in the text detection model according to the loss value.
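Expressed in PyTorch terms, one gradient step adjusts all three layers at once because they sit in the same computation graph; the attribute names below are illustrative assumptions.

```python
import torch

def update_all_layers(model, loss: torch.Tensor, lr: float = 1e-3) -> None:
    """One optimization step over the three layers' parameters; the
    attribute names are illustrative, not defined by the application."""
    optimizer = torch.optim.Adam([
        {"params": model.feature_extraction.parameters()},
        {"params": model.feature_pooling.parameters()},
        {"params": model.feature_fusion.parameters()},
    ], lr=lr)
    optimizer.zero_grad()
    loss.backward()   # gradients flow into all three layers
    optimizer.step()  # adjust their parameters according to the loss
```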
11. A text detection method, comprising:
acquiring an image to be detected, wherein the image to be detected contains text to be detected;
inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is trained by the training method of any one of claims 1-10, the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer, and the feature pooling layer comprises an atrous spatial pyramid pooling module for expanding the range of the output information; and
generating, by the text detection model, a predicted text box corresponding to the text to be detected in response to the image to be detected as input.
12. The text detection method of claim 11, further comprising:
performing text recognition on the content in the predicted text box based on the predicted text box; and
obtaining text content information corresponding to the text to be detected.
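A sketch of this downstream recognition step: crop the predicted text box and hand the crop to a recognizer. Here `recognize` is a hypothetical stand-in for any OCR model, not an interface defined by this application.

```python
from PIL import Image

def read_text(image: Image.Image, box, recognize) -> str:
    """Crop the predicted text box and pass it to a recognizer callable."""
    x1, y1, x2, y2 = (int(v) for v in box)  # predicted text box corners
    crop = image.crop((x1, y1, x2, y2))     # isolate the text region
    return recognize(crop)                  # text content information
```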
13. A text detection model training device, comprising:
an input module, configured to input a target training image into a text detection model, wherein the target training image is annotated with a corresponding annotation box, the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer, and the feature pooling layer comprises an atrous spatial pyramid pooling module for expanding the range of the output information;
an extraction module, configured to extract a plurality of initial feature maps of different scales corresponding to the target training image through the feature extraction layer;
a pooling module, configured to pool the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales;
a fusion module, configured to fuse the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes; and
a training module, configured to determine a target prediction box among the plurality of prediction boxes, determine a loss value based on the target prediction box and the annotation box corresponding to the target training image, and train the text detection model according to the loss value until a training stop condition is reached.
14. A text detection device, comprising:
an acquisition module, configured to acquire an image to be detected, wherein the image to be detected contains text to be detected;
an input module, configured to input the image to be detected into a pre-trained text detection model, wherein the text detection model is trained by the training method of any one of claims 1-10, the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer, and the feature pooling layer comprises an atrous spatial pyramid pooling module for expanding the range of the output information; and
a generation module, configured to generate, through the text detection model, a predicted text box corresponding to the text to be detected in response to the image to be detected as input.
15. A computing device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method of any one of claims 1-10 or 11-12.
16. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-10 or 11-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110109985.1A CN112818975B (en) | 2021-01-27 | 2021-01-27 | Text detection model training method and device, text detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110109985.1A CN112818975B (en) | 2021-01-27 | 2021-01-27 | Text detection model training method and device, text detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112818975A (en) | 2021-05-18
CN112818975B (en) | 2024-09-24
Family
ID=75859672
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110109985.1A Active CN112818975B (en) | 2021-01-27 | 2021-01-27 | Text detection model training method and device, text detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112818975B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239925A (en) * | 2021-05-24 | 2021-08-10 | 北京有竹居网络技术有限公司 | Text detection model training method, text detection method, device and equipment |
CN113486890A (en) * | 2021-06-16 | 2021-10-08 | 湖北工业大学 | Text detection method based on attention feature fusion and cavity residual error feature enhancement |
CN113378832B (en) * | 2021-06-25 | 2024-05-28 | 北京百度网讯科技有限公司 | Text detection model training method, text prediction box method and device |
CN113705361A (en) * | 2021-08-03 | 2021-11-26 | 北京百度网讯科技有限公司 | Method and device for detecting model in living body and electronic equipment |
CN113781409B (en) * | 2021-08-25 | 2023-10-20 | 五邑大学 | Bolt loosening detection method, device and storage medium |
CN113837257B (en) * | 2021-09-15 | 2024-05-24 | 支付宝(杭州)信息技术有限公司 | Target detection method and device |
CN114067237A (en) * | 2021-10-28 | 2022-02-18 | 清华大学 | Video data processing method, device and equipment |
CN114020881B (en) * | 2022-01-10 | 2022-05-27 | 珠海金智维信息科技有限公司 | Topic positioning method and system |
CN114359932B (en) * | 2022-01-11 | 2023-05-23 | 北京百度网讯科技有限公司 | Text detection method, text recognition method and device |
CN114724162A (en) * | 2022-03-15 | 2022-07-08 | 平安科技(深圳)有限公司 | Training method and device of text recognition model, computer equipment and storage medium |
CN115187783B (en) * | 2022-09-09 | 2022-12-27 | 之江实验室 | Multi-task hybrid supervision medical image segmentation method and system based on federal learning |
CN115797706B (en) * | 2023-01-30 | 2023-07-14 | 粤港澳大湾区数字经济研究院(福田) | Target detection method, target detection model training method and related device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110715A (en) * | 2019-04-30 | 2019-08-09 | 北京金山云网络技术有限公司 | Text detection model training method, text filed, content determine method and apparatus |
CN111950528A (en) * | 2020-09-02 | 2020-11-17 | 北京猿力未来科技有限公司 | Chart recognition model training method and device |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11120314B2 (en) * | 2017-04-07 | 2021-09-14 | Intel Corporation | Joint training of neural networks using multi scale hard example mining |
CN108549893B (en) * | 2018-04-04 | 2020-03-31 | 华中科技大学 | End-to-end identification method for scene text with any shape |
CN108764370B (en) * | 2018-06-08 | 2021-03-12 | Oppo广东移动通信有限公司 | Image processing method, image processing device, computer-readable storage medium and computer equipment |
US20200364624A1 (en) * | 2019-05-16 | 2020-11-19 | Retrace Labs | Privacy Preserving Artificial Intelligence System For Dental Data From Disparate Sources |
KR20190103088A (en) * | 2019-08-15 | 2019-09-04 | 엘지전자 주식회사 | Method and apparatus for recognizing a business card using federated learning |
CN110472688A (en) * | 2019-08-16 | 2019-11-19 | 北京金山数字娱乐科技有限公司 | The method and device of iamge description, the training method of image description model and device |
CN110674804A (en) * | 2019-09-24 | 2020-01-10 | 上海眼控科技股份有限公司 | Text image detection method and device, computer equipment and storage medium |
CN111079632A (en) * | 2019-12-12 | 2020-04-28 | 上海眼控科技股份有限公司 | Training method and device of text detection model, computer equipment and storage medium |
CN111723841A (en) * | 2020-05-09 | 2020-09-29 | 北京捷通华声科技股份有限公司 | Text detection method and device, electronic equipment and storage medium |
CN111626350B (en) * | 2020-05-25 | 2021-05-18 | 腾讯科技(深圳)有限公司 | Target detection model training method, target detection method and device |
CN111860214A (en) * | 2020-06-29 | 2020-10-30 | 北京金山云网络技术有限公司 | Face detection method, training method and device of model thereof and electronic equipment |
CN111767883B (en) * | 2020-07-07 | 2024-04-12 | 北京猿力未来科技有限公司 | Question correction method and device |
CN111814736B (en) * | 2020-07-23 | 2023-12-29 | 上海东普信息科技有限公司 | Express delivery face list information identification method, device, equipment and storage medium |
CN111783749A (en) * | 2020-08-12 | 2020-10-16 | 成都佳华物链云科技有限公司 | Face detection method and device, electronic equipment and storage medium |
CN112101165B (en) * | 2020-09-07 | 2022-07-15 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
CN112200045B (en) * | 2020-09-30 | 2024-03-19 | 华中科技大学 | Remote sensing image target detection model establishment method based on context enhancement and application |
CN112183549B (en) * | 2020-10-26 | 2022-05-27 | 公安部交通管理科学研究所 | Foreign driving license layout character positioning method based on semantic segmentation |
2021-01-27 CN CN202110109985.1A patent granted as CN112818975B (Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110715A (en) * | 2019-04-30 | 2019-08-09 | 北京金山云网络技术有限公司 | Text detection model training method, text filed, content determine method and apparatus |
CN111950528A (en) * | 2020-09-02 | 2020-11-17 | 北京猿力未来科技有限公司 | Chart recognition model training method and device |
Also Published As
Publication number | Publication date |
---|---|
CN112818975A (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112818975B (en) | Text detection model training method and device, text detection method and device | |
US11657602B2 (en) | Font identification from imagery | |
CN111681273B (en) | Image segmentation method and device, electronic equipment and readable storage medium | |
US20190108411A1 (en) | Image processing method and processing device | |
CN109086811B (en) | Multi-label image classification method and device and electronic equipment | |
CN111104962A (en) | Semantic segmentation method and device for image, electronic equipment and readable storage medium | |
CN111476284A (en) | Image recognition model training method, image recognition model training device, image recognition method, image recognition device and electronic equipment | |
JP7559063B2 (en) | FACE PERSHING METHOD AND RELATED DEVICE | |
CN111950528B (en) | Graph recognition model training method and device | |
CN112651978A (en) | Sublingual microcirculation image segmentation method and device, electronic equipment and storage medium | |
CN110136130A (en) | A kind of method and device of testing product defect | |
CN110751069A (en) | Face living body detection method and device | |
CN113936195B (en) | Sensitive image recognition model training method and device and electronic equipment | |
CN114663904A (en) | PDF document layout detection method, device, equipment and medium | |
CN112348028A (en) | Scene text detection method, correction method, device, electronic equipment and medium | |
CN113435254A (en) | Sentinel second image-based farmland deep learning extraction method | |
CN115272691A (en) | Training method, recognition method and equipment for steel bar binding state detection model | |
CN109977875A (en) | Gesture identification method and equipment based on deep learning | |
CN115908363A (en) | Tumor cell counting method, device, equipment and storage medium | |
CN110991303A (en) | Method and device for positioning text in image and electronic equipment | |
CN113763315B (en) | Slide image information acquisition method, device, equipment and medium | |
CN110287981A (en) | Conspicuousness detection method and system based on biological enlightening representative learning | |
CN108428234B (en) | Interactive segmentation performance optimization method based on image segmentation result evaluation | |
CN113177511A (en) | Rotating frame intelligent perception target detection method based on multiple data streams | |
CN110852102B (en) | Chinese part-of-speech tagging method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||