CN113326766B - Training method and device of text detection model, text detection method and device
- Publication number: CN113326766B (application CN202110587997.5A)
- Authority: CN (China)
- Prior art keywords: text, image, feature, main body, detection
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
        - G06V30/10—Character recognition
        - G06V30/40—Document-oriented image-based pattern recognition
          - G06V30/41—Analysis of document content
            - G06V30/413—Classification of content, e.g. text, photographs or tables
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F18/00—Pattern recognition
        - G06F18/20—Analysing
          - G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
            - G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
          - G06F18/25—Fusion techniques
            - G06F18/253—Fusion techniques of extracted features
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/04—Architecture, e.g. interconnection topology
            - G06N3/045—Combinations of networks
Abstract
The disclosure provides a training method for a text detection model, and relates to the field of artificial intelligence, in particular to computer vision and deep learning. The method comprises the following steps: acquiring a sample image with a text labeling box, wherein the text labeling box is used for labeling a target text on a text body contained in the sample image; inputting the sample image into an initial text detection model to perform the following operations: performing main body detection to obtain position information of the text main body; performing feature extraction to obtain a first sample feature map; performing feature fusion based on the position information and the first sample feature map to obtain a second sample feature map and angle information of each pixel point in the text labeling box; and training the initial text detection model based on the second sample feature map, the angle information and a predetermined loss function. The disclosure also provides a text detection method and apparatus and a training apparatus for the text detection model.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the fields of computer vision and deep learning, and more particularly to a training method for a text detection model, a text detection method, a training apparatus for a text detection model, a text detection apparatus, an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product.
Background
OCR (Optical Character Recognition) refers to the process by which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shape, and then translates that shape into computer-readable text using a character recognition method.
Disclosure of Invention
The present disclosure provides a training method for a text detection model, a text detection method, corresponding apparatuses, an electronic device, a storage medium and a computer program product.
According to an aspect of the present disclosure, there is provided a training method of a text detection model, including: acquiring a sample image with a text labeling box, wherein the text labeling box is used for labeling a target text on a text body contained in the sample image; inputting the sample image into an initial text detection model to perform the following operations: performing main body detection to obtain the position information of the text main body; performing feature extraction to obtain a first sample feature map; performing feature fusion based on the position information and the first sample feature map to obtain a second sample feature map and angle information of each pixel point in the text labeling box; and training the initial text detection model based on the second sample feature map, the angle information and a predetermined loss function to obtain a target text detection model.
According to another aspect of the present disclosure, there is provided a text detection method including: performing subject detection on the target image by using a subject detection network to obtain the position information of the text subject contained in the image; performing feature extraction on the target image to obtain a first feature image; performing feature fusion based on the position information and the first feature image to obtain a second feature image; and performing text detection based on the second feature image by using a text detection model trained by the method of the embodiments of the present disclosure, to obtain a corresponding text prediction box.
According to another aspect of the present disclosure, there is provided a training apparatus of a text detection model, including: an acquisition module, configured to acquire a sample image with a text labeling box, wherein the text labeling box is used for labeling a target text on a text body contained in the sample image; and an input module, configured to input the sample image into the initial text detection model so as to perform corresponding operations through the following units: a main body detection unit, configured to perform main body detection to obtain the position information of the text main body; a feature extraction unit, configured to perform feature extraction to obtain a first sample feature map; a feature fusion unit, configured to perform feature fusion based on the position information and the first sample feature map to obtain a second sample feature map and angle information of each pixel point in the text labeling box; and a model training unit, configured to train the initial text detection model based on the second sample feature map, the angle information, and a predetermined loss function, to obtain a target text detection model.
According to another aspect of the present disclosure, there is provided a text detection apparatus including: a main body detection module, configured to perform main body detection on the target image by using a main body detection network to obtain the position information of the text main body contained in the image; a feature extraction module, configured to perform feature extraction on the target image to obtain a first feature image; a feature fusion module, configured to perform feature fusion based on the position information and the first feature image to obtain a second feature image; and a text detection module, configured to perform text detection based on the second feature image by using a text detection model trained by the apparatus of the embodiments of the present disclosure, to obtain a corresponding text prediction box.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods according to embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to an embodiment of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 illustrates a system architecture suitable for embodiments of the present disclosure;
FIG. 2 illustrates a flow chart of a training method of a text detection model according to an embodiment of the present disclosure;
FIGS. 3A, 3B and 3C respectively illustrate a sample image used in a training method of a text detection model according to an embodiment of the present disclosure, the image after subject detection of the sample image, and a training result obtained from the sample image;
FIG. 4 illustrates a flow chart of a text detection method according to an embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of a text detection model according to an embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of a training device of a text detection model, according to an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of a text detection apparatus, according to an embodiment of the present disclosure;
FIG. 8 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Target detection is needed in the OCR process; for example, an EAST (Efficient and Accurate Scene Text detector) model can be adopted for target detection. However, the EAST model is not well suited to detecting small text targets against a large background, such as extracting a document number from a document photograph taken against a desktop background.
At present, text detection of a small document target against a large background can be handled by adding a pre-detection stage: the image of the target area is detected first, and text detection is then performed on it. However, this scheme requires two models, a target detection model and a text detection model, and is therefore not an end-to-end solution; it is inconvenient, noticeably increases time consumption and resource occupation, and is not well suited to engineering applications.
The present disclosure will be described in detail below in connection with specific embodiments.
A system architecture suitable for the training method of the text detection model and/or the text detection method of the embodiments of the present disclosure is described below.
Fig. 1 illustrates a system architecture suitable for embodiments of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other environments or scenarios.
As shown in fig. 1, a system architecture 100 may include a client 101 and a server 102.
The client 101 may upload a large duty ratio image containing the target text to the server 102. A large duty ratio image is one in which the image area where the target text is located occupies a small proportion of the whole image, while the background occupies a large proportion. The large duty ratio image may be, for example, a photograph of a document or of a ticket taken against a desktop background.
The server 102 is provided with a text detection model that can identify the target text in the large duty ratio image and extract the corresponding text data. For example, the server 102 may extract a document number from a document photograph taken against a desktop background, the payee name from a bill photograph taken against a desktop background, or the passenger name from a train ticket photograph taken against a desktop background.
It should be understood that the number of clients and servers in fig. 1 is merely illustrative. There may be any number of clients and servers, as desired for an implementation.
Application scenarios of the training method or the text detection method suitable for the text detection model of the embodiment of the present disclosure are described below.
It should be appreciated that the training method or text detection method of the text detection model of embodiments of the present disclosure may be applied to any image recognition scenario, and may be used to recognize text information in a large duty ratio image.
According to an embodiment of the present disclosure, the present disclosure provides a training method of a text detection model.
Fig. 2 illustrates a flowchart of a training method of a text detection model according to an embodiment of the present disclosure.
As shown in fig. 2, the training method 200 of the text detection model may include: operations S210 to S260.
In operation S210, a sample image with a text labeling box is acquired, where the text labeling box is used to label a target text on a text body included in the sample image.
In operation S220, the sample image is input into an initial text detection model.
In operation S230, a subject detection is performed to obtain the location information of the text subject.
In operation S240, feature extraction is performed to obtain a first sample feature map.
In operation S250, feature fusion is performed based on the position information and the first sample feature map, so as to obtain a second sample feature map and angle information of each pixel point in the text label box.
In operation S260, the initial text detection model is trained based on the second sample feature map, the angle information, and a predetermined loss function, to obtain a target text detection model.
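For orientation, the sketch below shows how operations S230 to S260 might be wired into a single training step. It is a minimal, hypothetical PyTorch-style sketch: the model interface, the tensor names, and the use of a smooth L1 geometry term alongside the angle term are illustrative assumptions, not the patent's prescribed implementation.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_image, gt_geometry, gt_angles, text_mask):
    """One hypothetical training step; `model` is assumed to run operations
    S230-S250 internally and to return the predicted geometry map and the
    per-pixel angle map derived from the second sample feature map."""
    optimizer.zero_grad()
    pred_geometry, pred_angles = model(sample_image)          # S230-S250
    # S260: a geometry term on the predicted text box (an assumption here)
    # plus the point-constraint angle loss described later in the text,
    # L1 = 1 - cos(theta' - theta), restricted to pixels inside the box.
    geometry_loss = F.smooth_l1_loss(pred_geometry * text_mask,
                                     gt_geometry * text_mask)
    angle_loss = ((1.0 - torch.cos(pred_angles - gt_angles)) * text_mask).mean()
    loss = geometry_loss + angle_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```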
In one embodiment of the present disclosure, in operation S210, a sample image with a text annotation box may be obtained from historical data. Alternatively, new large duty ratio images may be collected and manually annotated to obtain sample images.
For example, a large duty ratio photograph may be screened from a database storing annotated train ticket photographs, thereby obtaining a sample image.
For example, a train ticket photograph with a large duty ratio may be collected, and the image area containing the passenger name may be manually annotated to obtain a sample image.
In operation S230, the position information may be obtained using a body detection network; from this position information, the location of the text body can be determined.
For example, when the image area where the text body is located is a rectangular frame, the coordinates of the center point and the coordinates of the vertex of the rectangular frame may be obtained as the position information.
In operation S240, the sample image may be subjected to multi-stage convolution to obtain a multi-dimensional feature map, and the multi-dimensional feature map is used as the first sample feature map.
Illustratively, the sample image is downsampled and then subjected to multi-stage convolution processing, and the obtained final multi-dimensional feature map is the first sample feature map.
Illustratively, the sample image is downsampled and then subjected to a multi-stage convolution process, with any one or more of a plurality of multi-dimensional feature maps generated at a plurality of stages being the first sample feature map.
In operation S250, the objects of feature fusion may be the position information and the final multi-dimensional feature map, or the position information and one or more of the multi-dimensional feature maps.
Illustratively, the first sample feature map is the n-th multi-dimensional feature map obtained after n stages of convolution. The sample image undergoes n stages of convolution processing, and accordingly n-1 stages of feature fusion can be carried out, where n ≥ 2 and n is a positive integer. The n-th multi-dimensional feature map is fused with the position information to obtain a position feature map. In the first stage of feature fusion, the position feature map is spliced with the multi-dimensional feature map obtained by the (n-1)-th stage of convolution to obtain a first feature fusion map. In the i-th stage of feature fusion, the (i-1)-th feature fusion map (obtained from the (i-1)-th stage of feature fusion) is upsampled and then spliced with the (n-i)-th multi-dimensional feature map (obtained from the (n-i)-th stage of convolution) to obtain the i-th feature fusion map, where 1 < i ≤ n-1 and i is a positive integer.
Illustratively, the first sample feature map consists of the n multi-dimensional feature maps obtained after n stages of convolution. The sample image undergoes n stages of convolution processing, and accordingly n-1 stages of feature fusion are required, where n ≥ 2 and n is a positive integer. The position information is fused with each multi-dimensional feature map to obtain n position feature maps. In the first stage of feature fusion, the objects of fusion are: the upsampled position feature map obtained by fusing the multi-dimensional feature map of the n-th convolution stage with the position information, and the position feature map obtained by fusing the multi-dimensional feature map of the (n-1)-th convolution stage with the position information. The output of the first stage of feature fusion is the first feature fusion map. In the i-th stage of feature fusion, the objects of fusion are: the upsampled feature fusion map obtained from the (i-1)-th stage of feature fusion, and the position feature map obtained by fusing the multi-dimensional feature map of the (n-i)-th convolution stage with the position information. The output of the i-th stage of feature fusion is the i-th feature fusion map, where 1 < i ≤ n-1 and i is a positive integer.
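A compact sketch of this stage-wise fusion is given below, assuming the position information has been turned into a binary mask and that each multi-dimensional feature map is fused with it by element-wise multiplication; the function and tensor names are illustrative, and the 1×1/3×3 convolutions used in the detailed example later in the text are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def fuse_features(feature_maps, position_mask):
    """Hypothetical n-1 stage fusion.

    feature_maps: list of n multi-dimensional feature maps, finest first,
                  coarsest (n-th) last.
    position_mask: binary map built from the text body position information.
    """
    # Fuse the position information with every multi-dimensional feature map.
    position_maps = [
        fm * F.interpolate(position_mask, size=fm.shape[-2:], mode="nearest")
        for fm in feature_maps
    ]
    fused = position_maps[-1]  # position feature map of the n-th (coarsest) stage
    for skip in reversed(position_maps[:-1]):
        # Upsample the previous result and splice it with the position
        # feature map of the next finer stage.
        fused = F.interpolate(fused, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
        fused = torch.cat([fused, skip], dim=1)
    return fused
```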
In one embodiment of the present disclosure, in operation S250, when feature fusion is performed, the pixel values of the pixel points outside the region corresponding to the position information on the first sample feature image may be changed.
For example, when an annotated train ticket photograph is taken as a sample, the sample image is subjected to body detection to determine the position information and its corresponding region. When feature fusion is performed, the pixel values of pixel points outside the region corresponding to the position information may be set to zero.
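One simple way to realize this zeroing is through a binary position mask like the one assumed in the fusion sketch above; the following is a minimal, hypothetical construction (integer pixel coordinates assumed).

```python
import torch

def make_position_mask(image_height, image_width, subject_box):
    """Hypothetical mask derived from the subject prediction box
    (x0, y0, x1, y1): 1 inside the box, 0 outside, so that fusion sets the
    pixel values outside the region corresponding to the position
    information to zero."""
    x0, y0, x1, y1 = subject_box
    mask = torch.zeros(1, 1, image_height, image_width)
    mask[..., y0:y1, x0:x1] = 1.0
    return mask
```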
In operation S250, after feature fusion, angle information of each pixel point in the text label box and vertex coordinates of the text label box may be obtained based on the result of feature fusion.
The angle information of each pixel point in the text labeling frame can be obtained according to the n-1 feature fusion diagram, and meanwhile, the coordinates of four vertexes of the text labeling frame can be obtained.
In operation S260, training may be ended after the loss derived from the loss function is less than a preset loss threshold.
For example, the angle loss may be obtained according to a predetermined loss function, and training may be ended after the angle loss is less than a predetermined loss threshold.
According to the embodiments of the present disclosure, main body detection is performed first, followed by feature extraction and then feature fusion, so that when the text detection model is trained, the attention of the network to the text main body is enhanced and the training result is more accurate.
As an optional embodiment, in response to the ratio of the image area in the subject prediction frame to the entire sample image being smaller than a first preset value, feature fusion is performed based on the position information and the first sample feature map, where the subject prediction frame is determined based on the position information.
In the embodiment of the disclosure, a subject prediction box can be obtained by using a subject detection network. The frame lines of the subject prediction frame may pass through boundary points of the region where the text subject is located, and each frame line of the subject prediction frame is parallel to one boundary line of the sample image.
For example, when the region where the text body is located is a rectangular frame, the frame lines of the constructed body prediction frame may pass through the four vertices of the rectangular frame, and each frame line of the body prediction frame is parallel to one boundary line of the sample image. The coordinates of the center point of the body prediction frame, its width and its height are detected, from which the coordinates of its four vertices are obtained.
It is detected whether the sample image is a large duty ratio image; if so, feature fusion is performed based on the position information and the first sample feature map, so that the method can be used for text recognition in large duty ratio images. The angle information of the pixel points in the target text area can be obtained based on the body prediction frame, so the method is particularly suitable for detection scenes in which the text has a certain inclination angle.
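The "large duty ratio" test above amounts to comparing the area of the subject prediction box with the area of the whole image; a minimal sketch follows, in which the box representation and the 0.5 threshold are purely illustrative assumptions (the patent only requires some first preset value).

```python
def is_large_duty_ratio(subject_box, image_width, image_height,
                        first_preset_value=0.5):
    """Return True when the image area inside the subject prediction box
    (x0, y0, x1, y1) occupies less than the first preset value of the
    whole sample image."""
    x0, y0, x1, y1 = subject_box
    box_area = max(x1 - x0, 0) * max(y1 - y0, 0)
    return box_area / float(image_width * image_height) < first_preset_value
```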
Fig. 3A, 3B, and 3C are diagrams illustrating a sample image of a training method of a text detection model according to an embodiment of the present disclosure, an image after subject detection according to the sample image, and a training result obtained according to the sample image, respectively.
As shown in fig. 3A, the sample image 301 is a large duty ratio image, which may be, for example, a photograph of a train ticket against a desktop background. There is at least one sample image 301. An image area 302 corresponding to the train ticket is provided in the sample image 301, and a text annotation box 303 is provided in the image area 302.
As shown in fig. 3B, after the sample image 301 is subject detected, a subject prediction box 305 is formed to indicate the location of the text subject (the location of the train ticket). The angle of the pixel point 304 within the text label box 303 is θ (or 90 ° - θ). The angle of the pixel point may be an included angle between a straight line passing through the pixel point and parallel to one side of the text label frame and a boundary line of the main body prediction frame.
As shown in fig. 3C, the subject-detected sample image 301 is subjected to feature extraction and feature fusion to form a text prediction box 306. The angle of the pixel 307 within the text prediction box 306 is θ '(or 90 ° - θ'). The angle of the pixel point may be an angle between a straight line passing through the pixel point and parallel to one side of the text prediction box and a boundary line of the main body prediction box.
As an alternative embodiment, the above-mentioned loss function includes: a point constraint loss function L = f(θ' - θ), where θ represents the angle of each pixel point in the text label box; and θ' represents the angle of each pixel point in the text prediction box corresponding to the text label box.
By way of example only, and not by way of limitation, the point constraint loss function may be L1 = 1 - cos(θ' - θ).
In this way, attention is paid to the angle of every pixel point, not just the angles of the four vertices, so that the training result is more accurate.
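A per-pixel sketch of this loss is shown below; the tensor names and the masked averaging over the text box are illustrative assumptions, and angles are assumed to be given in radians.

```python
import torch

def point_constraint_loss(pred_angles, gt_angles, box_mask):
    """Point constraint loss L1 = 1 - cos(theta' - theta), averaged over
    the pixels inside the text annotation box (box_mask == 1)."""
    loss_map = 1.0 - torch.cos(pred_angles - gt_angles)
    return (loss_map * box_mask).sum() / box_mask.sum().clamp(min=1.0)
```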
As an alternative embodiment, the subject detection network is used to perform subject detection to obtain the location information of the text subject, where the subject detection network is a network branch in the initial text detection model.
Text data can be obtained by inputting the large duty ratio image into the text detection model of the embodiments of the present disclosure alone; end-to-end text detection is achieved without feeding the image into two separate models, so time consumption and resource occupation are reduced.
According to an embodiment of the disclosure, the disclosure further provides a text detection method.
Fig. 4 illustrates a flow chart of a text detection method according to an embodiment of the present disclosure.
As shown in fig. 4, the text detection method 400 may include operations S410 to S440.
In operation S410, subject detection is performed on the target image using the subject detection network, and location information of the text subject included in the image is obtained.
In operation S420, feature extraction is performed on the target image, so as to obtain a first feature image.
In operation S430, feature fusion is performed based on the position information and the first feature image, so as to obtain a second feature image.
In operation S440, text detection is performed based on the second feature image and using the text detection model trained by the method according to the embodiment of the present disclosure, so as to obtain a corresponding text prediction box.
In one embodiment of the present disclosure, in operation S410, a subject detection network may be used to perform subject detection on the target image to obtain a subject prediction frame, and the boundary point coordinates of the subject prediction frame are then detected as the position information.
For example, when the target image is a train ticket photograph taken against a desktop background, the body detection network may establish a body prediction frame around the train ticket, the four sides of which pass through the four vertices of the train ticket and are each parallel to one boundary line of the target image. Then the coordinates of the center point of the body prediction frame, its width and its height are detected, from which the coordinates of its four vertices are obtained.
In operation S420, the target image may be subjected to multi-stage convolution to obtain a multi-dimensional feature map, and the multi-dimensional feature map is used as the first feature image.
By way of example, the target image may be subjected to an n-stage convolution process. The final multidimensional feature map may be used as the first feature image, or one or more of the feature maps of different dimensions obtained at each stage may be used as the first feature image.
In operation S430, the object of feature fusion may be the position information and the final multi-dimensional feature map obtained by multi-stage convolution of the target image, or the position information and one or more of the feature maps obtained by the multi-stage convolution.
Illustratively, the first feature image is the n-th multi-dimensional feature map obtained after n stages of convolution. The target image undergoes n stages of convolution processing, and accordingly n-1 stages of feature fusion are required, where n ≥ 2 and n is a positive integer. The n-th multi-dimensional feature map is fused with the position information to obtain a position feature map. In the first stage of feature fusion, the position feature map is spliced with the multi-dimensional feature map obtained by the (n-1)-th stage of convolution to obtain a first feature fusion map. In the i-th stage of feature fusion, the (i-1)-th feature fusion map (obtained from the (i-1)-th stage of feature fusion) is upsampled and then spliced with the (n-i)-th multi-dimensional feature map (obtained from the (n-i)-th stage of convolution) to obtain the i-th feature fusion map, where 1 < i ≤ n-1 and i is a positive integer.
Illustratively, the first feature image consists of the n multi-dimensional feature maps obtained after n stages of convolution. The target image undergoes n stages of convolution processing, and accordingly n-1 stages of feature fusion are required, where n ≥ 2 and n is a positive integer. The position information is fused with each multi-dimensional feature map to obtain a plurality of position feature maps. In the first stage of feature fusion, the objects of fusion are: the upsampled position feature map obtained by fusing the multi-dimensional feature map of the n-th convolution stage with the position information, and the position feature map obtained by fusing the multi-dimensional feature map of the (n-1)-th convolution stage with the position information. The output of the first stage of feature fusion is the first feature fusion map. In the i-th stage of feature fusion, the objects of fusion are: the upsampled feature fusion map obtained from the (i-1)-th stage of feature fusion, and the position feature map obtained by fusing the multi-dimensional feature map of the (n-i)-th convolution stage with the position information. The output of the i-th stage of feature fusion is the i-th feature fusion map, where 1 < i ≤ n-1 and i is a positive integer.
In operation S430, when feature fusion is performed with the first feature image based on the position information, pixel values of pixel points on the first feature image outside the region corresponding to the position information may be changed.
Illustratively, the target image has a subject prediction box after subject detection. When feature fusion is performed, the pixel value of the pixel point of the first feature image outside the main body prediction frame may be set to zero.
In operation S440, a corresponding text prediction box is obtained, and the text prediction box may be adjusted according to the angle information, or text information in a corresponding region of the text prediction box may be directly identified.
For example, when the target image is a train ticket photograph taken against a desktop background and the text in the image area corresponding to the text prediction box is "Zhang San" (the name of the ticket holder), the text prediction box can be adjusted to be horizontal before OCR is performed, or OCR can be performed directly, so that the name of the ticket holder is recognized.
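Adjusting an inclined prediction box to the horizontal before OCR can be done with a simple rotation around the box center; a minimal OpenCV sketch follows, in which the function and parameter names are illustrative and the angle is assumed to come from the model's per-pixel angle output.

```python
import cv2

def deskew_for_ocr(image, box_center, angle_degrees):
    """Rotate the image so that the text prediction box becomes horizontal
    before OCR is applied to the corresponding region."""
    height, width = image.shape[:2]
    rotation = cv2.getRotationMatrix2D(box_center, angle_degrees, 1.0)
    return cv2.warpAffine(image, rotation, (width, height))
```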
By the embodiment of the disclosure, the text detection model performs text detection based on the second characteristic image, so that the attention of the model to a text main body in the image can be enhanced.
As an alternative embodiment, the subject detection network is a network branch in the text detection model.
The text data can be obtained by inputting the images with large duty ratio into the text detection model, so that the end-to-end text recognition can be realized, and the time consumption and the resource occupation can be reduced.
As an optional embodiment, the target image includes a text body, and a ratio of the text body in the whole target image is smaller than a second preset value.
The position information and the first feature image are subjected to feature fusion, so that the position of the text body can be determined, and the method can be used for text recognition in large duty ratio images.
As an alternative embodiment, the text on the text body has a preset inclination angle.
The angle information of the pixel points of the text on the text body can be obtained based on the position information of the text body, so that the method is particularly suitable for detection scenes with a certain inclination angle of the text.
Fig. 5 illustrates a schematic diagram of a text detection model according to an embodiment of the present disclosure.
As shown in fig. 5, the text detection model includes a main body detection branch 501, a feature extraction branch 502, a feature fusion branch 503, and an output branch 504.
The main body detection branch 501 is a network branch of the text detection model 500, and is used for performing main body detection to obtain position information of a text main body in an input image.
And a feature extraction branch 502, configured to perform feature extraction, so as to obtain a multi-dimensional feature map. The image may be downsampled and then subjected to multiple stages of feature extraction to obtain multiple multi-dimensional feature maps. Meanwhile, the image adopted by the main body detection branch in main body detection is also the downsampled image.
Illustratively, the image is downsampled to a 7×7 grid; one-stage convolution then generates a 64-dimensional feature map with a downsampling multiple of 2; two-stage convolution generates a 128-dimensional feature map with a downsampling multiple of 2; three-stage convolution generates a 256-dimensional feature map with a downsampling multiple of 2; and finally four-stage convolution generates a 384-dimensional feature map with a downsampling multiple of 2.
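The sketch below mirrors these four stages; the exact layer composition inside each stage is not specified in the text, so the stride-2 convolution plus 3×3 convolution used here is an illustrative assumption, and only the channel dimensions (64, 128, 256, 384) and the per-stage downsampling multiple of 2 are taken from the example above.

```python
import torch.nn as nn

def conv_stage(in_channels, out_channels):
    """One hypothetical backbone stage: a stride-2 convolution
    (downsampling multiple of 2) followed by a 3x3 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, 3, padding=1),
        nn.ReLU(inplace=True),
    )

# Four stages producing 64-, 128-, 256- and 384-dimensional feature maps.
feature_extraction_branch = nn.ModuleList([
    conv_stage(3, 64),
    conv_stage(64, 128),
    conv_stage(128, 256),
    conv_stage(256, 384),
])
```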
And a feature fusion branch 503, configured to perform feature fusion based on the location information and the multidimensional feature map. When the features are fused, a final multidimensional feature map obtained after feature extraction of a plurality of stages can be fused with the position information, and each multidimensional feature map obtained by feature extraction of a plurality of stages can be fused with the position information.
Illustratively, the target image is downsampled to a 7×7 grid and then subjected to four-stage convolution processing to obtain a 384-dimensional feature map. The position information may be fused with the 384-dimensional feature map to obtain a first position feature map. One-stage feature fusion is then performed: the first position feature map is upsampled (upsampling multiple of 2) and spliced with the 256-dimensional feature map to obtain a first feature splicing map, which is then subjected to a 1×1 convolution and a 3×3 convolution in sequence to obtain a 128-dimensional feature fusion map. Two-stage feature fusion is then performed: the 128-dimensional feature fusion map is upsampled (upsampling multiple of 2) and spliced with the 128-dimensional feature map to obtain a second feature splicing map, which is then subjected to a 1×1 convolution and a 3×3 convolution in sequence to obtain a 64-dimensional feature fusion map. Three-stage feature fusion is then performed: the 64-dimensional feature fusion map is upsampled (upsampling multiple of 2) and spliced with the 64-dimensional feature map to obtain a third feature splicing map, which is then subjected to a 1×1 convolution and a 3×3 convolution in sequence to obtain a 32-dimensional feature fusion map. Finally, the 32-dimensional feature fusion map is convolved by 3×3 to obtain the output feature map of the feature fusion branch 503.
Illustratively, the target image is downsampled to a 7×7 grid and then subjected to four-stage convolution processing to obtain a 64-dimensional feature map, a 128-dimensional feature map, a 256-dimensional feature map and a 384-dimensional feature map. The position information may be fused with the 384-dimensional, 256-dimensional, 128-dimensional and 64-dimensional feature maps to obtain a first, second, third and fourth position feature map, respectively. One-stage feature fusion is then performed: the first position feature map is upsampled (upsampling multiple of 2) and spliced with the second position feature map to obtain a first feature splicing map, which is then subjected to a 1×1 convolution and a 3×3 convolution in sequence to obtain a 128-dimensional feature fusion map. Two-stage feature fusion is then performed: the 128-dimensional feature fusion map is upsampled (upsampling multiple of 2) and spliced with the third position feature map to obtain a second feature splicing map, which is then subjected to a 1×1 convolution and a 3×3 convolution in sequence to obtain a 64-dimensional feature fusion map. Three-stage feature fusion is then performed: the 64-dimensional feature fusion map is upsampled (upsampling multiple of 2) and spliced with the fourth position feature map to obtain a third feature splicing map, which is then subjected to a 1×1 convolution and a 3×3 convolution in sequence to obtain a 32-dimensional feature fusion map. Finally, the 32-dimensional feature fusion map is convolved by 3×3 to obtain the output feature map of the feature fusion branch 503.
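A single fusion stage of this pipeline might look like the following sketch; the module layout is an illustrative assumption, while the channel dimensions in the `fusion_stages` list are taken from the example above (384+256 to 128, 128+128 to 64, 64+64 to 32).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStage(nn.Module):
    """One feature fusion stage: upsample by 2, splice with the
    (position-fused) feature map of the next finer stage, then apply a
    1x1 convolution followed by a 3x3 convolution."""
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels + skip_channels, out_channels, 1)
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, coarse, skip):
        coarse = F.interpolate(coarse, scale_factor=2, mode="bilinear",
                               align_corners=False)
        spliced = torch.cat([coarse, skip], dim=1)
        return self.smooth(self.reduce(spliced))

fusion_stages = nn.ModuleList([
    FusionStage(384, 256, 128),
    FusionStage(128, 128, 64),
    FusionStage(64, 64, 32),
])
```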
The output branch 504 outputs the text detection result, i.e. the text feature map, the angle information of each pixel point in the text label box, and the coordinate information of the vertex of the text label box.
Illustratively, the output branch 504 applies a 1×1 convolution to the output feature map to obtain the text feature map 5041, another 1×1 convolution to obtain the angle information 5043 of each pixel point in the text box, and another 1×1 convolution to obtain the coordinate information 5042 of the vertices of the text box.
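A minimal sketch of these three parallel heads is given below; the input channel count of 32 matches the example above, while the output channel counts (1 for the text map, 8 for the four vertex coordinates, 1 for the angle) are illustrative assumptions.

```python
import torch.nn as nn

class OutputBranch(nn.Module):
    """Sketch of the output branch 504: three parallel 1x1 convolutions
    over the output feature map, yielding the text feature map (5041),
    the vertex coordinate information (5042) and the per-pixel angle
    information (5043)."""
    def __init__(self, in_channels=32):
        super().__init__()
        self.text_map = nn.Conv2d(in_channels, 1, 1)
        self.vertices = nn.Conv2d(in_channels, 8, 1)  # (x, y) for four vertices
        self.angle = nn.Conv2d(in_channels, 1, 1)

    def forward(self, output_feature_map):
        return (self.text_map(output_feature_map),
                self.vertices(output_feature_map),
                self.angle(output_feature_map))
```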
It should be appreciated that although the feature extraction branch 502 performs 4-stage feature extraction and the feature fusion branch 503 performs 3-stage feature fusion in the embodiment of the present disclosure, in practice the feature extraction branch 502 may perform n-stage feature extraction and the feature fusion branch 503 may perform feature fusion of n-1 stages or fewer, where n ≥ 2.
According to an embodiment of the disclosure, the disclosure further provides a training device of the text detection model.
Fig. 6 illustrates a block diagram of a training device of a text detection model according to an embodiment of the present disclosure.
As shown in fig. 6, the training device 600 for text detection model includes an acquisition module 610, an input module 620, a main body detection unit 630, a feature extraction unit 640, a feature fusion unit 650, and a model training unit 660.
An obtaining module 610, configured to obtain a sample image with a text labeling box, where the text labeling box is used to label a target text on a text body included in the sample image.
The input module 620 is configured to input the sample image into the initial text detection model to perform a corresponding operation through the following units.
And a main body detection unit 630, configured to perform main body detection to obtain the position information of the text main body.
And a feature extraction unit 640, configured to perform feature extraction to obtain a first sample feature map.
And a feature fusion unit 650, configured to perform feature fusion based on the position information and the first sample feature map, to obtain a second sample feature map and angle information of each pixel point in the text label box.
And a model training unit 660, configured to train the initial text detection model based on the second sample feature map, the angle information, and a preset loss function, to obtain a target text detection model.
As an optional embodiment, the feature fusion unit is further configured to perform feature fusion based on the location information and the first sample feature map in response to a ratio of an image area in a subject prediction frame in the entire sample image being smaller than a first preset value, where the subject prediction frame is determined based on the location information.
As an alternative embodiment, the above-mentioned loss function includes: a point constraint loss function l=f (θ' - θ), where θ represents an angle of each pixel point in the text label box; and theta' represents the angle of each pixel point in the text prediction box corresponding to the text label box.
As an optional embodiment, the subject detection unit is configured to perform subject detection by using a subject detection network, to obtain location information of the text subject, where the subject detection network is a network branch in the initial text detection model.
According to an embodiment of the disclosure, the disclosure further provides a text detection device.
Fig. 7 illustrates a block diagram of a text detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the text detection device 700 includes a main body detection module 710, a feature extraction module 720, a feature fusion module 730, and a text detection module 740.
The subject detection module 710 is configured to perform subject detection on the target image by using a subject detection network, so as to obtain location information of a text subject included in the image.
The feature extraction module 720 is configured to perform feature extraction on the target image to obtain a first feature image.
And a feature fusion module 730, configured to perform feature fusion based on the position information and the first feature image, so as to obtain a second feature image.
The text detection module 740 is configured to perform text detection based on the second feature image and using a text detection model obtained by training the device according to the embodiment of the disclosure to obtain a corresponding text prediction box.
As an alternative embodiment, the subject detection network is a network branch in the text detection model.
As an optional embodiment, the target image includes a text body, and a ratio of the text body in the whole target image is smaller than a second preset value.
As an alternative embodiment, the text on the text body has a preset inclination angle.
It should be understood that the embodiments of the apparatus portion of the present disclosure are the same as or similar to the embodiments of the method portion, and the technical problems solved and the technical effects achieved are likewise the same or similar, so they are not described in detail here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, the training method of the text detection model (or the text detection method). For example, in some embodiments, the training method of the text detection model (or the text detection method) may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the text detection model (or the text detection method) described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the training method of the text detection model (or the text detection method) in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
In the technical scheme of the present disclosure, the acquisition, storage and application of the user information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (18)
1. A training method of a text detection model, comprising:
acquiring a sample image with a text annotation box, wherein the text annotation box is used for annotating a target text on a text main body contained in the sample image;
inputting the sample image into an initial text detection model to perform the following operations:
performing main body detection to obtain the position information of the text main body as a main body prediction box;
extracting features to obtain a first sample feature map;
performing feature fusion based on the position information and the first sample feature map to obtain a second sample feature map and angle information of each pixel point in the text annotation box, wherein the angle information of each pixel point is the included angle between a straight line passing through the pixel point and a boundary line of the main body prediction box, the straight line being parallel to one side of the text annotation box; and
training the initial text detection model based on the second sample feature map, the angle information and a preset loss function to obtain a target text detection model, wherein the loss function comprises a point constraint function, the point constraint function is determined based on the difference between the angle of each pixel point in a text prediction box and the angle of the corresponding pixel point in the text annotation box, and the text prediction box represents the position information of the target text output by the initial text detection model.
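By way of illustration only, the following is a minimal PyTorch-style sketch of the training step recited in claim 1: a backbone extracts the first sample feature map, a main body detection head predicts the position of the text main body, a fusion layer produces the second sample feature map together with a per-pixel angle map, and the loss includes a point constraint term on the angle differences. The module names, layer choices, tensor shapes, and the 1 - cos form of the point constraint are all assumptions, not recited in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextDetectionSketch(nn.Module):
    """Illustrative stand-in for the initial text detection model of claim 1.
    The backbone, main body head, fusion layer and angle head below are
    placeholders; the claim does not fix any particular architecture."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(                      # feature extraction -> first sample feature map
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.body_head = nn.Conv2d(64, 4, 1)                # position information of the text main body
        self.fusion = nn.Conv2d(64 + 4, 64, 3, padding=1)   # feature fusion with the position information
        self.angle_head = nn.Conv2d(64, 1, 1)               # angle information of each pixel point

    def forward(self, image):
        feat1 = self.backbone(image)                        # first sample feature map
        body_box = self.body_head(feat1)                    # main body prediction box, as a dense map
        feat2 = F.relu(self.fusion(torch.cat([feat1, body_box], dim=1)))  # second sample feature map
        angles = self.angle_head(feat2)
        return feat2, angles

def point_constraint(pred_angles, gt_angles):
    """Point constraint term from per-pixel angle differences; the 1 - cos
    form is an assumed example, not quoted from the patent."""
    return (1.0 - torch.cos(pred_angles - gt_angles)).mean()

# one illustrative optimization step on dummy data
model = TextDetectionSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
sample_image = torch.randn(2, 3, 128, 128)                  # sample images
gt_angles = torch.zeros(2, 1, 128, 128)                     # angles derived from the text annotation boxes
feat2, pred_angles = model(sample_image)
loss = point_constraint(pred_angles, gt_angles)              # plus detection/geometry losses, omitted here
optimizer.zero_grad()
loss.backward()
optimizer.step()
```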
2. The method according to claim 1, wherein:
performing feature fusion based on the position information and the first sample feature map in response to the ratio of the image area within the main body prediction box to the whole sample image being smaller than a first preset value, wherein the main body prediction box is determined based on the position information.
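A small sketch of the gating condition recited in claim 2, assuming the main body prediction box is available as pixel coordinates (x1, y1, x2, y2); the function name should_fuse and the example threshold of 0.5 are hypothetical, since the claim leaves the first preset value unspecified.

```python
def should_fuse(body_box, image_width, image_height, first_preset_value=0.5):
    """Return True when the image area inside the main body prediction box is a
    sufficiently small fraction of the whole sample image, in which case the
    feature fusion of claim 2 is performed."""
    x1, y1, x2, y2 = body_box
    box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    ratio = box_area / float(image_width * image_height)
    return ratio < first_preset_value

# example: a 200 x 100 box inside a 1000 x 800 image covers 2.5% of it, so fusion runs
print(should_fuse((100, 100, 300, 200), 1000, 800))  # True
```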
3. The method of claim 1 or 2, wherein the loss function comprises a point constraint loss function defined in terms of θ and θ′, wherein:
θ represents the angle of each pixel point in the text annotation box; and
θ′ represents the angle of the corresponding pixel point in the text prediction box corresponding to the text annotation box.
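As an assumed example only, one commonly used per-pixel angle loss consistent with the definitions of θ and θ′ above is

$$L_{\theta} = \frac{1}{N}\sum_{i=1}^{N}\bigl(1 - \cos(\theta'_i - \theta_i)\bigr),$$

where N is the number of pixel points considered; this form reaches zero when every predicted angle matches its annotated angle. The exact expression used in the claim is not restated here.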
4. The method according to claim 1, wherein:
performing main body detection by using a main body detection network to obtain the position information of the text main body, wherein the main body detection network is a network branch in the initial text detection model.
5. A text detection method, comprising:
performing subject detection on a target image by using a subject detection network to obtain the position information of a text subject contained in the target image;
extracting features of the target image to obtain a first feature image;
performing feature fusion based on the position information and the first feature image to obtain a second feature image; and
performing, based on the second feature image, text detection by using a text detection model trained by the method of any one of claims 1 to 4, to obtain a corresponding text prediction box.
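For completeness, a rough inference-time counterpart of the method of claim 5, reusing the illustrative TextDetectionSketch model defined after claim 1; subject (main body) detection and feature extraction happen inside the model, feature fusion yields the second feature image, and the thresholding decoder at the end is a hypothetical placeholder for turning that image into a text prediction box.

```python
import torch

def detect_text(model, target_image, score_threshold=0.5):
    """Claim 5 pipeline sketch. A real decoder would combine the fused feature
    image with the per-pixel angle map to produce rotated text prediction
    boxes; here a single axis-aligned box is returned for illustration."""
    model.eval()
    with torch.no_grad():
        feat2, angles = model(target_image)                 # second feature image + angle map
        score_map = feat2.mean(dim=1, keepdim=True).sigmoid()
        mask = score_map[0, 0] > score_threshold
        ys, xs = torch.nonzero(mask, as_tuple=True)
        if len(xs) == 0:
            return []
        return [(int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))]

# example call on a dummy target image
boxes = detect_text(TextDetectionSketch(), torch.randn(1, 3, 128, 128))
```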
6. The method of claim 5, wherein the subject detection network is a network branch in the text detection model.
7. The method of claim 5, wherein the target image contains a text subject, and the proportion of the text subject in the entire target image is less than a second preset value.
8. The method of claim 7, wherein the text on the text subject has a preset tilt angle.
9. A training device for a text detection model, comprising:
an acquisition module, used for acquiring a sample image with a text annotation box, wherein the text annotation box is used for annotating a target text on a text main body contained in the sample image;
an input module, used for inputting the sample image into an initial text detection model to perform corresponding operations via the following units:
a main body detection unit, used for performing main body detection to obtain the position information of the text main body as a main body prediction box;
a feature extraction unit, used for extracting features to obtain a first sample feature map;
a feature fusion unit, used for performing feature fusion based on the position information and the first sample feature map to obtain a second sample feature map and angle information of each pixel point in the text annotation box, wherein the angle information of each pixel point is the included angle between a straight line passing through the pixel point and a boundary line of the main body prediction box, the straight line being parallel to one side of the text annotation box; and
a model training unit, used for training the initial text detection model based on the second sample feature map, the angle information and a preset loss function to obtain a target text detection model, wherein the loss function comprises a point constraint function, the point constraint function is determined based on the difference between the angle of each pixel point in a text prediction box and the angle of the corresponding pixel point in the text annotation box, and the text prediction box represents the position information of the target text output by the initial text detection model.
10. The apparatus of claim 9, wherein:
the feature fusion unit is further configured to perform feature fusion based on the position information and the first sample feature map in response to the ratio of the image area within the main body prediction box to the whole sample image being smaller than a first preset value, wherein the main body prediction box is determined based on the position information.
11. The apparatus of claim 9 or 10, wherein the loss function comprises a point constraint loss function defined in terms of θ and θ′, wherein:
θ represents the angle of each pixel point in the text annotation box; and
θ′ represents the angle of the corresponding pixel point in the text prediction box corresponding to the text annotation box.
12. The apparatus of claim 9, wherein:
the main body detection unit is used for carrying out main body detection by using a main body detection network to obtain the position information of the text main body, wherein the main body detection network is a network branch in the initial text detection model.
13. A text detection device, comprising:
a subject detection module, used for performing subject detection on a target image by using a subject detection network to obtain the position information of a text subject contained in the target image;
a feature extraction module, used for extracting features of the target image to obtain a first feature image;
a feature fusion module, used for performing feature fusion based on the position information and the first feature image to obtain a second feature image; and
a text detection module, used for performing text detection based on the second feature image by using a text detection model trained by the device of any one of claims 9 to 12 to obtain a corresponding text prediction box.
14. The apparatus of claim 13, wherein the subject detection network is a network branch in the text detection model.
15. The apparatus of claim 13, wherein the target image contains a text subject, and the proportion of the text subject in the entire target image is less than a second preset value.
16. The apparatus of claim 15, wherein the text on the text subject has a preset tilt angle.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110587997.5A CN113326766B (en) | 2021-05-27 | 2021-05-27 | Training method and device of text detection model, text detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113326766A CN113326766A (en) | 2021-08-31 |
CN113326766B true CN113326766B (en) | 2023-09-29 |
Family
ID=77421857
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110587997.5A Active CN113326766B (en) | 2021-05-27 | 2021-05-27 | Training method and device of text detection model, text detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113326766B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114202648B (en) * | 2021-12-08 | 2024-04-16 | 北京百度网讯科技有限公司 | Text image correction method, training device, electronic equipment and medium |
CN114842482B (en) * | 2022-05-20 | 2023-03-17 | 北京百度网讯科技有限公司 | Image classification method, device, equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674804A (en) * | 2019-09-24 | 2020-01-10 | 上海眼控科技股份有限公司 | Text image detection method and device, computer equipment and storage medium |
CN110929727A (en) * | 2020-02-12 | 2020-03-27 | 成都数联铭品科技有限公司 | Image labeling method and device, character detection method and system and electronic equipment |
CN111339341A (en) * | 2018-12-19 | 2020-06-26 | 顺丰科技有限公司 | Model training method and device, positioning method and device, and equipment |
CN111444918A (en) * | 2020-04-01 | 2020-07-24 | 中移雄安信息通信科技有限公司 | Image inclined text line detection model training and image inclined text line detection method |
CN111563502A (en) * | 2020-05-09 | 2020-08-21 | 腾讯科技(深圳)有限公司 | Image text recognition method and device, electronic equipment and computer storage medium |
CN112001368A (en) * | 2020-09-29 | 2020-11-27 | 北京百度网讯科技有限公司 | Character structured extraction method, device, equipment and storage medium |
CN112287927A (en) * | 2020-10-14 | 2021-01-29 | 中国人民解放军战略支援部队信息工程大学 | Method and device for detecting inclination angle of text image |
WO2021017998A1 (en) * | 2019-07-26 | 2021-02-04 | 第四范式(北京)技术有限公司 | Method and system for positioning text position, and method and system for training model |
CN112801097A (en) * | 2021-04-14 | 2021-05-14 | 北京世纪好未来教育科技有限公司 | Training method and device of text detection model and readable storage medium |
2021-05-27: CN CN202110587997.5A, patent CN113326766B (en), status: Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111339341A (en) * | 2018-12-19 | 2020-06-26 | 顺丰科技有限公司 | Model training method and device, positioning method and device, and equipment |
WO2021017998A1 (en) * | 2019-07-26 | 2021-02-04 | 第四范式(北京)技术有限公司 | Method and system for positioning text position, and method and system for training model |
CN110674804A (en) * | 2019-09-24 | 2020-01-10 | 上海眼控科技股份有限公司 | Text image detection method and device, computer equipment and storage medium |
CN110929727A (en) * | 2020-02-12 | 2020-03-27 | 成都数联铭品科技有限公司 | Image labeling method and device, character detection method and system and electronic equipment |
CN111444918A (en) * | 2020-04-01 | 2020-07-24 | 中移雄安信息通信科技有限公司 | Image inclined text line detection model training and image inclined text line detection method |
CN111563502A (en) * | 2020-05-09 | 2020-08-21 | 腾讯科技(深圳)有限公司 | Image text recognition method and device, electronic equipment and computer storage medium |
CN112001368A (en) * | 2020-09-29 | 2020-11-27 | 北京百度网讯科技有限公司 | Character structured extraction method, device, equipment and storage medium |
CN112287927A (en) * | 2020-10-14 | 2021-01-29 | 中国人民解放军战略支援部队信息工程大学 | Method and device for detecting inclination angle of text image |
CN112801097A (en) * | 2021-04-14 | 2021-05-14 | 北京世纪好未来教育科技有限公司 | Training method and device of text detection model and readable storage medium |
Non-Patent Citations (2)
Title |
---|
Text detection in natural scenes based on a lightweight network; Sun Jingjing; Zhang Qinglin; Electronic Measurement Technology (No. 08); full text *
Text detection in game scenes with an improved EAST algorithm; Chi Kai; Zhao Fengyu; Journal of Chinese Computer Systems (No. 10); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113326766A (en) | 2021-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110008956B (en) | Invoice key information positioning method, invoice key information positioning device, computer equipment and storage medium | |
EP3852008A2 (en) | Image detection method and apparatus, device, storage medium and computer program product | |
CN113326766B (en) | Training method and device of text detection model, text detection method and device | |
CN115578735B (en) | Text detection method and training method and device of text detection model | |
CN112949767A (en) | Sample image increment, image detection model training and image detection method | |
CN114429637B (en) | Document classification method, device, equipment and storage medium | |
CN110263877B (en) | Scene character detection method | |
CN114037985A (en) | Information extraction method, device, equipment, medium and product | |
CN113205041A (en) | Structured information extraction method, device, equipment and storage medium | |
CN115861400B (en) | Target object detection method, training device and electronic equipment | |
CN111209856B (en) | Invoice information identification method and device, electronic equipment and storage medium | |
CN114724133B (en) | Text detection and model training method, device, equipment and storage medium | |
CN112287734A (en) | Screen-fragmentation detection and training method of convolutional neural network for screen-fragmentation detection | |
CN112508005B (en) | Method, apparatus, device and storage medium for processing image | |
CN114494751A (en) | License information identification method, device, equipment and medium | |
CN114445825A (en) | Character detection method and device, electronic equipment and storage medium | |
CN113888635A (en) | Visual positioning method, related device and computer program product | |
CN113902899A (en) | Training method, target detection method, device, electronic device and storage medium | |
CN116386049A (en) | Formula identification method, device, equipment and storage medium | |
CN116311298A (en) | Information generation method, information processing device, electronic equipment and medium | |
CN112991451B (en) | Image recognition method, related device and computer program product | |
CN111881778B (en) | Method, apparatus, device and computer readable medium for text detection | |
CN114821596A (en) | Text recognition method and device, electronic equipment and medium | |
US11557108B2 (en) | Polygon detection device, polygon detection method, and polygon detection program | |
CN114120305A (en) | Training method of text classification model, and recognition method and device of text content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||