CN112348028A

CN112348028A - A scene text detection method, correction method, device, electronic device and medium

Info

Publication number: CN112348028A
Application number: CN202011385920.1A
Authority: CN
Inventors: 孙永毫; 徐强
Original assignee: Guangdong Guoli Education Technology Co ltd
Current assignee: Guangdong Guoli Education Technology Co ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2021-02-09

Abstract

The invention provides a scene text detection method, a correction method, a device, an electronic device and a medium, which belong to the technical field of network intelligent education. The scene text detection method includes: acquiring a target picture sent by an intelligent terminal; processing the target picture according to a feature pyramid network to generate a feature map F, predicting a probability map P and a threshold map T from the feature map F, and generating an approximate binary map B from the probability map P and the threshold map T; and performing adaptive thresholding on the approximate binary map B with a differentiable binarization processing model to obtain a first target result, where the first target result includes the different regions generated in the target picture. The invention uses the differentiable binarization processing model to perform the binarization operation inside the segmentation network, which allows joint optimization and makes the threshold adaptive at every position of the heat map, thereby shortening the inference time of picture and text recognition and improving the recognition and correction rate.

Description

Scene text detection method, correction method, device, electronic equipment and medium

Technical Field

The invention belongs to the technical field of network intelligent education, and particularly relates to a scene text detection method, a correction method, a device, electronic equipment and a medium.

Background

With the development of computer technology, online teaching has developed rapidly, and corresponding teaching tool products have emerged to provide technical support and educational guidance for students, teachers and parents. Many of these products can correct exercises from a photograph or screenshot.

The most important step in correcting from a picture is recognition, which depends heavily on the quality of the photographed picture. Unlike document character recognition, character recognition in natural scenes faces complex image backgrounds, low resolution, varied fonts, varied shapes and similar problems, and traditional optical character recognition cannot be applied under these conditions. To recognize natural scene text better, the scene text must first be detected more accurately.

In recent years, segmentation-based methods have become popular in the field of scene text detection because they are more accurate in detecting scene texts of various shapes (curved, vertical, multi-directional).

Because their prediction results are at the pixel level, segmentation-based scene text detection methods can describe text of different shapes. However, most segmentation-based methods require complex post-processing to group the pixel-level predictions into detected text instances, which makes inference relatively time-consuming.

Segmentation-based scene text detection converts the probability map (heat map) generated by the segmentation method into bounding boxes and text regions, a process that includes binarization as a post-processing step. Binarization is critical. The conventional binarization operation sets a fixed threshold, and the standard binarization function it uses is not differentiable. A fixed threshold adapts poorly to complex and variable detection scenes, so the detection results have a high distortion rate and low accuracy and place high demands on post-processing.

Disclosure of Invention

The invention aims to overcome the defects in the prior art by providing a scene text detection method, a correction method, a device, an electronic device and a medium, which solve the problem that binarization processing in the prior art is not differentiable and overcome the technical bottleneck of low recognition efficiency.

In order to achieve the above object, in a first aspect, the present invention provides a scene text detection method, where the method includes:

acquiring a target picture, wherein the target picture is sent by an intelligent terminal;

processing the target picture according to the feature pyramid network to generate a feature map F, predicting a probability map P and a threshold map T through the feature map F, and generating an approximate binary map B through the probability map P and the threshold map T;

and carrying out self-adaptive threshold processing on the approximate binary image B by utilizing a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises different regions in the target image.

Further, an approximate step function is introduced into the differentiable binarization processing model, and differentiable binarization processing is applied to the segmentation network. The relationship between the probability map P, the threshold map T and the approximate binary map B is established with the following formula:

B_{i,j} = 1 / (1 + e^{-k (P_{i,j} - T_{i,j})})

where k is the amplification factor and (i, j) denotes a pixel position.

Further, the first target result includes at least one functional area generated by segmentation, and the functional area is calculated and identified to obtain a first identification contour, where the first identification contour is described by a set of line segments:

G = {S_m}, m = 1, ..., n

wherein n represents the number of vertices;

the polygon is shrunk by the Vatti clipping algorithm, and the contraction offset D is calculated from the perimeter L and the area A of the polygon:

D = A (1 - r^2) / L

where r is the contraction factor.

Further, the first target result is optimized by using a loss function L, which is a weighted sum of the probability map P loss Ls, the binary map B loss Lb and the threshold map T loss Lt: L = Ls + α × Lb + β × Lt, where α and β are weighting factors. The probability map P loss Ls and the binary map B loss Lb use a binary cross entropy loss function:

Ls = Lb = -Σ_{i∈S_t} [ y_i log(x_i) + (1 - y_i) log(1 - x_i) ]

wherein S_t is a sample set in which the ratio of positive to negative samples is 1:3, y_i is the label of sample i and x_i is its predicted value;

Lt uses the L1 distance loss function:

Lt = Σ_i | y*_i - x*_i |

where y*_i is the threshold map label and x*_i is the predicted threshold at supervised pixel i.

In a second aspect, the present invention further provides a correction method, which applies the scene text detection method as described above, where the method includes:

applying the scene text detection method according to any one of claims 1 to 4 to obtain a first target result;

and carrying out correction processing on the first target result to obtain a second target result.

Further, a simulation training model is added, and in the training phase, supervision is carried out on the probability map P, the threshold map T and the approximate binary map B, wherein the threshold map T and the approximate binary map B share the same supervision.

Further, in the process of segmenting the first target result, a test paper contour, a text line contour and a question number frame contour are determined. The test paper contour encloses the whole target picture, the text line contour encloses each line of text, and the question number frame contour encloses the question number of each question. The upper boundary of each question is defined by the question number frame contour and the text line contour; the left and right end points of the upper boundary are extended until the upper boundary meets the test paper contour on both sides, so that the upper boundaries divide the test paper contour into at least one test question area.

Further, each test question area is calculated and identified to obtain the first identification contour, where the first identification contour comprises a printed-text contour, a graphic contour and a handwritten contour; the printed-text contour and the graphic contour form the question information, and the handwritten contour forms the answer information.

Further, in the process of performing correction processing on the first target result, the first target result includes question information and answer information; OCR recognition is performed on the question information to obtain question text recognition information, and OCR recognition is performed on the answer information to obtain answer text recognition information;

keywords are extracted from the question text recognition information according to the question text recognition information and the graphic contour, and a database is queried with the keywords to obtain a group of similar original questions; the graphic area in each question of the similar question group is identified, the graphic similarity between that graphic area and the graphic contour is determined, a final question is selected from the similar question group when the graphic similarity is greater than a preset similarity, and the corresponding answer analysis is retrieved according to the final question.

In a third aspect, the present invention further provides an apparatus applied to the scene text detection method, including:

an acquisition unit configured to acquire a target picture, the target picture being transmitted by an intelligent terminal;

a generating unit configured to process the target picture according to a feature pyramid network, generate a feature map F, predict a probability map P and a threshold map T from the feature map F, and generate an approximate binary map B from the probability map P and the threshold map T;

a binarization unit configured to perform adaptive threshold processing on the approximate binary map B by using a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises the different regions generated in the target picture.

In a fourth aspect, the present invention further provides an apparatus applied to the above correction method, including:

the scene text detection device described above and a correction unit, where the correction unit is configured to perform correction processing on the first target result to obtain a second target result.

In a fifth aspect, the present invention further provides an electronic device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, at least one program, a code set, or an instruction set is loaded and executed by the processor to implement the scene text detection method as described above, or the correction method as described above.

In a sixth aspect, the present invention further provides a computer readable storage medium, on which computer instructions are stored, which when executed by a processor implement the steps of the scene text detection method or the steps of the correction method.

The invention has the beneficial effects that:

The invention uses a differentiable binarization processing model to perform the binarization operation inside the segmentation network, which allows joint optimization and makes the threshold adaptive at every position of the heat map. This shortens the inference time of picture and text recognition, improves the recognition and correction rate, gives high detection and recognition accuracy, and reduces the demands on post-processing.

Drawings

The invention is further illustrated by means of the attached drawings, but the embodiments in the drawings do not constitute any limitation to the invention, and for a person skilled in the art, other drawings can be obtained on the basis of the following drawings without inventive effort.

Fig. 1 is a schematic flowchart of the scene text detection method provided in Embodiment 1.

Fig. 2 is a schematic flowchart of the correction method provided in Embodiment 2.

Fig. 3 is a schematic diagram of the framework of the scene text detection device provided in Embodiment 3.

Fig. 4 is a schematic diagram of the framework of the correction device provided in Embodiment 4.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Example 1:

referring to fig. 1, this embodiment 1 provides a scene text detection method, where the method includes:

acquiring a target picture, wherein the target picture is sent by an intelligent terminal, for example after being photographed by the intelligent terminal;

After the target picture is received, a backbone with a feature pyramid structure upsamples the outputs of the feature pyramid to the same size and concatenates them to generate the feature map F. In the conventional binarization operation, the probability map P output by the network is segmented with a fixed threshold, specifically using the following formula, where P denotes the probability map and t denotes the segmentation threshold:

B_{i,j} = 1 if P_{i,j} >= t, and B_{i,j} = 0 otherwise

Since this binarization calculation is not differentiable, it cannot be optimized together with the segmentation network in the training phase.
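The feature fusion step at the start of this passage (upsample the pyramid outputs to one size, then concatenate them into F) can be sketched as follows. This is a minimal illustration in plain Python, assuming nearest-neighbour upsampling and integer scale factors; the function names and the toy pyramid are hypothetical and are not taken from the patent.

```python
def upsample_nearest(channel, factor):
    """Nearest-neighbour upsampling of one 2-D channel (a list of rows)."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]   # repeat each column
        out.extend(list(wide) for _ in range(factor))    # repeat each row
    return out


def fuse_pyramid(levels):
    """Upsample every level to the size of the first (largest) level and
    concatenate all channels into one feature map F.

    levels: list of feature maps, largest first; each map is a list of channels.
    """
    target_h = len(levels[0][0])
    fused = []
    for feat in levels:
        factor = target_h // len(feat[0])  # assumed to divide evenly
        fused.extend(upsample_nearest(ch, factor) for ch in feat)
    return fused


# Toy pyramid (assumption): one 4x4 channel and one 2x2 channel.
levels = [
    [[[0] * 4 for _ in range(4)]],
    [[[1, 2], [3, 4]]],
]
F = fuse_pyramid(levels)
print(len(F), len(F[1]), len(F[1][0]))
```

A real network would use learned convolutions and a tensor library; the sketch only shows the resize-and-concatenate data flow.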

In this embodiment, an approximate step function is introduced into the differentiable binarization processing model, and differentiable binarization processing is applied to the segmentation network. The relationship between the probability map P, the threshold map T and the approximate binary map B is established with the following formula:

B_{i,j} = 1 / (1 + e^{-k (P_{i,j} - T_{i,j})})

where k is an amplification factor, empirically set to 50, and (i, j) denotes a pixel position.

This formula makes the binarization calculation differentiable, so the condition for gradient back-propagation is satisfied. Differentiable binarization with an adaptive threshold not only helps to segment different regions but can also separate closely adjacent instance regions.
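The contrast between the fixed-threshold step and the approximate step function can be sketched as below. The sketch assumes the sigmoid form B = 1 / (1 + exp(-k (P - T))) that is commonly used for differentiable binarization, with k = 50 as stated in this embodiment; the fixed threshold 0.3 and the sample maps are illustrative assumptions only.

```python
import math


def hard_binarize(p, t=0.3):
    """Conventional step function; its gradient is zero almost everywhere."""
    return 1.0 if p >= t else 0.0


def differentiable_binarize(p, t, k=50.0):
    """Approximate step function: 1 / (1 + exp(-k * (p - t)))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def db_grad_wrt_p(p, t, k=50.0):
    """Analytic derivative dB/dP = k * B * (1 - B), non-zero near p == t."""
    b = differentiable_binarize(p, t, k)
    return k * b * (1.0 - b)


# Toy 2x2 probability map and per-pixel threshold map (assumed values).
P = [[0.9, 0.2], [0.55, 0.45]]
T = [[0.5, 0.5], [0.5, 0.5]]
B = [[differentiable_binarize(p, t) for p, t in zip(p_row, t_row)]
     for p_row, t_row in zip(P, T)]
print(B)
```

Because the derivative is non-zero around P = T, the threshold map can be learned together with the probability map instead of being fixed by hand.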

In this embodiment, the first target result includes at least one functional area generated by segmentation, and the functional area is calculated and identified to obtain a first identification contour. The first identification contour is a polygon and may be curved, vertical or multi-directional, with sides at different angles; that is, the first identification contour may be described by a set of line segments:

G = {S_m}, m = 1, ..., n

wherein n represents the number of vertices;

the polygon is shrunk by the Vatti clipping algorithm, and the contraction offset D is calculated from the perimeter L and the area A of the polygon:

D = A (1 - r^2) / L

where r is the contraction factor.
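As a worked example of the contraction offset, the sketch below computes the area and perimeter of a polygon and derives D from them. It assumes the offset takes the form D = A (1 - r^2) / L, the form commonly paired with Vatti clipping in differentiable binarization; the value r = 0.4 and the sample square are assumptions for illustration, and the clipping itself is not implemented here.

```python
import math


def polygon_area(vertices):
    """Shoelace formula; vertices is an ordered list of (x, y) tuples."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(vertices):
    """Sum of the lengths of the n line segments that close the polygon."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))


def shrink_offset(vertices, r=0.4):
    """Contraction offset D = A * (1 - r^2) / L (assumed form)."""
    return polygon_area(vertices) * (1.0 - r * r) / polygon_perimeter(vertices)


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
D = shrink_offset(square, r=0.4)
print(round(D, 3))
```

For the 10 x 10 square, A = 100 and L = 40, so D = 100 x 0.84 / 40 = 2.1; each edge would be moved inward by that distance when the polygon is shrunk.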

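In this embodiment, the first target result is optimized by using a loss function L, which is a weighted sum of the probability map P loss Ls, the binary map B loss Lb and the threshold map T loss Lt: L = Ls + α × Lb + β × Lt, where α and β are weighting factors set to 1.0 and 10, respectively.

The probability map P loss Ls and the binary map B loss Lb use a binary cross entropy loss function:

Ls = Lb = -Σ_{i∈S_t} [ y_i log(x_i) + (1 - y_i) log(1 - x_i) ]

wherein S_t is a sample set in which the ratio of positive to negative samples is 1:3, y_i is the label of sample i and x_i is its predicted value;

Lt uses the L1 distance loss function:

Lt = Σ_i | y*_i - x*_i |

where y*_i is the threshold map label and x*_i is the predicted threshold at supervised pixel i.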
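The weighted loss can be illustrated with a small sketch. It assumes that the 1:3 positive-to-negative sample set is built by keeping all positives and the hardest negatives (a common choice, not confirmed by this document), and it averages each term over its samples, which is also an assumption; α = 1.0 and β = 10 are the values given in this embodiment.

```python
import math


def bce_hard_negative(pred, label, neg_ratio=3, eps=1e-6):
    """Binary cross entropy over all positives plus the hardest negatives,
    keeping at most neg_ratio negatives per positive (1:3 by default)."""
    pos = [-math.log(max(p, eps)) for p, y in zip(pred, label) if y == 1]
    neg = sorted(
        (-math.log(max(1.0 - p, eps)) for p, y in zip(pred, label) if y == 0),
        reverse=True,
    )[: neg_ratio * len(pos)]
    count = len(pos) + len(neg)
    return (sum(pos) + sum(neg)) / count if count else 0.0


def l1_loss(pred, target):
    """Mean absolute (L1) distance between predicted and label thresholds."""
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)


def total_loss(ls, lb, lt, alpha=1.0, beta=10.0):
    """L = Ls + alpha * Lb + beta * Lt."""
    return ls + alpha * lb + beta * lt


# Flattened toy maps (assumed values): one text pixel, four background pixels.
label = [1, 0, 0, 0, 0]
prob_pred = [0.9, 0.1, 0.2, 0.3, 0.8]
Ls = bce_hard_negative(prob_pred, label)
Lt = l1_loss([0.5, 0.7], [0.4, 0.9])
print(Ls, Lt, total_loss(Ls, Ls, Lt))
```

With one positive sample, only the three highest-loss negatives contribute, so the easiest background pixel (prediction 0.1) is ignored.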

Example 2:

referring to fig. 2, this embodiment 2 provides a correction method to which the scene text detection method in embodiment 1 is applied, and the method includes:

applying the scene text detection method in embodiment 1 to obtain a first target result;

It should be noted that the scene text detection method in embodiment 1 is applied here to a correction method in the educational field. More specifically, a target picture is obtained by photographing and is processed by the scene text detection method in embodiment 1 to obtain a first target result. According to the set requirements, the first target result separates the areas of the target picture that carry different meanings, such as questions, answers, graphics and formulas. Correction processing is then performed on the first target result, and a second target result is pushed to the user.

In this embodiment, a simulation training model is added, which gives two phases: a training phase and a correction phase. In the training phase, supervision is applied to the probability map P, the threshold map T and the approximate binary map B, where the threshold map T and the approximate binary map B share the same supervision. In this way, bounding boxes can be obtained easily and quickly from the threshold map T and the approximate binary map B in the correction phase.

In addition, training data can be randomly generated by the simulation training model; with this training data and the detect-then-correct workflow, the model can be effectively refined and its response speed improved.

As a preferable mode, in the process of segmenting the first target result, a test paper contour, a text line contour and a question number frame contour are determined. The test paper contour encloses the whole target picture, the text line contour encloses each line of text, and the question number frame contour encloses the question number of each question. The upper boundary of each question is defined by the question number frame contour and the text line contour; the left and right end points of the upper boundary are extended until the upper boundary meets the test paper contour on both sides, so that the upper boundaries divide the test paper contour into at least one test question area.

It should be noted that the upper boundary of the question number frame contour is the upper boundary of the question; it can be defined by the question number frame contour together with the contour of the first text line of that question. After its left and right end points are extended, the upper boundary joins the test paper contour, so the upper boundaries divide the test paper contour into at least one test question area: each pair of adjacent upper boundaries encloses one test question area, and the last test question on the page lies between the last upper boundary and the bottom edge of the test paper contour.
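The division of the page into test question areas can be sketched as a simple interval split. The sketch assumes an upright page, so each extended upper boundary reduces to a single y-coordinate; the function name and the pixel values are hypothetical.

```python
def split_question_areas(page_top, page_bottom, upper_boundaries):
    """Return one (top, bottom) y-range per question.

    Each area runs from one upper boundary to the next; the last area runs
    from the last upper boundary to the bottom edge of the test paper contour.
    """
    ys = sorted(y for y in upper_boundaries if page_top <= y < page_bottom)
    areas = []
    for i, top in enumerate(ys):
        bottom = ys[i + 1] if i + 1 < len(ys) else page_bottom
        areas.append((top, bottom))
    return areas


# Assumed layout: a 1000-pixel-high page with three question numbers detected.
areas = split_question_areas(0, 1000, [430, 120, 760])
print(areas)
```

For a skewed photograph, the boundaries would be line segments rather than scalar heights, and the areas would be polygons clipped against the test paper contour.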

As a preferred mode, each test question area is calculated and identified to obtain a first identification contour, where the first identification contour comprises a printed-text contour, a graphic contour and a handwritten contour; the printed-text contour and the graphic contour form the question information, and the handwritten contour forms the answer information. It should be noted that a formula contour may also be included: because the question information contains formulas and graphics in addition to printed characters, these three elements together form the complete question information. The handwritten contour may also contain characters and formulas, but since all of it is handwriting, all of it belongs to the answer information.
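The grouping rule above can be written down directly. In this sketch the contour kinds are plain strings and each contour is a bounding box tuple; both representations are assumptions made for illustration.

```python
# Contour kinds that belong to the question side (printed text, graphics, formulas).
QUESTION_KINDS = {"printed", "graphic", "formula"}


def group_contours(contours):
    """Split (kind, box) pairs into question information and answer information.

    Everything handwritten, whether characters or formulas, goes to the answer.
    """
    info = {"question": [], "answer": []}
    for kind, box in contours:
        if kind == "handwritten":
            info["answer"].append(box)
        elif kind in QUESTION_KINDS:
            info["question"].append(box)
    return info


# Hypothetical contours from one test question area: (kind, (x1, y1, x2, y2)).
contours = [
    ("printed", (10, 10, 300, 40)),
    ("graphic", (10, 50, 150, 200)),
    ("handwritten", (160, 60, 300, 120)),
    ("formula", (10, 210, 200, 240)),
]
info = group_contours(contours)
print(len(info["question"]), len(info["answer"]))
```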

As a preferred mode, when correction processing is performed on the first target result, the first target result has already been calculated and identified and the corresponding question information and answer information have been formed, each with a corresponding label. OCR recognition is then performed on the question information to obtain question text recognition information, and on the answer information to obtain answer text recognition information;

keywords are extracted from the question text recognition information according to the question text recognition information and the graphic contour (and, where present, the formula contour), and a database is queried with the keywords to obtain a group of similar original questions; the graphic area in each question of the similar question group is identified, the graphic similarity between that graphic area and the graphic contour is determined, a final question is selected from the similar question group when the graphic similarity is greater than a preset similarity, and the corresponding answer analysis is retrieved according to the final question.

It should be noted that a similar question group, containing at least one question, is first found according to the keywords in the question text recognition information; the final question is then located by combining the graphic similarity between the graphic area of each question and the graphic contour. Optionally, the final question and the corresponding answer analysis are pushed to the user. Depending on the actual situation, after the question has been corrected, at least one of the points gained or lost, the correction mark and the score ranking for the question can also be pushed, and the correction can either be marked directly on the target picture or the correction result can be sent to the intelligent terminal.
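The two-stage lookup described above (keyword query first, graphic similarity second) might look like the following sketch. The document does not say how graphic similarity is measured, so Jaccard overlap of feature tags is used purely as a stand-in; the in-memory database, its field names and the 0.8 threshold are likewise hypothetical.

```python
def keyword_candidates(keywords, database):
    """Stage 1: questions that share at least one keyword with the query."""
    wanted = set(keywords)
    return [q for q in database if wanted & set(q["keywords"])]


def graphic_similarity(a, b):
    """Stand-in similarity: Jaccard overlap of two sets of feature tags."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def find_final_question(keywords, graphic_features, database, min_similarity=0.8):
    """Stage 2: the most similar candidate whose graphic similarity is
    strictly greater than the preset similarity, or None."""
    best, best_sim = None, min_similarity
    for q in keyword_candidates(keywords, database):
        sim = graphic_similarity(graphic_features, q["graphic_features"])
        if sim > best_sim:
            best, best_sim = q, sim
    return best


# Hypothetical question bank.
database = [
    {"id": 1, "keywords": ["triangle", "area"],
     "graphic_features": ["triangle", "right_angle", "height"],
     "analysis": "Use base x height / 2."},
    {"id": 2, "keywords": ["circle", "area"],
     "graphic_features": ["circle", "radius"],
     "analysis": "Use pi x r^2."},
]
match = find_final_question(["triangle", "area"],
                            ["triangle", "right_angle", "height"], database)
print(match["id"] if match else None)
```

Both entries pass the keyword stage because they share "area"; only the graphic comparison separates them, which is why the second stage is needed.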

Example 3:

referring to fig. 3, this embodiment 3 provides an apparatus applied to the scene text detection method in embodiment 1, including:

an acquisition unit, a generating unit and a binarization unit. The acquisition unit acquires a target picture, which includes information such as the test questions, the answers, the examinee, the examination time, the subject and the grade; the generating unit generates an approximate binary map B for the target picture; and the binarization unit performs adaptive threshold processing on the approximate binary map B by using the differentiable binarization processing model, so that the different text regions in the scene are segmented accurately.

Example 4:

referring to fig. 4, this embodiment 4 provides an apparatus applied to the correction method in embodiment 2, including the scene text detection apparatus in embodiment 3 and a correction unit, where the correction unit is configured to perform correction processing on the first target result to obtain a second target result.

It should be noted that, on the basis of embodiment 3, a correction unit is added, and different text areas are identified, judged and corrected in combination with the photographing correction requirement in the education field.
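The relationship between the two devices (the correction device is the detection device plus a correction unit) can be sketched with plain classes. The units are passed in as callables, and the stand-in lambdas are placeholders for real models; all names are hypothetical, and the generating unit is simplified to return only the probability and threshold maps.

```python
class SceneTextDetectionDevice:
    """Acquisition, generating and binarization units wired in sequence."""

    def __init__(self, acquire, generate, binarize):
        self.acquire = acquire      # source -> target picture
        self.generate = generate    # picture -> (probability map, threshold map)
        self.binarize = binarize    # (P, T) -> first target result

    def detect(self, source):
        picture = self.acquire(source)
        prob_map, thresh_map = self.generate(picture)
        return self.binarize(prob_map, thresh_map)


class CorrectionDevice(SceneTextDetectionDevice):
    """Detection device extended with a correction unit."""

    def __init__(self, acquire, generate, binarize, correct):
        super().__init__(acquire, generate, binarize)
        self.correct = correct      # first target result -> second target result

    def run(self, source):
        return self.correct(self.detect(source))


# Stand-in units, for illustration only.
device = CorrectionDevice(
    acquire=lambda src: src,
    generate=lambda pic: (pic, [0.5] * len(pic)),
    binarize=lambda p, t: [pi > ti for pi, ti in zip(p, t)],
    correct=lambda regions: {"text_pixels": sum(regions)},
)
result = device.run([0.9, 0.2, 0.7])
print(result)
```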

Example 5:

this embodiment 5 provides an electronic device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, at least one program, a code set, or a set of instructions is loaded and executed by the processor to implement the scene text detection method in embodiment 1 or the correction method in embodiment 2.

Example 6:

this embodiment 6 provides a computer-readable storage medium on which computer instructions are stored, which when executed by a processor implement the steps of the scene text detection method as in embodiment 1, or the steps of the correction method as in embodiment 2.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

Compared with the prior art, the method and the device adopt a model detection algorithm aiming at the target image to generate the question information detection result and the answer information detection result, perform OCR model recognition on the two detection results respectively to recognize the character line recognition result and the formula recognition result, improve the detection and recognition efficiency of the chart and the formula in the test question and the answer, and further improve the correction efficiency.

Finally, it should be emphasized that the above are only preferred embodiments of the present invention and that the present invention is not limited to these embodiments; any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

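1. A scene text detection method, characterized by comprising the following steps:

acquiring a target picture, wherein the target picture is sent by an intelligent terminal;

processing the target picture according to the feature pyramid network to generate a feature map F, predicting a probability map P and a threshold map T through the feature map F, and generating an approximate binary map B through the probability map P and the threshold map T;

and carrying out adaptive threshold processing on the approximate binary map B by utilizing a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises the different regions generated in the target picture.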

2. The scene text detection method according to claim 1, characterized in that an approximate step function is introduced into the differentiable binarization processing model, the differentiable binarization processing is applied to the segmentation network, and the relationship between the probability map P, the threshold map T and the approximate binary map B is established with the following formula:

B_{i,j} = 1 / (1 + e^{-k (P_{i,j} - T_{i,j})})

where k is the amplification factor and (i, j) denotes a pixel position.

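3. The scene text detection method according to claim 2, wherein the first target result comprises at least one functional area generated by segmentation, and the functional area is calculated and identified to obtain a first identification contour, and the first identification contour is described by a set of line segments:

G = {S_m}, m = 1, ..., n

wherein n represents the number of vertices;

the polygon is shrunk by the Vatti clipping algorithm, and the contraction offset D is calculated from the perimeter L and the area A of the polygon:

D = A (1 - r^2) / L

where r is the contraction factor.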

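4. The scene text detection method according to claim 3, wherein the first target result is optimized by using a loss function L, which is a weighted sum of the probability map P loss Ls, the binary map B loss Lb and the threshold map T loss Lt: L = Ls + α × Lb + β × Lt, where α and β are weighting factors, and the probability map P loss Ls and the binary map B loss Lb use a binary cross entropy loss function:

Ls = Lb = -Σ_{i∈S_t} [ y_i log(x_i) + (1 - y_i) log(1 - x_i) ]

wherein S_t is a sample set in which the ratio of positive to negative samples is 1:3, y_i is the label of sample i and x_i is its predicted value;

Lt uses the L1 distance loss function:

Lt = Σ_i | y*_i - x*_i |

where y*_i is the threshold map label and x*_i is the predicted threshold at supervised pixel i.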

5. A correction method applying the scene text detection method according to any one of claims 1 to 4, characterized in that the method comprises:

applying the scene text detection method according to any one of claims 1 to 4 to obtain a first target result;

and carrying out correction processing on the first target result to obtain a second target result.

6. The correction method according to claim 5, characterized in that a simulation training model is added, and in the training phase supervision is applied to the probability map P, the threshold map T and the approximate binary map B, wherein the threshold map T and the approximate binary map B share the same supervision.

7. The correction method of claim 6, wherein in the process of segmenting the first target result, a test paper contour, a text line contour and a question number frame contour are determined, the test paper contour encloses the whole target picture, the text line contour encloses each line of text, the question number frame contour encloses the question number of each question, an upper boundary of each question is defined by the question number frame contour and the text line contour, left and right end points of the upper boundary are extended until the upper boundary meets the test paper contour on both sides, and the upper boundaries divide the test paper contour into at least one test question area.

8. The correction method of claim 7, wherein the test question area is calculated and identified, the first identification contour comprises a printed-text contour, a graphic contour and a handwritten contour, the printed-text contour and the graphic contour constitute question information, and the handwritten contour constitutes answer information.

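9. The correction method of claim 8, wherein during the correction processing of the first target result, the first target result comprises question information and answer information, OCR recognition is performed on the question information to obtain question text recognition information, and OCR recognition is performed on the answer information to obtain answer text recognition information;

keywords are extracted from the question text recognition information according to the question text recognition information and the graphic contour, and a database is queried with the keywords to obtain a group of similar original questions; the graphic area in each question of the similar question group is identified, the graphic similarity between that graphic area and the graphic contour is determined, a final question is selected from the similar question group when the graphic similarity is greater than a preset similarity, and the corresponding answer analysis is retrieved according to the final question.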

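10. An apparatus applied to the scene text detection method according to any one of claims 1 to 4, comprising:

an acquisition unit configured to acquire a target picture, the target picture being transmitted by an intelligent terminal;

a generating unit configured to process the target picture according to a feature pyramid network, generate a feature map F, predict a probability map P and a threshold map T from the feature map F, and generate an approximate binary map B from the probability map P and the threshold map T;

a binarization unit configured to perform adaptive threshold processing on the approximate binary map B by using a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises the different regions generated in the target picture.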

11. An apparatus applied to the correction method according to any one of claims 5 to 9, characterized by comprising:

the scene text detection device as claimed in claim 10, and a correction unit configured to perform correction processing on the first target result to obtain a second target result.

12. An electronic device comprising a processor and a memory, wherein the memory has stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the scene text detection method according to any one of claims 1 to 4, or to implement the correction method according to any one of claims 5 to 9.

13. A computer readable storage medium having stored thereon computer instructions, which, when executed by a processor, carry out the steps of the scene text detection method according to any one of claims 1 to 4, or carry out the steps of the correction method according to any one of claims 5 to 9.