Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a scene text detection method, a correction method, an apparatus, an electronic device and a medium. It solves the problem that binarization in the prior art is not differentiable and removes the technical bottleneck of low recognition efficiency.
In order to achieve the above object, in a first aspect, the present invention provides a scene text detection method, where the method includes:
acquiring a target picture, wherein the target picture is sent by an intelligent terminal;
processing the target picture according to the feature pyramid network to generate a feature map F, predicting a probability map P and a threshold map T through the feature map F, and generating an approximate binary map B through the probability map P and the threshold map T;
and performing adaptive threshold processing on the approximate binary map B by using a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises the different regions in the target picture.
Further, in the differentiable binarization processing model, an approximate step function is introduced so that differentiable binarization can be applied to the segmentation network. The relationship between the probability map P, the threshold map T and the approximate binary map B is established by the following formula:
$$B_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}$$
where $(i, j)$ is a pixel position and k is the amplification factor.
Further, the first target result includes at least one functional area generated by segmentation, and the functional area is calculated and identified to obtain a first identification contour, where the first identification contour is a polygon described by a set of line segments:
$$G = \{S_m\}_{m=1}^{n}$$
wherein n represents the number of vertices;
the polygon is shrunk by the Vatti clipping algorithm, and the contraction offset D is calculated from the perimeter L and the area A of the polygon:
$$D = \frac{A\,(1 - r^{2})}{L}$$
where r is the contraction factor.
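Further, the first target result is optimized by using a loss function L, which is a weighted sum of the loss Ls of the probability map P, the loss Lb of the binary map B and the loss Lt of the threshold map T:
$$L = L_s + \alpha \times L_b + \beta \times L_t$$
where α and β are weighting factors. The probability map loss Ls and the binary map loss Lb use a binary cross entropy loss function:
$$L_s = L_b = -\sum_{i \in S_t} \left[\, y_i \log x_i + (1 - y_i) \log (1 - x_i) \,\right]$$
wherein $S_t$ is a sample set in which the ratio of positive samples to negative samples is 1:3, $y_i$ is the label of sample i and $x_i$ is the corresponding predicted value;
Lt uses the L1 distance loss function:
$$L_t = \sum_{i} \lvert y_i^{*} - x_i^{*} \rvert$$
where $y_i^{*}$ is the threshold map label and $x_i^{*}$ is the predicted threshold at pixel i, and the sum runs over the supervised pixels of the threshold map.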
In a second aspect, the present invention further provides a correction method that applies the scene text detection method described above, where the method includes:
applying the scene text detection method according to any one of claims 1 to 4 to obtain a first target result;
and performing correction processing on the first target result to obtain a second target result.
Further, a simulation training model is added, and in the training phase, supervision is carried out on the probability map P, the threshold map T and the approximate binary map B, wherein the threshold map T and the approximate binary map B share the same supervision.
Further, in the process of separating the first target result, a test paper contour, text line contours and question number frame contours are determined, wherein the test paper contour encloses the whole target picture, each text line contour encloses one line of text, and each question number frame contour encloses the number of one question. The upper boundary of each question is defined by its question number frame contour and text line contour; the left and right end points of the upper boundary are extended until the upper boundary meets the test paper contour on both sides, so that the upper boundaries divide the test paper contour into at least one test question area.
Further, each test question area is calculated and identified to obtain the first identification contour, wherein the first identification contour comprises a print contour, a graph contour and a handwriting contour; the print contour and the graph contour form the question information, and the handwriting contour forms the answer information.
Further, in the process of performing correction processing on the first target result, the first target result includes question information and answer information; OCR recognition is performed on the question information to obtain question text recognition information, and OCR recognition is performed on the answer information to obtain answer text recognition information;
keywords are extracted according to the question text recognition information and the graph contour, and a database is queried with the keywords to obtain a group of similar questions; the graph area of each question in the group is identified and its graph similarity to the graph contour is evaluated; when the graph similarity is greater than a preset similarity, the final question is determined from the group, and the corresponding answer analysis is retrieved according to the final question.
In a third aspect, the present invention further provides an apparatus applied to the scene text detection method, including:
an acquisition unit configured to acquire a target picture, the target picture being transmitted by an intelligent terminal;
a generating unit configured to process the target picture according to a feature pyramid network, generate a feature map F, predict a probability map P and a threshold map T from the feature map F, and generate an approximate binary map B from the probability map P and the threshold map T;
a binarization unit configured to perform adaptive threshold processing on the approximate binary map B by using a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises the different regions in the target picture.
In a fourth aspect, the present invention further provides an apparatus applied to the above correction method, including:
the scene text detection apparatus described above and a correction unit, where the correction unit is configured to perform correction processing on the first target result to obtain a second target result.
In a fifth aspect, the present invention further provides an electronic device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, at least one program, code set, or instruction set is loaded and executed by the processor to implement the scene text detection method described above or the correction method described above.
In a sixth aspect, the present invention further provides a computer-readable storage medium on which computer instructions are stored, where the computer instructions, when executed by a processor, implement the steps of the scene text detection method or the steps of the correction method.
The invention has the beneficial effects that:
The invention uses a differentiable binarization processing model to perform the binarization operation inside the segmentation network, so that binarization is optimized jointly with the network and the threshold adapts to each position of the heat map. This shortens the inference time of picture text recognition, improves the recognition and correction rate, gives high detection and recognition accuracy, and reduces the need for post-processing.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first", "second" and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Example 1:
Referring to fig. 1, embodiment 1 provides a scene text detection method, where the method includes:
acquiring a target picture, wherein the target picture is sent by an intelligent terminal and may be captured by the intelligent terminal before being sent;
processing the target picture according to the feature pyramid network to generate a feature map F, predicting a probability map P and a threshold map T through the feature map F, and generating an approximate binary map B through the probability map P and the threshold map T;
and performing adaptive threshold processing on the approximate binary map B by using a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises the different regions in the target picture.
After the target picture is received, it is passed through a backbone with a feature pyramid structure; the outputs of the feature pyramid are upsampled to the same size and concatenated to generate the feature map F. In the conventional binarization operation, the probability map P output by the network is segmented with a fixed threshold, specifically using the following formula:
$$B_{i,j} = \begin{cases} 1, & P_{i,j} \ge t \\ 0, & \text{otherwise} \end{cases}$$
where P denotes the probability map, t denotes the fixed segmentation threshold and $(i, j)$ is a pixel position.
Since this binarization is not differentiable, it cannot be optimized together with the segmentation network in the training phase.
In this embodiment, an approximate step function is introduced into the differentiable binarization processing model so that differentiable binarization can be applied to the segmentation network. The relationship between the probability map P, the threshold map T and the approximate binary map B is established by the following formula:
$$B_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}$$
where k is an amplification factor, empirically set to 50.
This formula makes the binarization differentiable, so gradients can be back-propagated through it. Differentiable binarization with an adaptive threshold not only helps to separate different regions but can also separate similar instance regions.
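The difference between the fixed-threshold binarization and the approximate step function can be illustrated with a minimal sketch. It assumes the maps are plain nested lists of per-pixel values; the function names, the fixed threshold of 0.5 and the sample values are illustrative assumptions, not part of the embodiment. Only k = 50 is taken from the description.

```python
import math

def hard_binarize(p, t=0.5):
    """Conventional binarization: 1 where the probability reaches the fixed threshold t.
    The value t=0.5 is an illustrative assumption."""
    return [[1.0 if v >= t else 0.0 for v in row] for row in p]

def differentiable_binarize(p, thr, k=50.0):
    """Approximate step function, per pixel: B = 1 / (1 + exp(-k * (P - T)))."""
    return [
        [1.0 / (1.0 + math.exp(-k * (pv - tv))) for pv, tv in zip(p_row, t_row)]
        for p_row, t_row in zip(p, thr)
    ]

def gradient_wrt_p(pv, tv, k=50.0):
    """Derivative of B with respect to P, equal to k * B * (1 - B); finite everywhere."""
    b = 1.0 / (1.0 + math.exp(-k * (pv - tv)))
    return k * b * (1.0 - b)

# Hypothetical 2x2 probability map and threshold map.
P = [[0.90, 0.35], [0.32, 0.05]]
T = [[0.30, 0.30], [0.40, 0.30]]
B = differentiable_binarize(P, T)
```

In this example the pixel with probability 0.35 falls below the fixed threshold but lies above its local threshold of 0.30, so the adaptive form assigns it a value close to 1, and the gradient stays well defined at every pixel.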
In this embodiment, the first target result includes at least one functional area generated by segmentation, and the functional area is calculated and identified to obtain a first identification contour. The first identification contour is a polygon that may be curved, vertical or multi-oriented, with sides at different angles; that is, it may be described by a set of line segments:
$$G = \{S_m\}_{m=1}^{n}$$
wherein n represents the number of vertices;
the polygon is shrunk by the Vatti clipping algorithm, and the contraction offset D is calculated from the perimeter L and the area A of the polygon:
$$D = \frac{A\,(1 - r^{2})}{L}$$
where r is the contraction factor.
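As a worked illustration of the contraction offset, the following sketch computes D from the vertices of a polygon, assuming the offset takes the form D = A(1 - r²)/L. The value r = 0.4 and the helper names are assumptions made for the example; the embodiment does not state a value for r. The Vatti clipping step itself is not implemented here and would normally be delegated to a polygon clipping library.

```python
import math

def polygon_area(pts):
    """Area of a simple polygon by the shoelace formula; pts are (x, y) vertices in order."""
    n = len(pts)
    s = 0.0
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def polygon_perimeter(pts):
    """Sum of the lengths of the n line segments that close the polygon."""
    n = len(pts)
    return sum(math.dist(pts[i], pts[(i + 1) % n]) for i in range(n))

def contraction_offset(pts, r=0.4):
    """D = A * (1 - r^2) / L; r=0.4 is an assumed contraction factor."""
    return polygon_area(pts) * (1.0 - r * r) / polygon_perimeter(pts)

# Hypothetical text box of 100 x 20 pixels.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
D = contraction_offset(box)
```

For this box the area is 2000 and the perimeter is 240, so each side would be moved inward by 7 pixels.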
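In this embodiment, the first target result is optimized by using a loss function L, which is a weighted sum of the loss Ls of the probability map P, the loss Lb of the binary map B and the loss Lt of the threshold map T:
$$L = L_s + \alpha \times L_b + \beta \times L_t$$
where α and β are weighting factors, set to 1.0 and 10 respectively.
The probability map loss Ls and the binary map loss Lb use a binary cross entropy loss function:
$$L_s = L_b = -\sum_{i \in S_t} \left[\, y_i \log x_i + (1 - y_i) \log (1 - x_i) \,\right]$$
wherein $S_t$ is a sample set in which the ratio of positive samples to negative samples is 1:3, $y_i$ is the label of sample i and $x_i$ is the corresponding predicted value;
Lt uses the L1 distance loss function:
$$L_t = \sum_{i} \lvert y_i^{*} - x_i^{*} \rvert$$
where $y_i^{*}$ is the threshold map label and $x_i^{*}$ is the predicted threshold at pixel i, and the sum runs over the supervised pixels of the threshold map.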
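The loss terms can be sketched as follows on flattened lists of per-pixel values. This is a minimal illustration under stated assumptions: the embodiment only says that the sample set has a positive-to-negative ratio of 1:3, so choosing the negatives with the highest predicted values is an assumption, as are the function names and the clamping constant.

```python
import math

def sample_set(pred, label, neg_ratio=3):
    """Indices of all positives plus at most neg_ratio negatives per positive.
    Picking the highest-scoring negatives is an assumption."""
    pos = [i for i, y in enumerate(label) if y == 1]
    neg = sorted((i for i, y in enumerate(label) if y == 0),
                 key=lambda i: pred[i], reverse=True)
    return pos + neg[: neg_ratio * len(pos)]

def bce(pred, label, eps=1e-6):
    """Binary cross entropy summed over the given samples."""
    total = 0.0
    for x, y in zip(pred, label):
        x = min(max(x, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(x) + (1.0 - y) * math.log(1.0 - x)
    return total

def l1(pred, label):
    """L1 distance between predicted and labelled threshold values."""
    return sum(abs(y - x) for x, y in zip(pred, label))

def total_loss(ls, lb, lt, alpha=1.0, beta=10.0):
    """L = Ls + alpha * Lb + beta * Lt, with the weights given in this embodiment."""
    return ls + alpha * lb + beta * lt

# Hypothetical predictions and labels for six pixels.
pred = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3]
label = [1, 0, 0, 0, 0, 0]
idx = sample_set(pred, label)
ls = bce([pred[i] for i in idx], [label[i] for i in idx])
```

With α = 1.0 and β = 10, a threshold map loss of 0.1 contributes as much to L as a probability map loss of 1.0, which reflects the larger weight given to Lt.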
Example 2:
Referring to fig. 2, embodiment 2 provides a correction method to which the scene text detection method in embodiment 1 is applied, and the method includes:
applying the scene text detection method in embodiment 1 to obtain a first target result;
and performing correction processing on the first target result to obtain a second target result.
It should be noted that the scene text detection method of embodiment 1 is applied here to a correction method in the education field. More specifically, a target picture is obtained by photographing and processed by the scene text detection method of embodiment 1 to obtain a first target result, in which areas of the target picture with different meanings, such as questions, answers, graphs and formulas, are separated according to set requirements. Correction processing is then performed on the first target result, and the second target result is pushed to the user.
In this embodiment, a simulation training model is added, which gives two phases: a training phase and a correction phase. In the training phase, supervision is applied to the probability map P, the threshold map T and the approximate binary map B, wherein the threshold map T and the approximate binary map B share the same supervision; in this way, bounding boxes can be obtained easily and quickly from the threshold map T and the approximate binary map B in the correction phase.
In addition, training data can be randomly generated by the simulation training model; training on this data through the detection and correction flow effectively refines the model and improves its response speed.
As a preferred mode, in the process of separating the first target result, a test paper contour, text line contours and question number frame contours are determined, wherein the test paper contour encloses the whole target picture, each text line contour encloses one line of text, and each question number frame contour encloses the number of one question. The upper boundary of each question is defined by its question number frame contour and text line contour; the left and right end points of the upper boundary are extended until the upper boundary meets the test paper contour on both sides, so that the upper boundaries divide the test paper contour into at least one test question area.
It should be noted that the upper boundary of the question number frame contour is the upper boundary of each question; the upper boundary of each question can be defined by the question number frame contour and the contour of the corresponding first text line. After its left and right end points are extended, the upper boundary meets the test paper contour, so that the upper boundaries divide the test paper contour into at least one test question area: one test question area lies between every two adjacent upper boundaries, and the last test question on the page lies between the last upper boundary and the bottom edge of the test paper contour.
As a preferred mode, each test question area is calculated and identified to obtain a first identification contour, wherein the first identification contour comprises a print contour, a graph contour and a handwriting contour; the print contour and the graph contour form the question information, and the handwriting contour forms the answer information. It should be noted that a formula contour may also be included: because the question information contains formulas and graphs in addition to printed text, these three elements together form the complete question information. The handwriting contour may also contain text and formulas, but since all of it is handwritten, all of it is included in the answer information.
As a preferred mode, in the process of performing correction processing on the first target result, the first target result has already been calculated and identified, so the corresponding question information and answer information have been formed, each with a corresponding label. OCR recognition is then performed on the question information to obtain question text recognition information, and on the answer information to obtain answer text recognition information;
keywords are extracted according to the question text recognition information and the graph contour (and, where present, the formula contour), and a database is queried with the keywords to obtain a group of similar questions; the graph area of each question in the group is identified and its graph similarity to the graph contour is evaluated; when the graph similarity is greater than a preset similarity, the final question is determined from the group, and the corresponding answer analysis is retrieved according to the final question.
It should be noted that, as a first step, a group of similar questions is found according to the keywords in the question text recognition information, where the group includes at least one question; the final question is then located by combining the graph similarity between the graph area of each candidate question and the graph contour, and optionally the final question and the corresponding answer analysis are pushed to the user. Of course, depending on the actual situation, after the question is corrected, at least one of the points gained or lost, the correction marks and the score ranking related to the question may be pushed, and the correction may either be marked directly on the target picture or be sent to the intelligent terminal as a correction result.
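The two-step lookup described above can be sketched as follows. Everything concrete in the sketch is a hypothetical stand-in: the database is a list of dictionaries, keyword matching is a plain substring test, the graph similarity is a Jaccard overlap of descriptor sets, the preset similarity is 0.8, and the candidate with the highest similarity is chosen. The embodiment does not specify any of these details.

```python
def graph_similarity(a, b):
    """Toy similarity: Jaccard overlap of two descriptor sets (an assumed stand-in)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_final_question(keywords, graph_contour, database, preset_similarity=0.8):
    """Step 1: keyword query gives the similar question group.
    Step 2: keep the candidate whose graph similarity exceeds the preset similarity."""
    similar_group = [q for q in database if all(k in q["text"] for k in keywords)]
    best, best_score = None, preset_similarity
    for q in similar_group:
        score = graph_similarity(q["graph"], graph_contour)
        if score > best_score:
            best, best_score = q, score
    return best

# Hypothetical question database.
database = [
    {"id": 1, "text": "find the area of triangle ABC",
     "graph": {"triangle", "altitude"}, "analysis": "analysis for question 1"},
    {"id": 2, "text": "find the area of the triangle inside the circle",
     "graph": {"triangle", "circle"}, "analysis": "analysis for question 2"},
    {"id": 3, "text": "find the perimeter of the square",
     "graph": {"square"}, "analysis": "analysis for question 3"},
]
final = find_final_question(["area", "triangle"], {"triangle", "altitude"}, database)
```

Here the keyword query returns questions 1 and 2, and only question 1 passes the graph similarity check, so its answer analysis would be the one retrieved.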
Example 3:
Referring to fig. 3, embodiment 3 provides an apparatus applied to the scene text detection method in embodiment 1, including:
an acquisition unit configured to acquire a target picture, the target picture being transmitted by an intelligent terminal;
a generating unit configured to process the target picture according to a feature pyramid network, generate a feature map F, predict a probability map P and a threshold map T from the feature map F, and generate an approximate binary map B from the probability map P and the threshold map T;
a binarization unit configured to perform adaptive threshold processing on the approximate binary map B by using a differentiable binarization processing model to obtain a first target result, wherein the first target result comprises the different regions in the target picture.
The acquisition unit acquires the target picture, which includes information such as the test questions, the answers, the examinee, the examination time, the subject and the grade. An approximate binary map B is generated for the target picture, and adaptive threshold processing is performed on it by using the differentiable binarization processing model, so that the different text regions in the scene are accurately segmented.
Example 4:
Referring to fig. 4, embodiment 4 provides an apparatus applied to the correction method in embodiment 2, including the scene text detection apparatus in embodiment 3 and a correction unit, where the correction unit is configured to perform correction processing on the first target result to obtain a second target result.
It should be noted that a correction unit is added on the basis of embodiment 3, so that the different text areas are identified, judged and corrected according to the photograph-based correction requirements of the education field.
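The relationship between the apparatus of embodiment 3 and the apparatus of embodiment 4 can be sketched as a composition of units. The class names, the callable-based interfaces and the stub units below are hypothetical and serve only to show how the correction unit is layered on top of the three detection units.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SceneTextDetectionApparatus:
    acquire: Callable[[], Any]       # acquisition unit: returns the target picture
    generate: Callable[[Any], Any]   # generating unit: picture -> maps P, T and B
    binarize: Callable[[Any], Any]   # binarization unit: maps -> first target result

    def detect(self) -> Any:
        picture = self.acquire()
        maps = self.generate(picture)
        return self.binarize(maps)

@dataclass
class CorrectionApparatus:
    detector: SceneTextDetectionApparatus
    correct: Callable[[Any], Any]    # correction unit: first -> second target result

    def run(self) -> Any:
        return self.correct(self.detector.detect())

# Stub units standing in for the real processing steps.
detector = SceneTextDetectionApparatus(
    acquire=lambda: "picture",
    generate=lambda picture: {"picture": picture, "maps": ("P", "T", "B")},
    binarize=lambda maps: ["region-1", "region-2"],
)
apparatus = CorrectionApparatus(
    detector=detector,
    correct=lambda regions: {region: "checked" for region in regions},
)
second_target_result = apparatus.run()
```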
Example 5:
Embodiment 5 provides an electronic device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, at least one program, code set, or instruction set is loaded and executed by the processor to implement the scene text detection method in embodiment 1 or the correction method in embodiment 2.
Example 6:
Embodiment 6 provides a computer-readable storage medium on which computer instructions are stored, where the computer instructions, when executed by a processor, implement the steps of the scene text detection method in embodiment 1 or the steps of the correction method in embodiment 2.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
Compared with the prior art, the method and the apparatus apply a model detection algorithm to the target picture to generate a question information detection result and an answer information detection result, and perform OCR model recognition on the two detection results respectively to obtain text line recognition results and formula recognition results. This improves the efficiency of detecting and recognizing graphs and formulas in test questions and answers, and thereby improves correction efficiency.
The invention uses a differentiable binarization processing model to perform the binarization operation inside the segmentation network, so that binarization is optimized jointly with the network and the threshold adapts to each position of the heat map. This shortens the inference time of picture text recognition, improves the recognition and correction rate, gives high detection and recognition accuracy, and reduces the need for post-processing.
Finally, it should be emphasized that only preferred embodiments of the invention have been described above and that the present invention is not limited to them; any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present invention should be included in its protection scope.