CN113723352B - Text detection method, system, storage medium and electronic equipment - Google Patents
Text detection method, system, storage medium and electronic equipment
- Publication number
- CN113723352B CN113723352B CN202111069214.0A CN202111069214A CN113723352B CN 113723352 B CN113723352 B CN 113723352B CN 202111069214 A CN202111069214 A CN 202111069214A CN 113723352 B CN113723352 B CN 113723352B
- Authority
- CN
- China
- Prior art keywords
- feature map
- inputting
- attention
- residual
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/24 — Electric digital data processing; pattern recognition; analysing; classification techniques
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
- G06N3/048 — Computing arrangements based on biological models; neural networks; architecture; activation functions
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The embodiments of the invention provide a text detection method, system, storage medium, and electronic device, applicable to the fields of artificial intelligence and finance. The method comprises the following steps: extracting features from the image to be detected with an attention pyramid network model to obtain an attention pyramid feature map; selecting candidate boxes on the attention pyramid feature map with a region proposal network to obtain text candidate boxes; and inputting the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate-box classification prediction, so as to judge whether the region framed by each text candidate box is a text region, and obtain the text detection result. The attention pyramid network model performs saliency detection on the text in the image to be detected, highlighting the text while suppressing background information; this reduces interference caused by the background, improves the representational capability of the features, and improves the accuracy of text detection.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a text detection method, system, storage medium, and electronic device.
Background
With the rapid development of computer technology and mobile devices, many practical applications need to obtain the high-level semantic information contained in scene text. Text detection in bill images aims to accurately locate the text regions of the bill in the image, and the detection result directly affects the final recognition quality. However, because bill text is sparse in freely captured bill images, the large background area makes genuine text detection very difficult, and text detection accuracy is low.
Disclosure of Invention
The embodiments of the invention aim to provide a text detection method, system, storage medium, and electronic device that improve text detection accuracy. The specific technical scheme is as follows:
The invention provides a text detection method, which comprises the following steps:
acquiring an image to be detected;
extracting features from the image to be detected with an attention pyramid network model to obtain an attention pyramid feature map;
selecting candidate boxes on the attention pyramid feature map with a region proposal network to obtain text candidate boxes;
and inputting the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate-box classification prediction, so as to judge whether the region framed by each text candidate box is a text region, and obtain a text detection result.
Optionally, the attention pyramid network model includes a ResNet101 network, a global average pooling layer, a first residual module, a channel attention module, and a second residual module; the ResNet101 network comprises a top-layer convolution unit, middle-layer convolution units, and a bottom-layer convolution unit.
Extracting features from the image to be detected with the attention pyramid network model to obtain an attention pyramid feature map specifically includes:
inputting the image to be detected into the bottom-layer convolution unit for feature extraction to obtain a bottom-layer feature map; inputting the bottom-layer feature map into the middle-layer convolution unit for feature extraction to obtain a middle-layer feature map; inputting the middle-layer feature map into the top-layer convolution unit to obtain a top-layer feature map;
inputting the top-layer feature map into the global average pooling layer to obtain a pooling result;
inputting the top-layer feature map into the first residual module corresponding to the top-layer convolution unit to obtain a first residual feature map;
inputting the pooling result and the first residual feature map into the channel attention module corresponding to the top-layer convolution unit for weight adjustment to obtain a first channel attention feature map;
inputting the first channel attention feature map into the second residual module corresponding to the top-layer convolution unit to obtain a second residual feature map;
inputting the middle-layer feature map into the first residual module corresponding to the middle-layer convolution unit to obtain a third residual feature map;
inputting the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle-layer convolution unit for weight adjustment to obtain a second channel attention feature map;
and inputting the second channel attention feature map into the second residual module corresponding to the middle-layer convolution unit to obtain the attention pyramid feature map.
Optionally, inputting the top-layer feature map into the first residual module corresponding to the top-layer convolution unit to obtain a first residual feature map specifically includes:
inputting the top-layer feature map into a 1×1 convolution layer for channel merging to obtain a merged result;
inputting the merged result into a 3×3 convolution layer for expansion processing to obtain an expanded result;
inputting the expanded result into a Batch Norm layer for batch normalization to obtain a normalized result;
inputting the normalized result into a ReLU function, and reducing the size of the activated result through a 3×3 convolution layer to obtain a reduced result;
and inputting the sum of the top-layer feature map and the reduced result into a ReLU function to obtain the first residual feature map.
Optionally, inputting the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle-layer convolution unit for weight adjustment to obtain a second channel attention feature map specifically includes:
merging the second residual feature map and the third residual feature map to obtain a merged feature map;
inputting the merged feature map into a global pooling layer for compression to obtain a compressed feature map;
inputting the compressed feature map into a 1×1 convolution layer, and inputting the processed result into a ReLU function to obtain an output result;
inputting the output result into a 1×1 convolution layer, and inputting the processed result into a Sigmoid function to obtain an attention vector;
and adjusting the weights of the second residual feature map with the attention vector to obtain the second channel attention feature map.
Optionally, inputting the output result into a 1×1 convolution layer and inputting the processed result into a Sigmoid function to obtain an attention vector specifically includes:
inputting the output result into a 1×1 convolution layer for feature-map channel summation to obtain a score map;
determining a text prediction probability from the scores in the score map;
and obtaining the attention vector with a Sigmoid function according to the text prediction probability and the expected text probability.
Optionally, adjusting the weights of the second residual feature map with the attention vector to obtain the second channel attention feature map specifically includes:
multiplying the attention vector by the second residual feature map to obtain a product result;
and summing the product result with the merged feature map to obtain the second channel attention feature map.
Optionally, inputting the attention pyramid feature map and the candidate box information into a Faster R-CNN model for candidate-box classification prediction, so as to judge whether each text candidate box frames a text region and obtain a text detection result, specifically includes:
inputting the attention pyramid feature map and the candidate box information into the Faster R-CNN model for candidate-box classification prediction to obtain detection boxes;
and performing NMS deduplication on the detection boxes to obtain the text detection result.
The invention also provides a text detection system, comprising:
an image acquisition module, used to acquire an image to be detected;
a feature extraction module, used to extract features from the image to be detected with an attention pyramid network model to obtain an attention pyramid feature map;
a candidate box generation module, used to select candidate boxes on the attention pyramid feature map with a region proposal network to obtain text candidate boxes;
and a text detection module, used to input the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate-box classification prediction, so as to judge whether the region framed by each text candidate box is a text region and obtain a text detection result.
The present invention also provides a computer-readable storage medium having a program stored thereon, which when executed by a processor, implements the above-described text detection method.
The present invention also provides an electronic device including:
at least one processor, and at least one memory and a bus connected to the processor;
wherein the processor and the memory communicate with each other through the bus, and the processor is configured to call program instructions in the memory to execute the text detection method described above.
According to the text detection method, system, storage medium, and electronic device provided by the embodiments of the invention, an attention pyramid network model extracts features from the image to be detected to obtain an attention pyramid feature map; a region proposal network selects candidate boxes on the attention pyramid feature map to obtain text candidate boxes; and the attention pyramid feature map and the position information of the candidate boxes are input into a Faster R-CNN model for candidate-box classification prediction, so as to judge whether the region framed by each text candidate box is a text region and obtain a text detection result. The attention pyramid network model performs saliency detection on the text in the image to be detected, highlighting the text while suppressing background information; this reduces interference caused by the background, improves the representational capability of the features, and improves text detection accuracy.
Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a text detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an attention pyramid network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a refinement residual module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a channel attention module according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a text detection flow provided in an embodiment of the present invention;
FIG. 6 is a block diagram of a text detection system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a text detection method, as shown in fig. 1, which comprises the following steps:
Step 101: an image to be detected is acquired.
Step 102: features are extracted from the image to be detected with the attention pyramid network model to obtain an attention pyramid feature map.
The architecture of the attention pyramid network (Attention Pyramid Network, APN) is shown in fig. 2. The attention pyramid network model includes a ResNet101 network, a global average pooling layer (Global Average Pooling, GAP), two refinement residual modules (Refinement Residual Block, RRB) serving as the first and second residual modules, and a channel attention module (Channel Attention Block, CAB). The CAB uses an attention mechanism to adaptively recalibrate the weights of the feature-map channels at each stage, enhancing the representation of target features and obtaining more discriminative features; the RRB is added at the feature layer of each stage to merge channel information and introduce a residual block that optimizes the network, refining the feature map.
The ResNet101 network has five groups of convolutions, comprising the top-layer convolution unit Res5, the middle-layer convolution units, and the bottom-layer convolution unit conv1; there are three middle-layer convolution units, namely Res2, Res3, and Res4. The five groups of convolutions comprise 101 layers in total; the input size of conv1 is 224×224 and the output size of Res5 is 7×7, each convolution stage halving the spatial size, for a total reduction of 32× across the five groups. Finally, to address the lack of global information in the network, the APN introduces a global average pooling layer (GAP) on top of the ResNet101 network to provide global context information and enforce a high-level consistency constraint, completing the network construction.
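The 224×224 → 7×7 size relation stated above can be checked with simple arithmetic; the following sketch (illustrative only, not part of the patent) applies one halving per convolution stage:

```python
def resnet_output_size(input_size: int, num_stages: int = 5) -> int:
    """Spatial size after num_stages convolution stages, each of which
    halves the spatial resolution (2**5 = 32x total reduction)."""
    size = input_size
    for _ in range(num_stages):
        size //= 2  # each stage downsamples by a factor of 2
    return size
```

With `input_size=224` and five stages this yields 7, matching the conv1 input size and Res5 output size given above.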
The attention fusion module in the APN structure comprises a refinement residual module RRB and a channel attention module CAB. The RRB module with the transverse connection function is used for merging channel information, introducing a residual block to optimize a network, and realizing the effect of refining the feature map; from the aspect of the feature map channel, the CAB module utilizes the attention mechanism to fuse the features of adjacent stages to correct the weight of each channel, and obtains the features with more discriminant ability for the subsequent text detection task.
Step 102 specifically includes:
1) Inputting the image to be detected into a bottom layer convolution unit for feature extraction to obtain a bottom layer feature map; inputting the bottom layer characteristic diagram into a middle layer convolution unit for characteristic extraction to obtain a middle layer characteristic diagram; and inputting the middle layer characteristic diagram into a top layer convolution unit to obtain a top layer characteristic diagram.
Specifically: the image to be detected is input into conv1 for feature extraction to obtain the bottom-layer feature map; the bottom-layer feature map is input into Res2 to obtain a first middle-layer feature map; the first middle-layer feature map is input into Res3 to obtain a second middle-layer feature map; the second middle-layer feature map is input into Res4 to obtain a third middle-layer feature map; and the third middle-layer feature map is input into the top-layer convolution unit Res5 to obtain the top-layer feature map.
2) The top-layer feature map is input into the global average pooling layer to obtain a pooling result.
Specifically, the top-layer feature map is input into the global average pooling layer GAP to obtain the pooling result.
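Global average pooling collapses each channel of the top-layer feature map to a single value, giving a per-channel global context vector. A minimal numpy sketch (illustrative only; the function name is not from the patent):

```python
import numpy as np

def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) feature map to a (C, 1, 1) global context
    vector by averaging over the spatial dimensions."""
    return feature_map.mean(axis=(1, 2), keepdims=True)
```

For example, a 2-channel 3×4 feature map pools down to a (2, 1, 1) vector holding each channel's mean.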
3) And inputting the top-layer feature map into a first residual error module corresponding to the top-layer convolution unit to obtain a first residual error feature map.
As shown in fig. 3, this step specifically includes:
The top-layer feature map is input into a 1×1 convolution layer (1×1 conv) for channel merging to obtain a merged result; the 1×1 convolution merges the information of all channels and fixes the number of feature-map channels coming from the convolutional neural network (CNN) stages at 512. The merged result is input into a 3×3 convolution layer (3×3 conv) for expansion to obtain an expanded result; the expanded result is input into a Batch Norm layer (a batch normalization layer that keeps the data distribution consistent and speeds up training) for batch normalization to obtain a normalized result; the normalized result is input into a ReLU function, and the size of the activated result is reduced through a 3×3 convolution layer to obtain a reduced result; and the sum of the top-layer feature map and the reduced result is input into a ReLU function to obtain the first residual feature map.
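The RRB data flow above (1×1 channel merge, conv → BN → ReLU → conv refinement branch, residual sum, final ReLU) can be sketched in numpy. This is an illustrative simplification, not the patent's implementation: the 3×3 convolutions are stood in for by 1×1 channel maps, weights are random, and the identity branch is taken from the merged map so channel counts match:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # w has shape (C_out, C_in), x has shape (C_in, H, W).
    return np.einsum('oc,chw->ohw', w, x)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Inference-style per-channel normalization (no learned scale/shift).
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rrb(x, c_mid=8):
    """Refinement residual block sketch: a 1x1 conv merges channel
    information, a conv -> BN -> ReLU -> conv branch refines it, and the
    branch output is summed with the merged map before a final ReLU."""
    c = x.shape[0]
    w_merge = rng.standard_normal((c, c)) * 0.1      # 1x1 channel-merge conv
    w_expand = rng.standard_normal((c_mid, c)) * 0.1 # stands in for the expanding 3x3 conv
    w_reduce = rng.standard_normal((c, c_mid)) * 0.1 # stands in for the reducing 3x3 conv
    merged = conv1x1(x, w_merge)
    branch = conv1x1(relu(batch_norm(conv1x1(merged, w_expand))), w_reduce)
    return relu(merged + branch)                     # residual sum, then ReLU
```

The block preserves the input's shape, which is what lets the RRB sit at each stage of the pyramid without disturbing the lateral connections.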
4) The pooling result and the first residual feature map are input into the channel attention module corresponding to the top-layer convolution unit for weight adjustment to obtain a first channel attention feature map.
As shown in fig. 4, the channel attention module CAB combines the features of adjacent stages, taking the pooling result and the first residual feature map as inputs so as to fully exploit the differences between stages. The CAB first concatenates the high-stage RRB features (the pooling result) with the low-stage features (the first residual feature map), explicitly modelling the dependence between channels. A global pooling layer (Global pool) then compresses the feature map to generate per-channel statistics; two 1×1 convolutions (1×1 conv) with a ReLU function reduce model complexity and aid generalization; and a Sigmoid function learns the dependence between channels and produces the attention vector. The attention vector then adjusts the weights of the low-stage feature channels: it is multiplied (mul) with the first residual feature map to obtain a product result, and the product result is summed (sum) with the pooling result to finally obtain the first channel attention feature map.
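The CAB steps just described (concatenate, global pool, two 1×1 convs with ReLU, Sigmoid, multiply, sum) can be sketched as follows. This is an illustrative simplification with random weights, and it assumes the high- and low-stage inputs share a shape so the final fusion sum is valid; none of the names are from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cab(high, low, rng=np.random.default_rng(1)):
    """Channel attention block sketch: concatenate high- and low-stage
    features, squeeze them to per-channel statistics with global pooling,
    map the statistics through two 1x1 convs (ReLU in between) and a
    Sigmoid to get an attention vector, reweight the low-stage features
    with it, and fuse the result with the high-stage features."""
    concat = np.concatenate([high, low], axis=0)       # (C_h + C_l, H, W)
    stats = concat.mean(axis=(1, 2))                   # global pool -> (C_h + C_l,)
    c = stats.shape[0]
    w1 = rng.standard_normal((c, c)) * 0.1             # first 1x1 conv
    w2 = rng.standard_normal((low.shape[0], c)) * 0.1  # second 1x1 conv
    attn = sigmoid(w2 @ np.maximum(w1 @ stats, 0.0))   # attention vector in (0, 1)
    return high + low * attn[:, None, None]            # mul, then sum
```

Because the Sigmoid keeps every attention weight strictly between 0 and 1, the low-stage contribution is scaled down per channel rather than switched off outright.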
5) The first channel attention feature map is input into the second residual module corresponding to the top-layer convolution unit to obtain a second residual feature map.
As shown in fig. 3, the method for generating the second residual feature map is similar to the method for generating the first residual feature map, and will not be described again.
6) The middle-layer feature map is input into the first residual module corresponding to the middle-layer convolution unit to obtain a third residual feature map.
Specifically: the third middle-layer feature map is input into the first residual module corresponding to the third middle-layer convolution unit to obtain the third residual feature map;
the second middle-layer feature map is input into the first residual module corresponding to the second middle-layer convolution unit to obtain a second residual feature map;
and the first middle-layer feature map is input into the first residual module corresponding to the first middle-layer convolution unit to obtain a first residual feature map.
7) The second residual feature map and the third residual feature map are input into the channel attention module corresponding to the middle-layer convolution unit for weight adjustment to obtain a second channel attention feature map.
The method specifically comprises the following steps:
The second residual feature map and the third residual feature map are merged to obtain a merged feature map; the merged feature map is input into a global pooling layer for compression to obtain a compressed feature map; the compressed feature map is input into a 1×1 convolution layer and the result is input into a ReLU function to obtain an output result; the output result is input into a 1×1 convolution layer and the result is input into a Sigmoid function to obtain an attention vector; and the weights of the second residual feature map are adjusted with the attention vector to obtain the second channel attention feature map.
Optionally, adjusting the weights of the second residual feature map with the attention vector to obtain the second channel attention feature map specifically includes:
multiplying the attention vector by the second residual feature map to obtain a product result; and summing the product result with the merged feature map to obtain the second channel attention feature map.
Optionally, inputting the output result into a 1×1 convolution layer and inputting the processed result into a Sigmoid function to obtain an attention vector specifically includes:
inputting the output result into a 1×1 convolution layer for feature-map channel summation to obtain a score map; determining a text prediction probability from the scores in the score map; and obtaining the attention vector with a Sigmoid function according to the text prediction probability and the expected text probability.
The CAB aims to fuse the features of adjacent stages, compute an attention vector for each channel, and adjust the feature weight of each stage to optimize feature consistency. After the APN is extended to an FCN (Fully Convolutional Network) architecture, the convolution operation outputs a score map giving the probability of each pixel belonging to each class, as in formula (1-1); the final score y_n of the score map is simply the sum over the channels of all feature maps:

y_n = F(x; k) = Σ_{(i,j)∈D} k_{i,j} · x_{i,j}  (1-1)

In formula (1-1), x is the feature output by the network, k is the convolution kernel, n ∈ {1, 2, …, N} where N is the number of channels, and D is the set of pixel positions (i indexes rows, j indexes columns). Formula (1-1) implicitly gives every channel an equal weight. The channel attention weight is computed as in formulas (1-2) and (1-3). In formula (1-2), δ is the prediction probability and y is the network output:

δ_n(y) = exp(y_n) / Σ_{m=1}^{N} exp(y_m)  (1-2)

The final predicted label is the class with the highest probability, obtained from formulas (1-2) and (1-3). Assuming the predicted result is y_0 while the true label is y_1, the attention weight parameter corrects the highest-probability value from y_0 to y_1, as in formula (1-3):

ȳ = α · y, where α = Sigmoid(x; k)  (1-3)

In formula (1-3), ȳ represents the new prediction of the network. To obtain consistent and accurate predictions, discriminative features are extracted and non-discriminative features are suppressed; the value α in formula (1-3) is therefore the attention weight applied to the feature map x, indicating that attention-based feature selection is performed with the CAB. In this way the network progressively obtains discriminative features and ensures the consistency of the predicted class.
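The effect of the channel attention weight can be illustrated numerically. The sketch below (values chosen purely for illustration) applies a softmax to per-channel scores as in formula (1-2), then rescales the scores with an attention vector α as in formula (1-3), which moves the highest probability to a different class:

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax over a score vector, as in formula (1-2)."""
    e = np.exp(y - y.max())
    return e / e.sum()

# Per-channel scores y for one pixel; without attention, class 0 wins.
y = np.array([2.0, 1.5, 0.5])
delta = softmax(y)                   # prediction probabilities delta_n(y)

# An attention vector alpha in (0, 1), one weight per channel, rescales
# the scores (y_bar = alpha * y) and can shift the highest probability.
alpha = np.array([0.2, 0.9, 0.1])    # illustrative attention values only
y_bar = alpha * y
delta_bar = softmax(y_bar)
```

Here the uncorrected prediction picks class 0, while the attention-weighted scores favour class 1, mirroring the y_0-to-y_1 correction described above.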
To refine features more accurately, a deep-supervision method is used to obtain better performance and optimize the network: in the APN of the present invention, a Softmax loss function supervises the upsampled output of every stage except the global average pooling layer, as in formula (1-4):
L = SoftmaxLoss(y; k)  (1-4)
8) The second channel attention feature map is input into the second residual module corresponding to the middle-layer convolution unit to obtain the attention pyramid feature map.
Specifically, the second channel attention feature map is input into the second residual module corresponding to the third middle-layer convolution unit to obtain a first attention pyramid feature map.
It should be noted that, since there are three middle-layer feature maps, three attention pyramid feature maps can be obtained, namely a first attention pyramid feature map, a second attention pyramid feature map, and a third attention pyramid feature map.
Step 103: candidate boxes are selected on the attention pyramid feature map with the region proposal network to obtain text candidate boxes.
As shown in fig. 5, after feature extraction is performed on the image to be detected with the attention pyramid network, the attention pyramid features are obtained and fed to both the region proposal network and the Faster R-CNN model. The attention pyramid features and the anchor boxes are input into the region proposal network, and after text binary classification and rectangular bounding-box regression, refined rectangular text candidate boxes are obtained. The method uses the region proposal network to generate candidate boxes from the attention pyramid feature map output by the APN, and extracts the corresponding effective RoI features for each candidate box.
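The anchor boxes fed to the region proposal network are typically enumerated per feature-map cell at several scales and aspect ratios. The following sketch is a generic illustration of that enumeration (the stride, scales, and ratios are conventional defaults, not values from the patent):

```python
def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Enumerate anchor boxes (x1, y1, x2, y2) in image coordinates,
    centred on each cell of a feat_h x feat_w feature map."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride          # anchor centre in image coords
            cy = (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * stride * (r ** 0.5)  # aspect ratio r = w / h
                    h = s * stride / (r ** 0.5)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

Each cell contributes scales × ratios anchors, so a 2×3 map with the defaults yields 54 candidate boxes for the RPN to classify and regress.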
Step 104: input the attention pyramid feature map and the position information of the text candidate boxes into the Fast R-CNN model for candidate-box classification prediction, so as to judge whether the region framed by each text candidate box is a text region, and obtain a text detection result.
A classifier in the Fast R-CNN module distinguishes the category of each extracted RoI, judges whether it is text, and directly outputs the corrected text candidate boxes as the text detection result.
Step 104 specifically includes:
Inputting the attention pyramid feature map and the candidate box information into the Fast R-CNN model for candidate-box classification prediction to obtain detection boxes; and performing NMS de-duplication on the detection boxes to obtain the text detection result.
As shown in fig. 5, after the attention pyramid feature map and the candidate box information are input into the Fast R-CNN model, its output undergoes text binary classification and quadrilateral candidate-box regression, and the resulting detection boxes undergo NMS de-duplication to produce the text detection result. The invention performs finer classification and bounding-box regression on the text candidate boxes obtained in step 103: the classification task learns to judge whether a candidate box is a text region or a background region, the regression task learns to regress the quadrilateral bounding-box position information, and finally NMS de-duplication is performed on the candidate boxes to obtain the final text prediction result.
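A minimal sketch of the NMS de-duplication step described above, for axis-aligned boxes; the IoU threshold is an assumption, since the patent does not state one:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

# Two heavily overlapping detections and one distinct one.
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)],
           [0.9, 0.8, 0.7])
```

For the quadrilateral boxes the patent regresses, a polygon-intersection IoU would replace the axis-aligned one; the greedy loop is unchanged.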
The invention is based on the attention-mechanism feature extraction model APN: it extracts more discriminative features by applying an attention mechanism on top of the ResNet base model, provides saliency detection of text regions, highlights text information while suppressing background information, and reduces false alarms caused by text-like background interference. The channel attention module uses the attention mechanism to fuse adjacent-stage features and correct the weight of each channel of the feature map, obtaining more discriminative features. The refinement residual module acts as a lateral connection, merges channel information, and introduces residual blocks to optimize the network, refining the feature map.
As an optional embodiment, the invention detects bank bill text based on the attention pyramid network. By combining the feature pyramid structure with the attention mechanism, channel attention vectors adjust the weight parameters at the text feature-map level, guiding the fusion of high-stage and low-stage features to enhance feature consistency, improve feature representation, and select better bill-text features, thereby improving text detection. This addresses the problem that, in freely captured bank bill images, bill text is sparse and the large background area makes true text detection difficult, leading to low detection accuracy.
The present invention also provides a text detection system, as shown in fig. 6, which includes:
The image acquisition module 601 is configured to acquire an image to be detected.
The feature extraction module 602 is configured to perform feature extraction on an image to be detected by using the attention pyramid network model, so as to obtain an attention pyramid feature map.
The attention pyramid network model comprises a ResNet101 network, a global average pooling layer, a first residual module, a channel attention module, and a second residual module; the ResNet101 network includes a top layer convolution unit, a middle layer convolution unit, and a bottom layer convolution unit.
The feature extraction module 602 specifically includes:
The feature extraction unit is used for inputting the image to be detected into the bottom layer convolution unit for feature extraction to obtain a bottom layer feature map; inputting the bottom layer feature map into the middle layer convolution unit for feature extraction to obtain a middle layer feature map; and inputting the middle layer feature map into the top layer convolution unit to obtain a top layer feature map.
And the pooling processing unit is used for inputting the top-level feature map into a global average pooling layer to obtain a pooling processing result.
The first residual feature map generating unit is used for inputting the top-level feature map into a first residual module corresponding to the top-level convolution unit to obtain a first residual feature map.
The first channel attention feature map generating unit is used for inputting the pooling processing result and the first residual feature map into the channel attention module corresponding to the top-layer convolution unit to carry out weight adjustment, so as to obtain a first channel attention feature map.
The second residual feature map generating unit is used for inputting the first channel attention feature map into a second residual module corresponding to the top-level convolution unit to obtain a second residual feature map.
And the third residual feature map generating unit is used for inputting the middle layer feature map into the first residual module corresponding to the middle layer convolution unit to obtain a third residual feature map.
The second channel attention feature map generating unit is used for inputting the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle layer convolution unit to carry out weight adjustment so as to obtain a second channel attention feature map;
And the attention pyramid feature map generating unit is used for inputting the second channel attention feature map into a second residual module corresponding to the middle layer convolution unit to obtain the attention pyramid feature map.
Wherein,
The first residual feature map generating unit is specifically configured to: input the top layer feature map into a 1×1 convolution layer for channel combination to obtain a combined result; input the combined result into a 3×3 convolution layer for size amplification to obtain an amplified result; input the amplified result into a Batch Norm layer for batch normalization to obtain a normalized result; input the normalized result into a ReLU function, and reduce the size of the resulting output through a 3×3 convolution layer to obtain a reduced result; and input the sum of the top layer feature map and the reduced result into a ReLU function to obtain the first residual feature map.
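The convolution flow above cannot be reproduced faithfully without a tensor library; as a hedged structural sketch, the block computes y = ReLU(x + F(x)), where F below stands in for the whole 1×1 conv → 3×3 conv → BatchNorm → ReLU → 3×3 conv branch, and the lambda transform is hypothetical:

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalize a list of activations to zero mean, unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def residual_refine(x, f):
    """Refinement residual block: y = ReLU(x + F(x)).
    `f` stands in for the convolution layers of the branch; only the
    skip connection, BatchNorm, and ReLU structure is modeled here."""
    branch = [max(0.0, v) for v in batch_norm([f(v) for v in x])]
    return [max(0.0, xi + bi) for xi, bi in zip(x, branch)]

out = residual_refine([1.0, -2.0, 3.0], lambda v: 2.0 * v)
```

The skip connection (`xi + bi`) is what makes the module a residual block: the branch only has to learn a correction to the incoming feature map.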
The second channel attention feature map generating unit is specifically configured to:
combine the second residual feature map and the third residual feature map to obtain a combined feature map; input the combined feature map into a global pooling layer for compression to obtain a compressed feature map; input the compressed feature map into a 1×1 convolution layer, and input the processed result into a ReLU function to obtain an output result; input the output result into a 1×1 convolution layer, and input the processed result into a Sigmoid function to obtain an attention vector; and adjust the weights of the second residual feature map with the attention vector to obtain the second channel attention feature map.
Optionally, inputting the output result into a 1×1 convolution layer and inputting the processed result into a Sigmoid function to obtain the attention vector specifically includes: inputting the output result into a 1×1 convolution layer for feature-map channel summation to obtain a score map; determining a text prediction probability from the scores in the score map; and obtaining the attention vector with a Sigmoid function according to the text prediction probability and the expected text probability.
Optionally, adjusting the weights of the second residual feature map with the attention vector to obtain the second channel attention feature map specifically includes: multiplying the attention vector with the second residual feature map to obtain a product result; and summing the product result with the combined feature map to obtain the second channel attention feature map.
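A minimal pure-Python sketch of this squeeze → excite → reweight → sum flow, with the two 1×1 convolutions approximated as per-channel scalar maps; the weights, sizes, and example values are hypothetical (real implementations use learned convolution weights over many channels):

```python
import math

def channel_attention(combined, residual, w1, w2):
    """combined, residual: [C][H*W] feature maps as nested lists.
    Squeeze: global average pool each channel; excite: per-channel map
    with ReLU then Sigmoid; reweight residual channel-wise; add combined."""
    pooled = [sum(ch) / len(ch) for ch in combined]             # global pooling
    hidden = [max(0.0, w1 * p) for p in pooled]                 # 1x1 conv + ReLU
    attn = [1.0 / (1.0 + math.exp(-w2 * h)) for h in hidden]    # 1x1 conv + Sigmoid
    # Product with the residual feature map, then sum with the combined map.
    return [[a * r + c for r, c in zip(r_ch, c_ch)]
            for a, r_ch, c_ch in zip(attn, residual, combined)]

combined = [[1.0, 2.0], [0.0, 0.0]]   # two channels, two spatial positions each
residual = [[1.0, 1.0], [1.0, 1.0]]
out = channel_attention(combined, residual, w1=1.0, w2=1.0)
```

Channels with stronger pooled responses receive attention weights nearer 1, so the corresponding residual channels pass through with less attenuation before the skip addition.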
The candidate box generating module 603 is configured to select candidate boxes for the attention pyramid feature map by using the region proposal network to obtain text candidate boxes.
The text detection module 604 is configured to input the attention pyramid feature map and the position information of the text candidate boxes into the Fast R-CNN model for candidate-box classification prediction, so as to judge whether the region framed by each text candidate box is a text region, and obtain a text detection result.
The text detection module 604 is specifically configured to: input the attention pyramid feature map and the candidate box information into the Fast R-CNN model for candidate-box classification prediction to obtain detection boxes; and perform NMS de-duplication on the detection boxes to obtain the text detection result.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a program that, when executed by a processor, implements the above-described text detection method.
An embodiment of the present invention provides an electronic device, as shown in fig. 7, where the electronic device 70 includes at least one processor 701, and at least one memory 702 and a bus 703 connected to the processor 701; the processor 701 and the memory 702 communicate with each other through the bus 703; the processor 701 is configured to invoke the program instructions in the memory 702 to perform the text detection method described above. The electronic device may be a server, a PC, a tablet, a mobile phone, etc.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the text detection method described above.
It should be noted that the text detection method, the system, the storage medium and the electronic device provided by the invention can be applied to the field of artificial intelligence or the field of finance. The foregoing is merely an example, and the application fields of the text detection method, the system, the storage medium and the electronic device provided by the present invention are not limited.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, the device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in computer readable media, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.
Claims (9)
1. A text detection method, comprising:
Acquiring an image to be detected;
Extracting features of the image to be detected by adopting an attention pyramid network model to obtain an attention pyramid feature map;
selecting candidate boxes for the attention pyramid feature map by adopting a region proposal network to obtain text candidate boxes;
Inputting the attention pyramid feature map and the position information of the text candidate boxes into a Fast R-CNN model for candidate-box classification prediction processing, so as to judge whether the region framed by each text candidate box is a text region, and obtain a text detection result;
The attention pyramid network model comprises a ResNet101 network, a global average pooling layer, a first residual module, a channel attention module and a second residual module; the ResNet101 network comprises a top layer convolution unit, a middle layer convolution unit and a bottom layer convolution unit;
The method for extracting the features of the image to be detected by adopting the attention pyramid network model to obtain an attention pyramid feature map specifically comprises the following steps:
Inputting the image to be detected into the bottom convolution unit for feature extraction to obtain a bottom feature map; inputting the bottom layer feature map into the middle layer convolution unit for feature extraction to obtain a middle layer feature map; inputting the middle layer feature map into the top layer convolution unit to obtain a top layer feature map;
Inputting the top-level feature map into a global average pooling layer to obtain a pooling processing result;
inputting the top-level feature map into a first residual module corresponding to the top-level convolution unit to obtain a first residual feature map;
inputting the pooling processing result and the first residual feature map into a channel attention module corresponding to the top-level convolution unit for weight adjustment, so as to obtain a first channel attention feature map;
Inputting the first channel attention feature map into a second residual module corresponding to the top-level convolution unit to obtain a second residual feature map;
Inputting the middle layer feature map into a first residual module corresponding to the middle layer convolution unit to obtain a third residual feature map;
Inputting the second residual feature map and the third residual feature map into a channel attention module corresponding to the middle layer convolution unit for weight adjustment to obtain a second channel attention feature map;
And inputting the second channel attention feature map into a second residual module corresponding to the middle layer convolution unit to obtain an attention pyramid feature map.
2. The text detection method according to claim 1, wherein the inputting the top-level feature map into a first residual module corresponding to the top-level convolution unit, to obtain a first residual feature map, specifically includes:
Inputting the top-layer feature map into a 1×1 convolution layer for channel combination processing to obtain a combined result;
inputting the combined result into a 3×3 convolution layer for size amplification processing to obtain an amplified result;
inputting the amplified result into a Batch Norm layer for batch normalization processing to obtain a normalized result;
inputting the normalized result into a ReLU function, and reducing the size of the resulting output through a 3×3 convolution layer to obtain a reduced result;
And inputting the result obtained by summing the top-layer feature map and the reduced result into a ReLU function to obtain the first residual feature map.
3. The text detection method according to claim 1, wherein the inputting the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle layer convolution unit performs weight adjustment to obtain a second channel attention feature map, and specifically includes:
Combining the second residual feature map and the third residual feature map to obtain a combined feature map;
Inputting the combined feature map into a global pooling layer for compression processing to obtain a compressed feature map;
inputting the compressed feature map into a 1×1 convolution layer for processing, and inputting the processed result into a ReLU function to obtain an output result;
Inputting the output result into a 1×1 convolution layer for processing, and inputting the processed result into a Sigmoid function to obtain an attention vector;
And carrying out weight adjustment on the second residual feature map by using the attention vector to obtain a second channel attention feature map.
4. The text detection method according to claim 3, wherein inputting the output result into a 1×1 convolution layer for processing and inputting the processed result into a Sigmoid function to obtain an attention vector specifically includes:
inputting the output result into a 1×1 convolution layer to perform feature-map channel summation processing to obtain a score map;
determining a text prediction probability by using the scores in the score map;
And obtaining the attention vector by using a Sigmoid function according to the text prediction probability and the text expected probability.
5. The text detection method of claim 3, wherein the weighting adjustment is performed on the second residual feature map by using the attention vector to obtain a second channel attention feature map, and specifically includes:
Performing a product operation on the attention vector and the second residual feature map to obtain a product operation result;
And carrying out summation operation on the product operation result and the combined feature map to obtain a second channel attention feature map.
6. The text detection method according to claim 1, wherein inputting the attention pyramid feature map and the position information of the text candidate boxes into a Fast R-CNN model for candidate-box classification prediction processing to judge whether the region framed by each text candidate box is a text region, so as to obtain a text detection result, specifically includes:
inputting the attention pyramid feature map and the position information of the text candidate boxes into the Fast R-CNN model for candidate-box classification prediction processing to obtain detection boxes;
and performing NMS de-duplication processing on the detection boxes to obtain a text detection result.
7. A text detection system, comprising:
The image acquisition module is used for acquiring an image to be detected;
the feature extraction module is used for extracting features of the image to be detected by adopting an attention pyramid network model to obtain an attention pyramid feature map;
the candidate box generation module is used for selecting candidate boxes for the attention pyramid feature map by adopting a region proposal network to obtain text candidate boxes;
The text detection module is used for inputting the attention pyramid feature map and the position information of the text candidate boxes into a Fast R-CNN model for candidate-box classification prediction processing, so as to judge whether the region framed by each text candidate box is a text region, and obtain a text detection result;
The attention pyramid network model comprises a ResNet101 network, a global average pooling layer, a first residual module, a channel attention module and a second residual module; the ResNet101 network comprises a top layer convolution unit, a middle layer convolution unit and a bottom layer convolution unit;
the feature extraction module specifically comprises:
the feature extraction unit is used for inputting the image to be detected into the bottom layer convolution unit for feature extraction to obtain a bottom layer feature map; inputting the bottom layer feature map into the middle layer convolution unit for feature extraction to obtain a middle layer feature map; inputting the middle layer feature map into the top layer convolution unit to obtain a top layer feature map;
The pooling processing unit is used for inputting the top-level feature map into a global average pooling layer to obtain a pooling processing result;
the first residual feature map generating unit is used for inputting the top-level feature map into a first residual module corresponding to the top-level convolution unit to obtain a first residual feature map;
The first channel attention feature map generating unit is used for inputting the pooling processing result and the first residual feature map into a channel attention module corresponding to the top-layer convolution unit for weight adjustment to obtain a first channel attention feature map;
the second residual feature map generating unit is used for inputting the first channel attention feature map into a second residual module corresponding to the top-level convolution unit to obtain a second residual feature map;
the third residual feature map generating unit is used for inputting the middle layer feature map into the first residual module corresponding to the middle layer convolution unit to obtain a third residual feature map;
The second channel attention feature map generating unit is used for inputting the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle layer convolution unit to carry out weight adjustment so as to obtain a second channel attention feature map;
and the attention pyramid feature map generating unit is used for inputting the second channel attention feature map into a second residual module corresponding to the middle layer convolution unit to obtain the attention pyramid feature map.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a program which, when executed by a processor, implements the text detection method of any of claims 1-6.
9. An electronic device, comprising:
At least one processor, and at least one memory, bus, connected to the processor;
The processor and the memory complete communication with each other through the bus; the processor is configured to invoke program instructions in the memory to perform the text detection method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111069214.0A CN113723352B (en) | 2021-09-13 | 2021-09-13 | Text detection method, system, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113723352A CN113723352A (en) | 2021-11-30 |
CN113723352B true CN113723352B (en) | 2024-08-02 |
Family
ID=78683569
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111069214.0A Active CN113723352B (en) | 2021-09-13 | 2021-09-13 | Text detection method, system, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113723352B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114743206B (en) * | 2022-05-17 | 2023-10-27 | 北京百度网讯科技有限公司 | Text detection method, model training method, device and electronic equipment |
CN117315702B (en) * | 2023-11-28 | 2024-02-23 | 山东正云信息科技有限公司 | Text detection method, system and medium based on set prediction |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110276316A (en) * | 2019-06-26 | 2019-09-24 | 电子科技大学 | A kind of human body critical point detection method based on deep learning |
CN110895695A (en) * | 2019-07-31 | 2020-03-20 | 上海海事大学 | Deep learning network for character segmentation of text picture and segmentation method |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10671878B1 (en) * | 2019-01-11 | 2020-06-02 | Capital One Services, Llc | Systems and methods for text localization and recognition in an image of a document |
KR102261894B1 (en) * | 2019-06-13 | 2021-06-08 | 네이버 주식회사 | Apparatus and method for object detection |
US10699715B1 (en) * | 2019-12-27 | 2020-06-30 | Alphonso Inc. | Text independent speaker-verification on a media operating system using deep learning on raw waveforms |
CN111291759A (en) * | 2020-01-17 | 2020-06-16 | 北京三快在线科技有限公司 | Character detection method and device, electronic equipment and storage medium |
CN111401201B (en) * | 2020-03-10 | 2023-06-20 | 南京信息工程大学 | Aerial image multi-scale target detection method based on spatial pyramid attention drive |
CN111626300B (en) * | 2020-05-07 | 2022-08-26 | 南京邮电大学 | Image segmentation method and modeling method of image semantic segmentation model based on context perception |
CN111914843B (en) * | 2020-08-20 | 2021-04-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Character detection method, system, equipment and storage medium |
CN112232232B (en) * | 2020-10-20 | 2022-09-27 | 城云科技(中国)有限公司 | Target detection method |
CN112465820A (en) * | 2020-12-22 | 2021-03-09 | 中国科学院合肥物质科学研究院 | Semantic segmentation based rice disease detection method integrating global context information |
- 2021-09-13 CN CN202111069214.0A patent/CN113723352B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110276316A (en) * | 2019-06-26 | 2019-09-24 | 电子科技大学 | A kind of human body critical point detection method based on deep learning |
CN110895695A (en) * | 2019-07-31 | 2020-03-20 | 上海海事大学 | Deep learning network for character segmentation of text picture and segmentation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tian et al. | A dual neural network for object detection in UAV images | |
CN112733749B (en) | Real-time pedestrian detection method integrating attention mechanism | |
Kang et al. | YOLO-FA: Type-1 fuzzy attention based YOLO detector for vehicle detection | |
Hashmi et al. | An exploratory analysis on visual counterfeits using conv-lstm hybrid architecture | |
CN113723352B (en) | Text detection method, system, storage medium and electronic equipment | |
CN116310850B (en) | Remote sensing image target detection method based on improved RetinaNet | |
CN117036941A (en) | Building change detection method and system based on twin Unet model | |
Wang et al. | MFANet: multi-scale feature fusion network with attention mechanism | |
CN115631112B (en) | Building contour correction method and device based on deep learning | |
CN114821823B (en) | Image processing, training of human face anti-counterfeiting model and living body detection method and device | |
Pasqualino et al. | A multi camera unsupervised domain adaptation pipeline for object detection in cultural sites through adversarial learning and self-training | |
CN115757386B (en) | Anomaly detection method, system, equipment and medium for ocean space observation data | |
CN116682076A (en) | Multi-scale target detection method, system and equipment for ship safety supervision | |
US12112524B2 (en) | Image augmentation method, electronic device and readable storage medium | |
CN112967200B (en) | Image processing method, apparatus, electronic device, medium, and computer program product | |
Zheng et al. | Little-YOLOv4: A Lightweight Pedestrian Detection Network Based on YOLOv4 and GhostNet | |
CN110610185B (en) | Method, device and equipment for detecting salient object of image | |
Gong et al. | Improved U-Net-Like Network for Visual Saliency Detection Based on Pyramid Feature Attention | |
CN115393914A (en) | Multitask model training method, device, equipment and storage medium | |
CN114662614B (en) | Training method of image classification model, image classification method and device | |
CN115982757B (en) | Method, device and equipment for determining privacy protection degree of model | |
KR101991043B1 (en) | Video summarization method | |
CN110866431B (en) | Training method of face recognition model, and face recognition method and device | |
CN117496555A (en) | Pedestrian re-recognition model training method and device based on scale transformation scene learning | |
CN118397305A (en) | Clothes-changing pedestrian re-identification method, system, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||