
CN113139539A - Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary - Google Patents


Info

Publication number
CN113139539A
Authority
CN
China
Prior art keywords
boundary
character
expression
feature
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110280975.4A
Other languages
Chinese (zh)
Other versions
CN113139539B (en)
Inventor
操晓春
代朋纹
张三义
张华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202110280975.4A priority Critical patent/CN113139539B/en
Publication of CN113139539A publication Critical patent/CN113139539A/en
Application granted granted Critical
Publication of CN113139539B publication Critical patent/CN113139539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193 - Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for detecting arbitrary-shaped scene characters with an asymptotic regression boundary. The method comprises the following steps: extracting visual features of an image to be detected, and performing feature fusion on the visual features to obtain a feature expression; inputting the feature expression into a horizontal suggestion box generation network to generate horizontal character candidate boxes; inputting the feature expression and the horizontal character candidate boxes into a directional suggestion box generation network to generate directional character suggestion boxes; and inputting the feature expression and the directional character suggestion boxes into an arbitrary-shaped character boundary generation network to obtain the scene character detection result. By asymptotic regression the method generates more accurate and smoother character boundaries, and by exploiting the geometric topological and semantic relations among boundary sampling points it obtains more accurate point positions, so that the model has better generalization, faster execution and stronger detection capability.

Description

Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method and a device for detecting characters of an arbitrary-shape scene with an asymptotic regression boundary.
Background
In modern social life, images are a popular information carrier and are widely present in cyberspace to deliver rich information. Characters have served as a direct information carrier since ancient times and contain rich, accurate high-level semantic information. When characters take images as their carrier, they not only convey textual information directly but also help in understanding the deeper meaning of the image. Therefore, detecting and recognizing the characters in an image has very important application value in real life, mainly embodied in four aspects. (1) Intelligent visual question answering or description systems. For a given image, a machine can respond intelligently or describe deeper meanings by combining the textual information in the image. For example, if a bus image is captured in a natural scene, an intelligent system can understand the deeper semantics of the image from the visual elements containing characters, such as the license plate, the bus's starting station and destination, and the advertising poster on the side of the bus. (2) Human-computer interaction systems. When people go shopping or visit shopping malls, they often encounter billboards, posters, store signs, menus, product information and so on, and this information is often presented in different languages. Collecting images with a mobile device and recognizing the character elements in them can therefore bring convenience to people's lives. (3) Image retrieval based on text content. Character information in an image can effectively resolve the ambiguity of image content, and retrieving images by the characters they contain can complement image retrieval based on visual cues. In addition, many lawbreakers use images as carriers, embedding vulgar text in them to spread it across cyberspace. Recognizing harmful character information in images helps prevent their dissemination and protects the physical and mental health of minors. (4) Intelligent transportation systems. In outdoor environments, accurately recognizing license plates and traffic signs has a positive effect on the intelligent management of traffic.
To effectively recognize the characters in an image, accurately locating the position of the characters is the most important prerequisite step. In addition, detecting characters in natural scenes plays an important role in the field of image editing: accurately locating the characters helps to better remove or replace the character content in an image, thereby achieving privacy protection. However, detecting text in natural scene images is extremely challenging. First, in uncontrolled natural scenes, factors such as uneven illumination, weather changes, shooting angle or camera shake give scene characters low resolution, heavy noise, blur, shadows or occlusion, which increases the difficulty of character detection. In addition, characteristics of the scene characters themselves, such as arbitrarily shaped layouts, the diversity of font color/type/size, and the similarity between character textures and background elements (bricks, fences, etc.), cause characters in images to be missed, falsely detected, or localized with incomplete boundaries. In summary, text detection in natural scenes is a very challenging task in the field of computer vision.
In recent years, deep-learning-based natural scene character detection methods have mainly fallen into three categories: methods based on boundary point regression, methods based on pixel segmentation, and methods based on a mixture of regression and segmentation. Methods based on boundary point regression regress key points or a number of sampling points on the boundary of arbitrarily shaped characters. They mainly either regress an accurate character boundary for arbitrarily shaped characters within a candidate region, or directly regress the points on the boundary with a one-stage model. Methods based on pixel segmentation treat arbitrarily shaped character regions in an image as a semantic segmentation problem, estimate the geometric attributes or connection relations of each pixel in the character region, and finally aggregate the pixels into different character instances according to the auxiliary information predicted for each pixel. In addition, some researchers aggregate pixels into locally connected regions according to the predicted per-pixel attribute information, and then predict or infer the connection relations between connected regions so as to aggregate them into different text instances. Methods based on a mixture of regression and segmentation obtain horizontal candidate boxes through regression and then perform pixel-level semantic segmentation inside the candidate boxes. However, the current mainstream methods each have their own disadvantages. Regression-based methods usually obtain candidate boxes through a region proposal network, which requires manually designed prior boxes and relies on careful positive and negative sample selection, limiting the generalization of the model. In addition, such methods regress the points on the text boundary independently, ignoring the geometric topological or semantic relations between the boundary points. Pixel-segmentation-based methods are typically extremely sensitive to noise: because of background interference they easily produce misjudgments, such as classifying character pixels as background or background as characters. Moreover, such methods often have difficulty generating smooth boundaries, which can negatively impact some practical applications, and they typically involve complex post-aggregation processing over a large number of pixels, slowing down the overall algorithm. Methods mixing regression and segmentation also generate text candidate boxes with a region proposal network, and since the segmentation is limited to the candidate boxes, it is affected when the candidate boxes are not precisely located. In addition, such methods involve segmentation at multiple scales and over a large number of candidate boxes, which also seriously affects their execution speed.
Therefore, the invention generates a small number of candidate frames through a network without prior frame design, then samples a plurality of dense points on the boundary of the candidate frames, considers the geometric topology and semantic relationship among the sampling points, and gradually iterates regression to obtain the accurate boundary of characters with any shape.
Disclosure of Invention
The invention provides a method and a device for detecting characters in any shape with asymptotic regression character boundaries aiming at natural scene images. The method gradually evolves sampling points on the boundary of the bounding box on the basis of the candidate box so as to accurately position the position of characters with any shape in the scene picture. In the process of generating the candidate frame, the invention avoids regression on the basis of a prior frame designed manually by regressing the center point of the character and the width and height of the horizontal external bounding frame. In the evolution process, the invention captures the topological relation and the semantic relation among the sampling points on the boundary, thereby enhancing the characteristic expression of the sampling points on the boundary and obtaining more accurate position of the boundary sampling point by regression.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
a method for detecting characters of an arbitrary-shaped scene with an asymptotic regression boundary comprises the following steps:
1) extracting visual features of an image to be detected, performing multi-scale feature fusion on the visual features, and acquiring a feature expression F_e of the image to be detected;
2) generating a horizontal character candidate box B_h according to the feature expression F_e;
3) generating an offset prediction value according to the feature expression F_e and a first boundary sampling point set obtained by sampling on the boundary of the horizontal character candidate box B_h, and generating the corner points of a directional character suggestion box B_o by combining the sampling points in the first boundary sampling point set, thereby obtaining the directional character suggestion box B_o;
4) using a second boundary sampling point set obtained by sampling on the boundary of the directional character suggestion box B_o, evolving the coordinate positions of the sampling points according to the feature expression F_e to generate new sampling point coordinates, obtaining the precise boundary position coordinates from the new coordinate positions of the sampling points, and estimating the score s that the region enclosed by the boundary belongs to characters, thereby obtaining the scene character detection result.
Further, the method for extracting the visual features comprises: utilizing a backbone network pre-trained on ImageNet.
Further, the backbone network includes: DLA34 network or ResNet50 network.
Further, the method for performing multi-scale feature fusion on the visual features comprises the following steps: and fusing the multi-scale features from shallow to deep.
Further, a horizontal character suggestion box generation network is used to obtain the horizontal character candidate box B_h through the following steps:
1) performing convolution and linear rectification on the feature expression F_e, and inputting the linear rectification result into a first convolution layer to generate a character center response map;
2) performing convolution and linear rectification on the feature expression F_e, and inputting the linear rectification result into a second convolution layer to generate a character circumscribed rectangle scale estimation map, wherein the number of convolution kernels of the first convolution layer is different from that of the second convolution layer;
3) performing a maximum pooling operation on the character center response map and filtering out low-score center points with a set threshold τ_c to obtain the filtered center points;
4) generating the horizontal character candidate box B_h according to the filtered center points and the character circumscribed rectangle scale estimation map.
Further, the horizontal character suggestion box generation network is trained with a loss function composed of a character center loss and a scale loss, wherein N_t represents the number of character instances in the sample image, i represents the position index on the character center response map, P and Q respectively represent the ground truth of the character center response map and of the character circumscribed rectangle scale estimation map, the scale loss uses the Smooth-L1 loss function, and α and β represent a first penalty factor and a second penalty factor, respectively.
Further, the offset prediction value is generated through the following steps:
1) uniformly sampling N_o points on the boundary of the horizontal character candidate box B_h to obtain the coordinates x of each sampling point in the first boundary sampling point set;
2) extracting the boundary original feature expression F_c of each sampling point in the first boundary sampling point set according to the feature expression F_e;
3) performing boundary information aggregation on each boundary original feature F_c to obtain the enhanced feature expression F_cia;
4) performing offset prediction according to the enhanced feature expression F_cia to obtain the offset prediction value o of each sampling point in the first boundary sampling point set.
Further, the boundary original feature expression F_c is obtained through the following steps:
1) respectively extracting the semantic feature F_sem and the position feature F_loc of each sampling point in the first boundary sampling point set according to the feature expression F_e;
2) splicing the semantic feature F_sem and the position feature F_loc together to obtain the boundary original feature expression F_c.
Further, the enhanced feature expression F_cia is obtained through the following steps:
1) performing a one-dimensional cyclic convolution operation and linear rectification on the boundary original feature expression F_c, and inputting the linear rectification result into a BN layer to capture the geometric topological structure of the closed boundary loop;
2) acquiring t features by boundary information aggregation units using the geometric topological structure and t expansion rates r_t, and splicing the geometric topological structure with each of the t features to generate a multi-scale fused boundary feature;
3) reducing the dimension of the multi-scale fused boundary feature with a one-dimensional convolution operation to obtain a reduced feature;
4) performing a maximum pooling operation on the reduced feature to generate a boundary global feature;
5) distributing the boundary global feature to each sampling point in the first boundary sampling point set and splicing to obtain the enhanced feature expression F_cia.
Further, the features of each boundary information aggregation unit are obtained through the following steps:
1) passing the input feature through 3 dilated one-dimensional cyclic convolutions with expansion rate r_t to produce three feature expressions;
2) based on two of these feature expressions respectively, collecting information along the character boundary with N_g sink nodes to capture the semantic relations between sampling points, and combining the boundary global relation captured by an added virtual sink node to obtain the sink node features;
3) calculating the relation between the boundary sampling points and the sink nodes, wherein i is the index of a boundary sampling point, j is the index of a sink node, and D_u is the dimension of the feature expressions;
4) assigning the features on the sink nodes to each boundary sampling point, with element-by-element addition, to produce the output features of the unit.
Further, the corner points of the directional character suggestion box B_o are generated through a corner point generation network:
1) obtaining updated sampling point coordinates x' = x + o from the coordinates x of each sampling point and the offset prediction value o;
2) selecting 4 points at equal intervals from the coordinates x' of the sampling points as the corner points of the directional character suggestion box, thereby generating the corner points of the directional character suggestion box B_o.
Further, the corner point generation network is trained with a loss function in which the predicted corner coordinates and the ground-truth corner coordinates of the jth sampling point of the ith character instance in the sample image are compared with the Smooth-L1 loss function, N_t representing the number of character instances in the sample image.
Further, the new coordinate positions of the sampling points are generated through a boundary positioning network, wherein the objective function used to train the boundary positioning network is a Smooth-L1 loss between the predicted value and the true value of the jth sampling point of the ith character instance in the sample image, N_t representing the number of character instances in the sample image and N_a the number of sampling points of the ith character instance.
Further, the precise boundary position coordinates and the score s that the region enclosed by the predicted boundary belongs to characters are obtained through a reliable boundary positioning network, wherein the objective function used to train the reliable boundary positioning network is defined over the confidence that the region enclosed by the ith predicted boundary in the sample image belongs to characters or to the background, N_t representing the number of character instances in the sample image.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method described above.
The invention has the beneficial effects that:
1. the invention gradually regresses the sampling points on the character boundary by adopting an asymptotic regression mode, the regression form is consistent with the human visual system, and the asymptotic regression can generate more accurate and smooth character boundary for the character layout with a complex form.
2. The invention models the geometric topological relation and semantic relation between the boundary sampling points to interact the information on the boundary, thereby enhancing the boundary characteristic expression and obtaining the position of a more accurate point by regression.
3. The invention does not need to design a prior frame, thereby leading the model to have better generalization.
4. The number of the candidate frames generated by the method is obviously less than that of the candidate frames generated by the prior frame regression, so that the execution speed of the model is effectively improved.
5. The invention has strong detection capability and excellent performance for characters in any shapes, such as horizontal characters, multidirectional characters, curved characters and the like.
Drawings
FIG. 1 is a flow chart of detecting characters in an arbitrary shape scene.
Fig. 2 is a schematic diagram of a boundary offset prediction network.
Fig. 3 is a schematic diagram of a boundary information aggregation unit.
Fig. 4 is a schematic diagram of a reliable boundary positioning mechanism.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for detecting characters in scenes of any shape, as shown in Fig. 1, includes:
1) extracting the feature expression of an input image;
2) using the horizontal character suggestion box generation network to predict character center points and the scale of the circumscribed rectangle, generating horizontal character candidate boxes;
3) sampling the boundary of each horizontal candidate box, extracting the features of the boundary sampling points, enhancing the feature expression of the boundary sampling points with a boundary information aggregation (CIA) network, and evolving the positions of the boundary sampling points to generate directional character candidate boxes;
4) sampling the boundary of each directional candidate box and gradually evolving the positions of the boundary sampling points through several boundary positioning mechanism (CLM) modules to approach the boundary of characters of any shape; finally, a reliable boundary positioning mechanism (RCLM) determines the confidence of whether the region enclosed by the located boundary points belongs to characters (a skeleton of this pipeline is sketched below).
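For illustration only, the following is a minimal sketch of how these four stages might be composed in PyTorch. All class names and the helper sample_boundary_points are hypothetical placeholders for the networks described in the embodiments below, not the actual implementation of the invention.

```python
import torch.nn as nn

class ArbitraryShapeTextDetector(nn.Module):
    """Hypothetical end-to-end composition of the four stages described above."""
    def __init__(self, backbone, fusion, horizontal_head, offset_net, clm_modules, rclm):
        super().__init__()
        self.backbone = backbone                 # e.g. DLA34 / ResNet50, pre-trained on ImageNet
        self.fusion = fusion                     # multi-scale feature fusion -> F_e
        self.horizontal_head = horizontal_head   # center / scale prediction -> B_h
        self.offset_net = offset_net             # CFE + CIA + OPH -> directional box B_o
        self.clms = nn.ModuleList(clm_modules)   # K boundary positioning mechanism modules
        self.rclm = rclm                         # reliable boundary positioning mechanism

    def forward(self, image):
        feats = self.backbone(image)
        f_e = self.fusion(feats)                 # step 1: feature expression F_e
        boxes_h = self.horizontal_head(f_e)      # step 2: horizontal candidate boxes
        boxes_o = self.offset_net(f_e, boxes_h)  # step 3: directional suggestion boxes
        points = sample_boundary_points(boxes_o) # N_a points on each B_o (assumed helper)
        for clm in self.clms:                    # step 4: progressive boundary evolution
            points = clm(f_e, points)
        boundary, score = self.rclm(f_e, points) # precise boundary + text/background score
        return boundary, score
```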
In one embodiment of the present invention, the input is an RGB image of size H × W. Images are randomly cropped to 640 × 640 during training. During testing, the shortest side of the input image is set to different values according to the dataset, the aspect ratio is kept fixed, and the longest side changes correspondingly. The image input to the network then undergoes feature extraction, with the following specific steps:
i) the visual features of the input image are extracted with a backbone network (i.e., a network pre-trained on ImageNet, such as DLA34 or ResNet50), whose outputs are denoted C_2, C_3, C_4 and C_5 with channel dimensions D_2, D_3, D_4 and D_5. For DLA34, D_2, D_3, D_4 and D_5 are 64, 128, 256 and 512, respectively; for ResNet50, they are 256, 512, 1024 and 2048, respectively.
ii) C_2, C_3, C_4 and C_5 are input into the feature enhancement module. C_3 first has its feature dimension reduced by a convolution operation; the feature map is then upsampled by a factor of 2 and spliced with C_2; the spliced feature passes through a deformable convolution to produce a new feature C'_3. Then C_4 repeats the above process with C'_3: after splicing, a deformable convolution generates C'_4. Finally, C_5 repeats the above process with C'_4: after splicing, a deformable convolution generates the scale-robust feature expression F_e of size (H/σ) × (W/σ) with D_e channels, where σ = 4 and D_e is 64 or 256 for DLA34 or ResNet50, respectively.
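As a rough illustration of the shallow-to-deep fusion just described, the sketch below reduces, upsamples and splices C_2 through C_5. The deformable convolutions of the patent are replaced here by ordinary 3 × 3 convolutions purely for simplicity, and the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowToDeepFusion(nn.Module):
    """Illustrative fusion of backbone features C2..C5 into one feature map F_e.
    Ordinary convolutions stand in for the deformable convolutions of the patent."""
    def __init__(self, dims=(64, 128, 256, 512), out_dim=64):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in dims[1:])
        self.fuse = nn.ModuleList(
            nn.Conv2d(out_dim + c, out_dim, 3, padding=1)
            for c in (dims[0], out_dim, out_dim))

    def forward(self, c2, c3, c4, c5):
        prev, feats = c2, (c3, c4, c5)
        for reduce, fuse, c in zip(self.reduce, self.fuse, feats):
            x = reduce(c)                                     # reduce channel dimension
            x = F.interpolate(x, size=prev.shape[-2:], mode='bilinear', align_corners=False)
            prev = F.relu(fuse(torch.cat([x, prev], dim=1)))  # splice and fuse
        return prev                                           # F_e at 1/4 resolution (sigma = 4)
```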
In one embodiment of the invention, the horizontal text candidate box generation network consists of two branches. The first consists of one 3 × 3 convolution layer (256 convolution kernels), one linear rectification unit (ReLU) and one 1 × 1 convolution layer (1 convolution kernel), and generates the character center response map. The second consists of one 3 × 3 convolution layer (256 convolution kernels), one ReLU and one 1 × 1 convolution layer (2 convolution kernels), and generates the character circumscribed rectangle scale estimation map. During training, the loss function of this network contains two parts, a character center loss and a scale loss, where N_t denotes the number of character instances in the image, i denotes the position index on the response map, P and Q denote the ground truth of the character center response map and of the circumscribed rectangle scale respectively, the scale loss uses the Smooth-L1 loss function, and the penalty factors α and β are set to 2 and 4, respectively, during training.
In the test process, the invention first highlights the center points of the obtained character center response map with a 3 × 3 maximum pooling operation, and then filters out center points whose score is below the threshold τ_c. The ith horizontal text suggestion box is then constructed from the abscissa and ordinate of the point with the ith maximal response together with the corresponding scale estimate.
In one embodiment of the present invention, to generate directional text suggestion boxes, the invention first uniformly samples N_o points on the boundary of each horizontal text suggestion box and then extracts the boundary original feature expression F_c with the boundary feature extractor (CFE) of the boundary offset prediction network. The specific process is as follows:
i) extracting the semantic features F_sem of the sampling points from F_e;
ii) computing the position features of the boundary sampling points as F_loc = x - x_min;
iii) splicing the semantic features F_sem and position features F_loc of each sampling point together to form the original feature expression F_c of each sampling point.
After passing through a boundary information aggregation (CIA) module, the enhanced feature expression F_cia is generated. F_cia is then input to an offset prediction head (OPH) to generate an offset o for each sampling point. The boundary feature extractor CFE, the boundary information aggregation module CIA and the offset prediction head form the offset prediction network (as shown in Fig. 2). The updated coordinates of the sampling points are thus expressed as x' = x + o. From x', 4 points are selected at equal intervals as the corner points of the directional character suggestion box. All predicted corner coordinates in the image are stacked together to form a coordinate matrix. During training, the loss function for corner learning is a Smooth-L1 loss between the predicted corner coordinates and the true corner coordinates.
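A sketch of how boundary-point features could be extracted and how the offset prediction head could be built: semantic features are bilinearly sampled from F_e at the boundary points (here via grid_sample), position features are computed as x - x_min, and the two are concatenated. The coordinate normalization for grid_sample and the exact feature dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def boundary_point_features(f_e, points):
    """points: (B, N_o, 2) absolute (x, y) coordinates on the feature map f_e (B, C, H, W).
    Returns concatenated semantic + position features of shape (B, C + 2, N_o)."""
    B, C, H, W = f_e.shape
    # normalise to [-1, 1] for grid_sample (assumption about the coordinate convention)
    grid = torch.stack([points[..., 0] / (W - 1) * 2 - 1,
                        points[..., 1] / (H - 1) * 2 - 1], dim=-1).unsqueeze(2)  # (B, N_o, 1, 2)
    f_sem = F.grid_sample(f_e, grid, align_corners=True).squeeze(-1)             # (B, C, N_o)
    f_loc = (points - points.min(dim=1, keepdim=True).values).transpose(1, 2)    # x - x_min
    return torch.cat([f_sem, f_loc], dim=1)

class OffsetPredictionHead(nn.Module):
    """Three 1x1 one-dimensional convolutions (256, 64, 2 kernels) with two ReLUs."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(in_dim, 256, 1), nn.ReLU(),
                                 nn.Conv1d(256, 64, 1), nn.ReLU(),
                                 nn.Conv1d(64, 2, 1))

    def forward(self, f_cia):           # f_cia: (B, D_cia, N_o)
        return self.net(f_cia)          # per-point (dx, dy) offsets
```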
In an embodiment of the present invention, the boundary information aggregation (CIA) module executes the following steps:
i) given the input boundary original features F_c, capture the geometric topology of the closed boundary loop with one 9 × 9 one-dimensional cyclic convolution operation (D convolution kernels), one ReLU and one batch normalization (BN) layer;
ii) input the resulting feature into 6 CIA units with expansion rates r of 1, 1, 2, 2, 4 and 4 respectively, to model the semantic relations between boundary sampling points and generate six features;
iii) splice these features to generate a multi-scale fused boundary feature;
iv) reduce the dimension of the fused feature with a standard 3 × 3 one-dimensional convolution operation (D convolution kernels) to obtain a reduced feature;
v) generate the boundary global feature from the reduced feature with a max pooling operation;
vi) distribute the global feature to each sampling point and splice, obtaining the output feature F_cia of the CIA module, whose dimension is D_cia = 8*D.
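A sketch of this pipeline follows, with the CIAUnit class assumed to be defined as in the next sketch. Whether the six units run in parallel on the topology feature or are chained is not fully specified in the translated text; here they all read the topology feature as a simplification, and the channel bookkeeping that yields 8*D channels is likewise an assumption.

```python
import torch
import torch.nn as nn

class CIAModule(nn.Module):
    """Boundary information aggregation over N points sampled on a closed contour.
    Input/output tensors have shape (B, D, N); channel bookkeeping is illustrative."""
    def __init__(self, d=128):
        super().__init__()
        self.topology = nn.Sequential(
            nn.Conv1d(d, d, 9, padding=4, padding_mode='circular'),  # 9x9 cyclic 1-D conv
            nn.ReLU(), nn.BatchNorm1d(d))
        self.units = nn.ModuleList(CIAUnit(d, rate) for rate in (1, 1, 2, 2, 4, 4))
        self.reduce = nn.Conv1d(7 * d, 4 * d, 3, padding=1)  # fuse the 7 concatenated pieces

    def forward(self, f_c):
        g = self.topology(f_c)                                   # geometric topology feature
        pieces = [g] + [unit(g) for unit in self.units]          # multi-scale boundary features
        fused = self.reduce(torch.cat(pieces, dim=1))
        global_feat = fused.max(dim=2, keepdim=True).values      # boundary global feature
        global_feat = global_feat.expand(-1, -1, fused.size(2))  # distribute to every point
        return torch.cat([fused, global_feat], dim=1)            # F_cia with 8*d channels
```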
For each CIA unit, as shown in Fig. 3, the specific steps are as follows:
i) the input feature F_u is first passed through 3 dilated one-dimensional cyclic convolutions with dilation rate r, which encode the periodicity of the sampling points on the closed boundary and produce three feature representations of dimension D_u;
ii) based on the first of these representations, information is collected along the character boundary by N_g sink nodes, capturing the semantic relations between sampling points and reducing the interference of redundant and noisy sampling points; in addition, a virtual sink node is added to capture the global relationship of the boundary; the sink node features are obtained with a maximum pooling operation and a feature aggregation operation φ, where N_g is a hyper-parameter set to 64 in the directional suggestion box generation module and to 128 in the arbitrary-shaped character boundary generation module;
iii) in the same way, aggregated features are obtained from the second representation;
iv) the relation between the ith boundary sampling point and the jth sink node is then computed from these features;
v) the features on the sink nodes are assigned back to each boundary sampling point through element-by-element addition, producing the aggregated output feature of the unit.
In one embodiment of the invention, the input F_u is different for each of the 6 CIA units.
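The relation between boundary points and sink nodes appears only as an image in the filing; the sketch below therefore assumes a scaled dot-product form softmax(q·kᵀ/√D_u), with adaptive max pooling used to build the N_g sink nodes plus one virtual global node, and element-wise addition to return the aggregated feature to each sampling point. It should be read as one plausible reading of the described mechanism, not the published formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CIAUnit(nn.Module):
    """One boundary information aggregation unit (illustrative reconstruction)."""
    def __init__(self, d, rate, n_sink=64, d_u=128):
        super().__init__()
        conv = lambda: nn.Conv1d(d, d_u, 3, padding=rate, dilation=rate, padding_mode='circular')
        self.q, self.k, self.v = conv(), conv(), conv()   # 3 dilated cyclic 1-D convolutions
        self.out = nn.Conv1d(d_u, d, 1)
        self.n_sink = n_sink

    def forward(self, f_u):                               # f_u: (B, D, N)
        q, k, v = self.q(f_u), self.k(f_u), self.v(f_u)   # (B, D_u, N)
        # N_g sink nodes: max-pool segments of the boundary, plus one virtual global node
        sink_k = torch.cat([F.adaptive_max_pool1d(k, self.n_sink), k.max(2, keepdim=True).values], 2)
        sink_v = torch.cat([F.adaptive_max_pool1d(v, self.n_sink), v.max(2, keepdim=True).values], 2)
        # relation between the i-th boundary point and the j-th sink node (assumed form)
        rel = torch.softmax(q.transpose(1, 2) @ sink_k / sink_k.size(1) ** 0.5, dim=-1)  # (B, N, N_g+1)
        agg = (rel @ sink_v.transpose(1, 2)).transpose(1, 2)                             # (B, D_u, N)
        return self.out(agg) + f_u                        # element-wise addition back to each point
```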
In an embodiment of the present invention, the Offset Prediction Head (OPH) is composed of 3 1 × 1 one-dimensional convolutions (the number of convolution kernels is 256,64,2, respectively) and 2 ReLU units.
In one embodiment of the present invention, for the generation of arbitrary-shaped character boundaries, N_a points are first sampled on the boundary of the directional character suggestion box B_o; the sampling points are then gradually evolved by K boundary positioning mechanism (CLM) modules to approach the boundary of characters of any shape. Finally, as shown in Fig. 4, the new coordinates obtained by the CLM evolution are input into a reliable boundary positioning mechanism (RCLM) to generate the precise boundary localization and to predict the score that the region enclosed by the boundary belongs to characters.
In an embodiment of the invention, the CLM is composed of a boundary feature extractor CFE, a boundary information aggregation CIA and an offset prediction head OPH module; adding the position offset output by the OPH to the input coordinate position to obtain a new coordinate position; the new coordinate position is then entered into the next CLM module for further evolution, and so on.
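A sketch of one CLM step and of its iterative application, reusing the hypothetical boundary_point_features, CIAModule and OffsetPredictionHead from the earlier sketches and assuming their dimensions line up:

```python
import torch.nn as nn

class CLM(nn.Module):
    """One evolution step: CFE -> CIA -> OPH, then add the predicted offsets to the points."""
    def __init__(self, cia, oph):
        super().__init__()
        self.cia, self.oph = cia, oph

    def forward(self, f_e, points):                       # points: (B, N_a, 2)
        f_c = boundary_point_features(f_e, points)        # CFE (see earlier sketch)
        offsets = self.oph(self.cia(f_c))                 # (B, 2, N_a)
        return points + offsets.transpose(1, 2)           # x' = x + o

def evolve_boundary(clms, f_e, points):
    """Apply K CLM modules in sequence to progressively approach the text boundary."""
    for clm in clms:
        points = clm(f_e, points)
    return points
```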
In an embodiment of the invention, the RCLM is similar in structure to the CLM, except that the RCLM not only inputs the output F_cia of the boundary information aggregation (CIA) module into the offset prediction head to further adjust the boundary position and generate the final position coordinates, but also inputs F_cia into a boundary scoring mechanism (CSM) module to predict the confidence of whether the region enclosed by the boundary is a character. In the training process, the evolution of the whole process is learned through an objective function defined as a Smooth-L1 loss between the predicted value and the true value of the jth sampling point of the ith character instance, and the CSM module is updated through an objective function defined over the confidence that the region enclosed by the ith predicted boundary belongs to characters or to the background, where l = 1 denotes a character and l = 0 denotes the background.
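The evolution objective is described as a Smooth-L1 loss over the sampling points; the scoring objective is a binary text/background classification whose exact form appears only as an image, so binary cross-entropy is assumed in the sketch below.

```python
import torch
import torch.nn.functional as F

def evolution_loss(pred_points, gt_points):
    """Smooth-L1 loss between predicted and ground-truth boundary sampling points.
    pred_points, gt_points: (N_t, N_a, 2) for the text instances of one sample image."""
    return F.smooth_l1_loss(pred_points, gt_points)

def scoring_loss(pred_conf, labels):
    """Binary cross-entropy over the confidence that each predicted boundary encloses
    text (l = 1) or background (l = 0); the BCE form is an assumption."""
    return F.binary_cross_entropy_with_logits(pred_conf, labels.float())
```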
In an embodiment of the present invention, D, N_o, N_a, D_u, τ_c and K are set to 128, 64, 128, 0.35, and 2, respectively.
The invention provides a method for detecting characters of any shape with an asymptotic regression boundary; the test environment and experimental results are as follows:
(1) Testing environment:
System environment: Ubuntu 16.04.
Hardware environment: memory: 15 GB; GPU: NVIDIA RTX 2080Ti; CPU: 4.00 GHz Intel(R) Xeon(R) W-2125; hard disk: 2 TB.
(2) Experimental data:
the invention has performed experiments on three data sets, CTW1500(1000 training pictures, 500 test pictures), Total-Text (1255 training pictures, 300 test pictures) and ArT (5603 training pictures, 4563 test pictures). During the evaluation, for CTW1500, Total-Text and ArT, the shortest edges of their test pictures were set to 416,512 and 640, respectively.
(3) Optimization method:
Optimization is performed with the Adam optimizer. The models are trained for 250, 300 and 300 epochs on CTW1500, Total-Text and ArT, respectively. The initial learning rate is 0.0001 and is multiplied by 0.1 after the 80th, 120th, 160th, 180th and 260th epochs. For the backbone networks DLA34 and ResNet-50, the batch size during training is set to 6 and 3, respectively.
(4) The experimental results are as follows:
1) ablation experiment:
the experiment was performed on the CTW1500 dataset and the results are shown in tables 1 and 2. In the experiment, the baseline model carries out position evolution once from the sampling point on the horizontal candidate frame to the boundary of the character with any shape. As shown in the first row of Table 1, baseball can obtain 73.9% Recall, 80.9% Precision and 77.2% F-measure. If a directional character candidate generation network is added into the baseline model, 2.6% of F-measure can be improved. Further, when a CIA module is added into OPTG and ATPG, the F-measure of the model is improved by 2.6 percent. If RCLM is added on the basis of the baselene model added with OTPG, Precision (83.4% vs. 81.5%) of the model is obviously improved compared with Recall (77.8% vs. 78.2%). When the baseline is added with OTPG, CIA and RCLM, the optimal Recall (81.3%), Precision (86.1%) and F-measure (83.7%) can be achieved. It can be seen from table 2 that as the number of CLM modules increases, the F-measure tends to be stable, but the computation speed of the model is significantly affected.
Table 1: validation of extracted modules
Figure BDA0002978369010000111
Table 2: impact of CLM Module number
#CLM Recall(%) Precision(%) F-measure(%) FPS
K=0 80.3 86.6 83.0 16.3
K=1 81.3 86.1 83.7 12.8
K=2 81.1 87.1 84.0 11.8
K=3 81.3 86.5 83.8 10.7
2) Performance comparison:
As can be seen from Tables 3, 4 and 5, the method of the present invention achieves state-of-the-art performance.
Table 3: performance comparison over CTW1500
Table 4: performance comparison on Total-Text
Table 5: comparison of Performance at ArT
the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and a person skilled in the art may make modifications or equivalent substitutions to the technical solutions of the present invention without departing from the scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A method for detecting characters of an arbitrary-shaped scene with an asymptotic regression boundary, comprising the following steps:
1) extracting visual features of an image to be detected, performing multi-scale feature fusion on the visual features, and acquiring a feature expression F_e of the image to be detected;
2) generating a horizontal character candidate box B_h according to the feature expression F_e;
3) generating an offset prediction value according to the feature expression F_e and a first boundary sampling point set obtained by sampling on the boundary of the horizontal character candidate box B_h, and generating the corner points of a directional character suggestion box B_o by combining the sampling points in the first boundary sampling point set, thereby obtaining the directional character suggestion box B_o;
4) using a second boundary sampling point set obtained by sampling on the boundary of the directional character suggestion box B_o, evolving the coordinate positions of the sampling points according to the feature expression F_e to generate new sampling point coordinates, obtaining the precise boundary position coordinates from the new coordinate positions of the sampling points, and estimating the score s that the region enclosed by the boundary belongs to characters, thereby obtaining the scene character detection result.
2. The method of claim 1, wherein the method of extracting visual features comprises: utilizing a backbone network pre-trained on ImageNet; the backbone network includes: DLA34 network or ResNet50 network.
3. The method of claim 1, wherein the horizontal character candidate box B_h is obtained through the following steps:
1) performing convolution and linear rectification on the feature expression F_e, and inputting the linear rectification result into a first convolution layer to generate a character center response map;
2) performing convolution and linear rectification on the feature expression F_e, and inputting the linear rectification result into a second convolution layer to generate a character circumscribed rectangle scale estimation map, wherein the number of convolution kernels of the first convolution layer is different from that of the second convolution layer;
3) performing a maximum pooling operation on the character center response map and filtering out low-score center points with a set threshold τ_c to obtain the filtered center points;
4) generating the horizontal character candidate box B_h according to the filtered center points and the character circumscribed rectangle scale estimation map.
4. The method of claim 1, wherein the offset prediction value is generated through the following steps:
1) uniformly sampling N_o points on the boundary of the horizontal character candidate box B_h to obtain the coordinates x of each sampling point in the first boundary sampling point set;
2) extracting the boundary original feature expression F_c of each sampling point in the first boundary sampling point set according to the feature expression F_e;
3) performing boundary information aggregation on each boundary original feature F_c to obtain the enhanced feature expression F_cia;
4) performing offset prediction according to the enhanced feature expression F_cia to obtain the offset prediction value o of each sampling point in the first boundary sampling point set.
5. The method of claim 4, wherein the boundary original feature expression F_c is obtained through the following steps:
1) respectively extracting the semantic feature F_sem and the position feature F_loc of each sampling point in the first boundary sampling point set according to the feature expression F_e;
2) splicing the semantic feature F_sem and the position feature F_loc together to obtain the boundary original feature expression F_c.
6. The method of claim 4, wherein the enhanced feature expression F_cia is obtained through the following steps:
1) performing a one-dimensional cyclic convolution operation and linear rectification on the boundary original feature expression F_c, and inputting the linear rectification result into a BN layer to capture the geometric topological structure of the closed boundary loop;
2) acquiring t features by boundary information aggregation units using the geometric topological structure and t expansion rates r_t, and splicing the geometric topological structure with each of the t features to generate a multi-scale fused boundary feature;
3) reducing the dimension of the multi-scale fused boundary feature with a one-dimensional convolution operation to obtain a reduced feature;
4) performing a maximum pooling operation on the reduced feature to generate a boundary global feature;
5) distributing the boundary global feature to each sampling point in the first boundary sampling point set and splicing to obtain the enhanced feature expression F_cia.
7. The method of claim 6, wherein the features of each boundary information aggregation unit are obtained through the following steps:
1) passing the input feature through 3 dilated one-dimensional cyclic convolutions with expansion rate r_t to produce three feature expressions;
2) based on two of these feature expressions respectively, collecting information along the character boundary with N_g sink nodes to capture the semantic relations between sampling points, and combining the boundary global relation captured by an added virtual sink node to obtain the sink node features;
3) calculating the relation between the boundary sampling points and the sink nodes, wherein i is the index of a boundary sampling point, j is the index of a sink node, and D_u is the dimension of the feature expressions;
4) assigning the features on the sink nodes to each boundary sampling point, with element-by-element addition, to produce the output features of the unit.
8. The method of claim 4, wherein the corner points of the directional character suggestion box B_o are generated through the following steps:
1) obtaining updated sampling point coordinates x' = x + o from the coordinates x of each sampling point and the offset prediction value o;
2) selecting 4 points at equal intervals from the coordinates x' of the sampling points as the corner points of the directional character suggestion box, thereby generating the corner points of the directional character suggestion box B_o.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202110280975.4A 2021-03-16 2021-03-16 Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary Active CN113139539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110280975.4A CN113139539B (en) 2021-03-16 2021-03-16 Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110280975.4A CN113139539B (en) 2021-03-16 2021-03-16 Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary

Publications (2)

Publication Number Publication Date
CN113139539A true CN113139539A (en) 2021-07-20
CN113139539B CN113139539B (en) 2023-01-13

Family

ID=76811104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110280975.4A Active CN113139539B (en) 2021-03-16 2021-03-16 Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary

Country Status (1)

Country Link
CN (1) CN113139539B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334913A (en) * 2004-08-19 2004-11-25 Matsushita Electric Ind Co Ltd Document recognition device and document recognition method
CN107346420A (en) * 2017-06-19 2017-11-14 中国科学院信息工程研究所 Text detection localization method under a kind of natural scene based on deep learning
CN108960229A (en) * 2018-04-23 2018-12-07 中国科学院信息工程研究所 One kind is towards multidirectional character detecting method and device
CN109117836A (en) * 2018-07-05 2019-01-01 中国科学院信息工程研究所 Text detection localization method and device under a kind of natural scene based on focal loss function
CN110245545A (en) * 2018-09-26 2019-09-17 浙江大华技术股份有限公司 A kind of character recognition method and device
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
CN111738055A (en) * 2020-04-24 2020-10-02 浙江大学城市学院 Multi-class text detection system and bill form detection method based on same

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PENGWEN DAI et al.: "Deep Multi-Scale Context Aware Feature Aggregation for Curved Scene Text Detection", IEEE Transactions on Multimedia *
朱盈盈 et al.: "Candidate box extraction algorithm suitable for text detection", Journal of Data Acquisition and Processing *
陈泽瀛: "A text detection algorithm based on adaptive non-maximum suppression", Digital Technology & Application *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822278A (en) * 2021-11-22 2021-12-21 松立控股集团股份有限公司 License plate recognition method for unlimited scene

Also Published As

Publication number Publication date
CN113139539B (en) 2023-01-13

Similar Documents

Publication Publication Date Title
Zhou et al. Violence detection in surveillance video using low-level features
Chen et al. Show, match and segment: Joint weakly supervised learning of semantic matching and object co-segmentation
US11062123B2 (en) Method, terminal, and storage medium for tracking facial critical area
WO2018103608A1 (en) Text detection method, device and storage medium
He et al. Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild
Xiao et al. A weakly supervised semantic segmentation network by aggregating seed cues: the multi-object proposal generation perspective
Wang et al. Video co-saliency guided co-segmentation
Shivakumara et al. Multioriented video scene text detection through Bayesian classification and boundary growing
Ma et al. A saliency prior context model for real-time object tracking
Li et al. Hierarchical feature fusion network for salient object detection
Xu et al. Video saliency detection via graph clustering with motion energy and spatiotemporal objectness
US9626585B2 (en) Composition modeling for photo retrieval through geometric image segmentation
Ni et al. Learning to photograph: A compositional perspective
CN106203423B (en) Weak structure perception visual target tracking method fusing context detection
CN104952083B (en) A kind of saliency detection method based on the modeling of conspicuousness target background
Lee et al. Unsupervised video object segmentation via prototype memory network
Zheng et al. A feature-adaptive semi-supervised framework for co-saliency detection
CN112752158B (en) Video display method and device, electronic equipment and storage medium
Mei et al. Large-field contextual feature learning for glass detection
CN113139544A (en) Saliency target detection method based on multi-scale feature dynamic fusion
Tang et al. CLASS: cross-level attention and supervision for salient objects detection
Bak et al. Two-stream convolutional networks for dynamic saliency prediction
Zhang et al. Deep salient object detection by integrating multi-level cues
Zhang et al. Detecting and removing visual distractors for video aesthetic enhancement
Wang et al. End-to-end trainable network for superpixel and image segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant