CN113139539A - Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary - Google Patents
- Publication number
- CN113139539A (application number CN202110280975.4A)
- Authority
- CN
- China
- Prior art keywords
- boundary
- character
- expression
- feature
- sampling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2193—Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a method and a device for detecting arbitrary-shaped scene text with an asymptotic regression boundary. The method comprises the following steps: extracting visual features of an image to be detected and fusing them to obtain a feature expression; inputting the feature expression into a horizontal proposal box generation network to generate horizontal text candidate boxes; inputting the feature expression and the horizontal text candidate boxes into an oriented proposal box generation network to generate oriented text proposal boxes; and inputting the feature expression and the oriented text proposal boxes into an arbitrary-shape text boundary generation network to obtain the scene text detection result. Through asymptotic regression the method generates more accurate and smoother text boundaries, obtains more accurate point positions by exploiting the geometric topological and semantic relations among boundary sampling points, and gives the model better generalization, higher execution speed and stronger detection capability.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method and a device for detecting characters of an arbitrary-shape scene with an asymptotic regression boundary.
Background
In modern social life, images are a popular information carrier and are widely present in cyberspace, conveying rich information. Text has served as an even more direct information carrier since ancient times and contains rich and precise high-level semantic information. When text appears in images, it not only conveys its own content directly but also helps in understanding the deeper meaning of the image. Detecting and recognizing text in images therefore has very important practical value, mainly in four respects. (1) Intelligent visual question answering or image description systems. For a given image, a machine can answer questions about it or describe its deeper meaning by drawing on the text in the image. For example, given an image of a bus shot in a natural scene, an intelligent system can understand the deeper semantics of the image from visual elements containing text, such as the license plate, the starting station and destination of the bus, and the advertising posters on its surface. (2) Human-computer interaction systems. When people are out shopping or in shopping malls, they often encounter billboards, posters, store signs, menus, product information and the like, and this information is frequently presented in different languages. Capturing images with a mobile device and recognizing the text in them can therefore make people's lives more convenient. (3) Image retrieval based on text content. Text in an image can effectively resolve ambiguity in the image content, and retrieving images by the text they contain can complement image retrieval based on visual cues. In addition, many lawbreakers use images as carriers, embedding vulgar text in images and spreading them in cyberspace. Recognizing harmful text in images prevents such images from spreading and protects the physical and mental health of minors. (4) Intelligent transportation systems. In outdoor environments, accurate recognition of license plates and traffic signs benefits the intelligent management of traffic.
To recognize text in an image effectively, accurately locating the text is the most important preliminary step. In addition, detecting text in natural scenes plays an important role in image editing: accurate localization makes it easier to remove or replace text in an image, which in turn helps protect privacy. However, detecting text in natural scene images is extremely challenging. First, under uncontrolled natural conditions, factors such as uneven illumination, weather changes, shooting angle and camera shake cause scene text to suffer from low resolution, heavy noise, blur, shadow or occlusion, which increases the difficulty of detection. Second, the characteristics of scene text itself, such as arbitrary-shaped layouts, the diversity of font color, type and size, and the similarity between text texture and background elements (bricks, fences, etc.), cause text in the image to be missed, falsely detected or located with incomplete boundaries. In summary, text detection in natural scenes is a very challenging task in the field of computer vision.
In recent years, deep-learning-based methods for detecting text in natural scenes have fallen mainly into three categories: methods based on boundary point regression, methods based on pixel segmentation, and hybrid methods that combine regression and segmentation. Methods based on boundary point regression regress key points or a number of sampling points on the boundary of arbitrary-shaped text. They either regress an accurate text boundary for the arbitrary-shaped text inside a candidate region, or directly regress the points on the boundary with a one-stage model. Methods based on pixel segmentation treat the detection of arbitrary-shaped text regions as a semantic segmentation problem: they estimate the geometric attributes or connection relations of each pixel in the text region and finally aggregate the pixels into different text instances according to the auxiliary information predicted for each pixel. In addition, some researchers aggregate pixels into locally connected regions according to the predicted attributes of each pixel, and then predict or infer the connection relations between the connected regions so as to aggregate them into different text instances. Hybrid regression-segmentation methods obtain horizontal candidate boxes by regression and then perform pixel-level semantic segmentation inside the candidate boxes. However, each of the current mainstream methods has its own disadvantages. Regression-based methods usually obtain candidate boxes through a region proposal network, which requires manually designed prior boxes and relies on carefully designed sampling of positive and negative samples, thereby limiting the generalization performance of the model. In addition, such methods regress the points on the text boundary independently, ignoring the geometric topological and semantic relations between the boundary points.
Pixel-segmentation-based methods are typically extremely sensitive to noise. Background interference easily causes misjudgments, such as classifying pixels of a text region as background or classifying background as text. Moreover, such methods often have difficulty generating smooth boundaries, which may negatively affect some practical applications. Furthermore, they typically involve complex aggregation post-processing over a large number of pixels, which slows down the overall algorithm. Hybrid regression-segmentation methods also generate text candidate boxes with a region proposal network, and the segmentation is confined to the candidate boxes, so it suffers when the candidate boxes are not precisely located. In addition, these methods segment a large number of candidate boxes at multiple scales, which seriously affects their execution speed.
Therefore, the invention generates a small number of candidate boxes through a network that needs no hand-designed prior boxes, then densely samples points on the boundary of each candidate box, takes into account the geometric topological and semantic relations among the sampling points, and regresses iteratively, step by step, to obtain the accurate boundary of arbitrary-shaped text.
Disclosure of Invention
For natural scene images, the invention provides a method and a device for detecting arbitrary-shaped text by asymptotically regressing the text boundary. Starting from a candidate box, the method progressively evolves sampling points on the boundary of the box so as to accurately locate arbitrary-shaped text in the scene image. When generating candidate boxes, the invention regresses the center point of the text and the width and height of its horizontal circumscribed bounding box, and thereby avoids regressing from manually designed prior boxes. During evolution, the invention captures the topological and semantic relations among the sampling points on the boundary, thereby enhancing their feature expression and regressing more accurate boundary sampling point positions.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
a method for detecting characters of an arbitrary-shaped scene with an asymptotic regression boundary comprises the following steps:
1) extracting visual features of an image to be detected, and performing multi-scale feature fusion on the visual features to obtain a feature expression $F_e$ of the image to be detected;
2) generating a horizontal text candidate box $B_h$ according to the feature expression $F_e$;
3) generating an offset prediction value according to the feature expression $F_e$ and a first boundary sampling point set obtained by sampling on the boundary of the horizontal text candidate box $B_h$, and generating the corner points of an oriented text proposal box $B_o$ by combining the offset prediction value with the sampling points in the first boundary sampling point set, thereby obtaining the oriented text proposal box $B_o$;
4) sampling on the boundary of the oriented text proposal box $B_o$ to obtain a second boundary sampling point set, evolving the sampling points according to the feature expression $F_e$ and their coordinate positions to generate new coordinate positions of the sampling points, and, from the new coordinate positions, obtaining the accurate boundary position coordinates and predicting the score $s$ that the region enclosed by the boundary is text, thereby obtaining the scene text detection result.
Further, the method for extracting visual features comprises the following steps: a backbone network pre-trained on ImageNet was utilized.
Further, the backbone network includes: DLA34 network or ResNet50 network.
Further, the method for performing multi-scale feature fusion on the visual features comprises the following steps: and fusing the multi-scale features from shallow to deep.
Further, the horizontal text candidate box $B_h$ is obtained through a horizontal text proposal box generation network:
1) convolution and linear rectification are applied to the feature expression $F_e$, and the rectified result is input into a first convolution layer to generate a text center response map;
2) convolution and linear rectification are applied to the feature expression $F_e$, and the rectified result is input into a second convolution layer to generate a circumscribed-rectangle scale estimation map, the number of convolution kernels of the first convolution layer being different from that of the second convolution layer;
3) a max pooling operation is applied to the text center response map, and center points with low scores are filtered out with a set threshold $\tau_c$ to obtain the filtered center points;
4) the horizontal text candidate box $B_h$ is generated from the filtered center points and the circumscribed-rectangle scale estimation map.
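Steps 3) and 4) can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names, the `(width, height)` layout of the scale map, the use of image-pixel units for the scales, the feature-map stride of 4 and the "pixel equals its 3 x 3 neighbourhood maximum" peak test are all hypothetical choices.

```python
import numpy as np

def max_pool_3x3(heat):
    """3 x 3 max pooling with stride 1 and same-size output."""
    h, w = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views and take the element-wise maximum.
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(windows, axis=0)

def horizontal_boxes(center_map, scale_map, tau_c=0.35, stride=4):
    """center_map: (H, W) floats; scale_map: (H, W, 2) holding (width, height).

    Assumption: scales are in image pixels and `stride` maps feature-map
    coordinates back to image coordinates.
    """
    # A location is kept if it is a local maximum and scores at least tau_c.
    is_peak = center_map == max_pool_3x3(center_map)
    keep = is_peak & (center_map >= tau_c)
    ys, xs = np.nonzero(keep)
    boxes = []
    for y, x in zip(ys, xs):
        w, h = scale_map[y, x]
        cx, cy = x * stride, y * stride
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, center_map[y, x]))
    return boxes
```

Each returned tuple is `(x_min, y_min, x_max, y_max, score)`; because only surviving peaks produce boxes, the number of candidates stays small.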
Further, the horizontal text proposal box generation network is trained with a loss function comprising a text center loss and a scale loss, wherein $N_t$ denotes the number of text instances in the sample image, $i$ denotes the position index on the text center response map, $P$ and $Q$ denote the ground-truth values of the text center response map and of the circumscribed-rectangle scale estimation map respectively, the loss uses the Smooth-L1 loss function, and $\alpha$ and $\beta$ denote a first penalty factor and a second penalty factor respectively.
Further, the offset prediction value is generated by:
1) uniformly sampling $N_o$ points on the boundary of the horizontal text proposal box $B_h$ to obtain the coordinates $x$ of each sampling point in the first boundary sampling point set;
2) extracting, according to the feature expression $F_e$, the original boundary feature expression $F_c$ of each sampling point in the first boundary sampling point set;
3) performing boundary information aggregation on each original boundary feature $F_c$ to obtain an enhanced feature expression $F_{cia}$;
4) performing offset prediction according to the enhanced feature expression $F_{cia}$ to obtain the offset prediction value $o$ of each sampling point in the first boundary sampling point set.
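Step 1) can be illustrated with a small sketch that spaces points evenly by arc length around an axis-aligned box. The function name, the starting corner (top-left) and the traversal direction are assumptions made for the example; the text only states that the sampling is uniform.

```python
import numpy as np

def sample_box_boundary(x_min, y_min, x_max, y_max, n_points):
    """Return (n_points, 2) points evenly spaced by arc length.

    Assumption: sampling starts at the top-left corner and runs
    right, down, left, up (clockwise in image coordinates).
    """
    w, h = x_max - x_min, y_max - y_min
    perimeter = 2 * (w + h)
    # Arc-length position of each sample, measured from the top-left corner.
    t = np.arange(n_points) * perimeter / n_points
    pts = np.empty((n_points, 2))
    for k, d in enumerate(t):
        if d < w:                      # top edge, moving right
            pts[k] = (x_min + d, y_min)
        elif d < w + h:                # right edge, moving down
            pts[k] = (x_max, y_min + (d - w))
        elif d < 2 * w + h:            # bottom edge, moving left
            pts[k] = (x_max - (d - w - h), y_max)
        else:                          # left edge, moving up
            pts[k] = (x_min, y_max - (d - 2 * w - h))
    return pts
```

Because the points form a closed, ordered loop, the first and last samples are neighbours, which is what later makes circular convolution along the boundary meaningful.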
Further, the original boundary feature expression $F_c$ is obtained through the following steps:
1) extracting, according to the feature expression $F_e$, the semantic feature $F_{sem}$ and the position feature $F_{loc}$ of each sampling point in the first boundary sampling point set;
2) concatenating the semantic feature $F_{sem}$ and the position feature $F_{loc}$ to obtain the original boundary feature expression $F_c$.
Further, the enhanced feature expression $F_{cia}$ is obtained through the following steps:
1) applying a one-dimensional circular convolution and linear rectification to the original boundary feature expression $F_c$, and inputting the rectified result into a BN layer to capture the geometric topology of the closed boundary loop;
2) acquiring $t$ features from the geometric topology feature through boundary information aggregation units with $t$ dilation rates $r_t$, and concatenating the geometric topology feature with each of the $t$ features to generate a multi-scale fused boundary feature;
3) reducing the dimension of the fused boundary feature with a one-dimensional convolution operation to obtain a reduced feature;
5) distributing the boundary global feature to each sampling point in the first boundary sampling point set and concatenating to obtain the enhanced feature expression $F_{cia}$.
1) passing the input feature through 3 dilated one-dimensional circular convolutions with dilation rate $r_t$ to produce three feature expressions;
2) based respectively on the feature expressions produced in step 1), using $N_g$ collection nodes to collect information along the text boundary and capture the semantic relations between sampling points, and, together with the boundary global relation captured by an added virtual collection node, obtaining the collection node features;
3) computing the relation between the boundary sampling points and the collection nodes, where $i$ is the index of the collection node, $j$ is the index of the virtual collection node, and $D_u$ is the dimension of the feature expressions;
4) assigning the features on the collection nodes to each boundary sampling point to produce the aggregated feature, in which element-wise addition is used.
Further, the corner points of the oriented text proposal box $B_o$ are generated through a corner point generation network:
1) obtaining the updated sampling point coordinates $x' = x + o$ from the coordinates $x$ of each sampling point and the offset prediction value $o$;
2) selecting 4 points at equal intervals from the updated sampling point coordinates $x'$ as the corner points of the oriented text proposal box $B_o$.
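The two steps above amount to an element-wise update followed by strided indexing. The sketch below assumes the selection starts at index 0 of the ordered point set, which the text does not specify; the function name is hypothetical.

```python
import numpy as np

def corners_from_offsets(x, o):
    """x, o: (N, 2) sampling-point coordinates and predicted offsets."""
    x_new = x + o                      # x' = x + o
    n = len(x_new)
    # Four indices spaced n / 4 apart along the closed boundary
    # (assumption: selection starts at index 0).
    idx = (np.arange(4) * n) // 4
    return x_new, x_new[idx]
```

With 64 sampling points this picks indices 0, 16, 32 and 48.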
Further, the corner point generation network is trained with a loss function in which $N_t$ denotes the number of text instances in the sample image, the predicted corner coordinates of the j-th sampling point of the i-th text instance in the sample image are compared with the corresponding ground-truth corner coordinates, and the comparison uses the Smooth-L1 loss function.
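For reference, the Smooth-L1 loss named here is commonly defined per element as $0.5d^2$ when $|d| < 1$ and $|d| - 0.5$ otherwise, where $d$ is the difference between prediction and ground truth. The sketch below uses that common definition; averaging over all elements is an assumption made for the example and may differ from the normalisation used in the patent's formula.

```python
import numpy as np

def smooth_l1(pred, target):
    """Mean Smooth-L1 loss with the usual transition point at |d| = 1."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    # Quadratic near zero (stable gradients), linear for large errors (robust to outliers).
    per_elem = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_elem.mean()
```

For example, differences of 0.5 and 2 contribute 0.125 and 1.5 respectively, giving a mean of 0.8125.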
Further, the new coordinate positions of the sampling points are generated through a boundary positioning network, wherein the boundary positioning network is trained with an objective function in which $N_t$ denotes the number of text instances in the sample image, $N_a$ denotes the number of sampling points of the i-th text instance in the sample image, and the Smooth-L1 loss function is applied to the predicted value and the ground-truth value of the j-th sampling point of the i-th text instance in the sample image.
Further, the accurate boundary position coordinates and the score $s$ that the region enclosed by the predicted boundary is text are obtained through a reliable boundary positioning network, wherein the reliable boundary positioning network is trained with an objective function in which $N_t$ denotes the number of text instances in the sample image, and a confidence term represents whether the region enclosed by the i-th predicted boundary in the sample image belongs to text or background.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method as described above.
The invention has the beneficial effects that:
1. The invention progressively regresses the sampling points on the text boundary by asymptotic regression. This form of regression is consistent with the human visual system and generates more accurate and smoother text boundaries for text layouts with complex shapes.
2. The invention models the geometric topological and semantic relations between the boundary sampling points so that information is exchanged along the boundary, thereby enhancing the boundary feature expression and regressing more accurate point positions.
3. The invention does not need to design a prior frame, thereby leading the model to have better generalization.
4. The number of the candidate frames generated by the method is obviously less than that of the candidate frames generated by the prior frame regression, so that the execution speed of the model is effectively improved.
5. The invention has strong detection capability and excellent performance for characters in any shapes, such as horizontal characters, multidirectional characters, curved characters and the like.
Drawings
FIG. 1 is a flow chart of detecting characters in an arbitrary shape scene.
Fig. 2 is a schematic diagram of a boundary offset prediction network.
Fig. 3 is a schematic diagram of a boundary information aggregation unit.
Fig. 4 is a schematic diagram of a reliable boundary positioning mechanism.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for detecting characters in scenes with any shapes, as shown in fig. 1, includes:
1) extracting feature expression of an input image;
2) using the horizontal text proposal box generation network to predict the text center points and the scales of the circumscribed rectangles, thereby generating horizontal text candidate boxes;
3) sampling on the boundary of each horizontal candidate box, extracting the features of the boundary sampling points, and enhancing their feature expression with a boundary information aggregation (CIA) network to evolve the positions of the boundary sampling points, thereby generating oriented text candidate boxes;
4) sampling on the boundary of each oriented candidate box, and progressively evolving the positions of the boundary sampling points through several boundary positioning mechanism (CLM) modules to approach the boundary of the arbitrary-shaped text; and finally using a reliable boundary positioning mechanism (RCLM) to determine the confidence that the region enclosed by the located boundary points is text.
In one embodiment of the present invention, the input is an RGB image of size H × W. During training, images are randomly cropped to 640 × 640. During testing, the shortest side of the input image is set to a different value for each dataset, and the longest side is scaled correspondingly so that the aspect ratio is preserved. Features are then extracted from the image input to the network, with the following specific steps:
i) the visual features of the input image are extracted using a backbone network (i.e., a network pre-trained on ImageNet, such as DLA34 or ResNet50), the outputs of which are denoted $C_2$, $C_3$, $C_4$ and $C_5$, with channel dimensions $D_2$, $D_3$, $D_4$ and $D_5$. For DLA34, $D_2$, $D_3$, $D_4$ and $D_5$ are 64, 128, 256 and 512, respectively. For ResNet50, they are 256, 512, 1024 and 2048, respectively.
ii) $C_2$, $C_3$, $C_4$ and $C_5$ are input to the feature enhancement module. First, the feature dimension of $C_3$ is reduced by a convolution operation; the feature map is then upsampled by a factor of 2 and concatenated with $C_2$; the concatenated feature passes through a deformable convolution to produce a new feature $C'_3$. Then, $C_4$ goes through the same process and, after being concatenated with $C'_3$, passes through a deformable convolution to generate $C'_4$. Finally, $C_5$ goes through the same process and, after being concatenated with $C'_4$, passes through a deformable convolution to generate the scale-robust feature expression $F_e$, whose spatial resolution is $1/\sigma$ of the input and whose channel dimension is $D_e$, where $\sigma = 4$ and $D_e$ is 64 or 256 for DLA34 or ResNet50, respectively.
In one embodiment of the invention, the horizontal text candidate box generation network consists of two branches. The first consists of one 3 × 3 convolution layer (256 convolution kernels), one rectified linear unit (ReLU) and one 1 × 1 convolution layer (1 convolution kernel), and generates the text center response map. The second consists of one 3 × 3 convolution layer (256 convolution kernels), one ReLU and one 1 × 1 convolution layer (2 convolution kernels), and generates the circumscribed-rectangle scale estimation map. During training, the loss function of the network contains two parts, a text center loss and a scale loss, respectively denoted as:
where $N_t$ denotes the number of text instances in the image and $i$ denotes the position index on the response map. $P$ and $Q$ denote the ground-truth values of the text center response map and of the circumscribed-rectangle scale, respectively. The loss uses the Smooth-L1 loss function. $\alpha$ and $\beta$ denote penalty factors, which are set to 2 and 4 respectively during training.
During testing, given the obtained text center response map, the invention first applies a 3 × 3 max pooling operation to highlight the center points, and then filters out low-scoring center points with a threshold $\tau_c$. The horizontal text proposal box may then be represented as:
In one embodiment of the present invention, to generate the oriented text proposal boxes, the invention first uniformly samples $N_o$ points on the boundary of each horizontal text proposal box, and then extracts the original boundary feature expression using a boundary feature extractor (CFE) in the boundary offset prediction network. The specific process is as follows:
ii) the position features of the boundary sampling points are calculated as $F_{loc} = x - x_{min}$.
iii) the semantic feature $F_{sem}$ and the position feature $F_{loc}$ of each sampling point are concatenated to form the original feature expression $F_c$ of that sampling point.
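Steps ii) and iii) can be sketched as below. The semantic features are taken as a given input, since how they are extracted is not shown in this part of the text, and $x_{min}$ is assumed to be the per-axis minimum over the sampling points of one box; both are assumptions of the example, as is the function name.

```python
import numpy as np

def boundary_features(f_sem, x):
    """f_sem: (N, D) semantic features (hypothetical input); x: (N, 2) coordinates."""
    # Assumption: x_min is the per-axis minimum over this box's sampling points,
    # which makes the position feature independent of where the box sits in the image.
    x_min = x.min(axis=0)
    f_loc = x - x_min                  # F_loc = x - x_min
    # F_c: semantic and position features concatenated per sampling point.
    return np.concatenate([f_sem, f_loc], axis=1)
```

The result has shape `(N, D + 2)`, one row per boundary sampling point.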
After passing through the boundary information aggregation (CIA) module, an enhanced feature expression $F_{cia}$ is generated. $F_{cia}$ is then input to an offset prediction head (OPH) to generate an offset $o$ for each sampling point. The boundary feature extractor CFE, the boundary information aggregation module CIA and the offset prediction head OPH form the offset prediction network (as shown in Fig. 2). The updated coordinates of the sampling points are thus expressed as $x' = x + o$. From $x'$, 4 points are selected at equal intervals as the corner points of the oriented text proposal box. All predicted corner coordinates in the image are stacked together to form a coordinate matrix. During training, the loss function for corner learning is expressed as:
In an embodiment of the present invention, a boundary Information Aggregation (CIA) module specifically executes the following steps:
i) the original boundary feature $F_c$ is input, and the geometric topology of the closed boundary loop is captured using one 9 × 9 one-dimensional circular convolution (D convolution kernels), one ReLU and one batch normalization (BN) layer.
ii) the output of step i) is input into 6 CIA units with dilation rates r of 1, 1, 2, 2, 4 and 4 respectively, which model the semantic relations between boundary sampling points and each generate a feature.
iv) the fused features are reduced in dimension using a standard 3 × 3 one-dimensional convolution operation (D convolution kernels) to obtain a reduced feature.
vi) the global feature is distributed to each sampling point and concatenated to obtain the output feature $F_{cia}$ of the CIA module, where $D_{cia} = 8D$.
For each CIA unit, as shown in fig. 3, the specific steps are as follows:
i) for the input feature, the periodicity of the sampling points on the closed boundary is first encoded by 3 dilated one-dimensional circular convolutions with dilation rate r to produce three feature expressions, where $D_u$ is their dimension.
ii) based on these feature expressions, $N_g$ collection nodes collect information along the text boundary, capturing the semantic relations among sampling points and reducing interference from redundant and noisy sampling points. In addition, a virtual collection node is added to capture the global relation of the boundary. The features of the collection nodes are therefore expressed as:
where a max pooling operation and a feature aggregation operation $\phi$ are used, and $N_g$ is a hyperparameter, with $N_g = 64$ in the oriented proposal box generation module and $N_g = 128$ in the arbitrary-shape text boundary generation module.
v) the features on the collection nodes are assigned to each boundary sampling point to produce the aggregated feature, which is expressed as:
In one embodiment of the invention, the input feature $F_u$ is different for each CIA unit: for CIA units 1, 2, …, 6, $F_u$ denotes the respective input feature of that unit.
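The dilated one-dimensional circular convolution used by the CIA units treats the ordered boundary points as a closed loop, so the first and last points are neighbours. A minimal NumPy sketch follows; the function name, tensor layout and the absence of a bias term are assumptions of the example, not details from the patent.

```python
import numpy as np

def circular_conv1d(feat, kernel, dilation=1):
    """feat: (N, C_in) features of N ordered boundary points.
    kernel: (K, C_in, C_out) with odd K. Returns (N, C_out)."""
    n = feat.shape[0]
    k = kernel.shape[0]
    half = k // 2
    out = np.zeros((n, kernel.shape[2]))
    for tap in range(k):
        shift = (tap - half) * dilation
        # np.roll wraps around, so row i of `neighbours` is feat[(i + shift) % n].
        neighbours = np.roll(feat, -shift, axis=0)
        out += neighbours @ kernel[tap]
    return out
```

With a kernel of size 3 and dilation 2, each output point combines itself with the points two steps before and after it along the loop; larger dilation rates widen this context without adding parameters.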
In an embodiment of the present invention, the Offset Prediction Head (OPH) is composed of 3 1 × 1 one-dimensional convolutions (the number of convolution kernels is 256,64,2, respectively) and 2 ReLU units.
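A 1 × 1 one-dimensional convolution applies the same linear map to every sampling point, so the head can be sketched as a small per-point network. The random weights, the zero biases and the input width of 1024 (taken from $D_{cia} = 8D$ with $D = 128$) are illustrative assumptions.

```python
import numpy as np

def offset_prediction_head(f_cia, weights, biases):
    """f_cia: (N, D_cia). Three per-point linear layers with ReLU after the first two."""
    h = f_cia
    for layer, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b                  # 1 x 1 convolution == per-point linear map
        if layer < len(weights) - 1:
            h = np.maximum(h, 0.0)     # ReLU
    return h                           # (N, 2): one (dx, dy) offset per point

rng = np.random.default_rng(0)
dims = [1024, 256, 64, 2]              # assumed input width, then 256, 64, 2 kernels
weights = [rng.normal(scale=0.01, size=(i, o)) for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]
offsets = offset_prediction_head(rng.normal(size=(64, 1024)), weights, biases)
```

No activation follows the last layer, so the predicted offsets can be positive or negative.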
In one embodiment of the present invention, to generate arbitrary-shaped text boundaries, the invention first samples $N_a$ points on the boundary of the oriented text proposal box $B_o$, and then progressively evolves the sampling points with K boundary positioning mechanism (CLM) modules to approach the boundary of the arbitrary-shaped text. Finally, as shown in Fig. 4, the new coordinates obtained by the CLM evolution are input into a reliable boundary positioning mechanism (RCLM) to generate the accurate boundary location and to predict the score that the region enclosed by the boundary is text.
In an embodiment of the invention, the CLM is composed of a boundary feature extractor CFE, a boundary information aggregation CIA and an offset prediction head OPH module; adding the position offset output by the OPH to the input coordinate position to obtain a new coordinate position; the new coordinate position is then entered into the next CLM module for further evolution, and so on.
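The chaining of CLM modules is an iterative update of the form $x_{k+1} = x_k + o_k$. The sketch below shows only that control flow; `predict_offset` is a hypothetical stand-in for the CFE, CIA and OPH pipeline, and the toy predictor in the usage lines simply moves each point halfway toward a fixed target.

```python
import numpy as np

def evolve_boundary(x0, predict_offset, num_stages=2):
    """Apply `num_stages` refinement stages, each adding a predicted offset."""
    x = np.asarray(x0, dtype=float)
    history = [x]
    for _ in range(num_stages):
        x = x + predict_offset(x)      # new position = input position + offset
        history.append(x)
    return x, history

# Toy usage: a stand-in predictor that halves the distance to a known target.
target = np.full((4, 2), 4.0)
final, history = evolve_boundary(np.zeros((4, 2)), lambda x: 0.5 * (target - x))
```

In this toy setting two stages bring every coordinate from 0 to 3, illustrating how each stage only has to correct the residual error left by the previous one.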
In an embodiment of the invention, the RCLM is similar in structure to the CLM, except that the RCLM not only aggregates the boundary information into the output F of the CIA moduleciaInputting the position of the boundary into the offset prediction head to further adjust the position of the boundary, and generating final position coordinatesAnd inputting the result into a boundary Scoring Mechanism (CSM) module to predict the confidence level of whether the region surrounded by the boundary is a characterIn the training process, the evolution of the whole process passes through an objective functionLearning is carried out; and the CSM module is updated through the objective functionThis is completed. It is expressed as:
where the first quantity denotes the ground-truth position of the j-th sampling point of the i-th text instance, and the second denotes the confidence that the region enclosed by the i-th predicted boundary is text or background; l = 1 denotes text and l = 0 denotes background.
In an embodiment of the present invention, D, N_o, N_a, D_u, τ_c and K are set to 128, 64, 128, 0.35, and 2, respectively.
The test environment and experimental results of the proposed method for detecting arbitrary-shaped text with asymptotic boundary regression are as follows:
(1) Test environment:
System environment: Ubuntu 16.04.
Hardware environment: 15 GB memory; NVIDIA RTX 2080Ti GPU; Intel(R) Xeon(R) W-2125 CPU at 4.00 GHz; 2 TB hard disk.
(2) Experimental data:
Experiments were performed on three datasets: CTW1500 (1000 training images, 500 test images), Total-Text (1255 training images, 300 test images) and ArT (5603 training images, 4563 test images). During evaluation, the shortest side of the test images was set to 416, 512 and 640 for CTW1500, Total-Text and ArT, respectively.
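A small sketch of this test-time resizing is given below. It assumes that the aspect ratio is preserved and that the longer side is rounded to the nearest integer; neither detail is stated in the text, and the function name is hypothetical.

```python
def resize_to_short_side(width, height, short_side):
    """Return (new_width, new_height) so that the shorter side equals short_side.

    The aspect ratio is kept (assumption); sizes are rounded to integers.
    """
    scale = short_side / min(width, height)
    return round(width * scale), round(height * scale)

# Shortest-side settings per dataset, as listed in the text.
SHORT_SIDE = {"CTW1500": 416, "Total-Text": 512, "ArT": 640}

print(resize_to_short_side(1280, 720, SHORT_SIDE["Total-Text"]))  # (910, 512)
```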
(3) Optimization method:
Optimization was performed using the Adam optimizer. The models for CTW1500, Total-Text and ArT were trained for 250, 300 and 300 epochs, respectively. The initial learning rate was 0.0001, and it was multiplied by 0.1 after the 80th, 120th, 160th, 180th and 260th epochs. For the backbone networks DLA34 and ResNet-50, the batch size during training was set to 6 and 3, respectively.
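The step schedule stated above can be written as a short function. This is a sketch that assumes epochs are counted from 0, so that the decay takes effect from epoch index 80 onward; the function and constant names are hypothetical.

```python
MILESTONES = (80, 120, 160, 180, 260)

def learning_rate(epoch, base_lr=1e-4, gamma=0.1, milestones=MILESTONES):
    """Step schedule: multiply base_lr by gamma once for every milestone already passed."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

for epoch in (0, 79, 80, 120, 260):
    print(epoch, learning_rate(epoch))
```

With the stated factor of 0.1 and five milestones, the rate is five orders of magnitude below its initial value by the end of a 300-epoch run.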
(4) Experimental results:
1) Ablation experiments:
The experiments were performed on the CTW1500 dataset, and the results are shown in Tables 1 and 2. In these experiments, the baseline model evolves the sampling points on the horizontal candidate box to the boundary of the arbitrary-shaped text in a single step. As shown in the first row of Table 1, the baseline obtains 73.9% Recall, 80.9% Precision and 77.2% F-measure. Adding the oriented text proposal generation network (OTPG) to the baseline improves the F-measure by 2.6%. Further, when a CIA module is added to OTPG and ATPG, the F-measure improves by 2.6%. If RCLM is added to the baseline with OTPG, Precision improves markedly (83.4% vs. 81.5%), whereas Recall changes only slightly (77.8% vs. 78.2%). When OTPG, CIA and RCLM are all added to the baseline, the best Recall (81.3%), Precision (86.1%) and F-measure (83.7%) are achieved. Table 2 shows that as the number of CLM modules increases, the F-measure levels off (between 83.0% and 84.0%), while the speed of the model drops noticeably (from 16.3 FPS to 10.7 FPS).
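For reference, the F-measure is assumed here to be the usual harmonic mean of Precision and Recall. Under that assumption, a short check reproduces the baseline figure quoted above:

```python
def f_measure(recall, precision):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# Baseline values quoted in the text: 73.9% Recall, 80.9% Precision.
print(round(f_measure(73.9, 80.9), 1))  # 77.2
```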
Table 1: Effectiveness of the proposed modules
Table 2: Impact of the number of CLM modules
#CLM | Recall(%) | Precision(%) | F-measure(%) | FPS |
---|---|---|---|---|
K=0 | 80.3 | 86.6 | 83.0 | 16.3 |
K=1 | 81.3 | 86.1 | 83.7 | 12.8 |
K=2 | 81.1 | 87.1 | 84.0 | 11.8 |
K=3 | 81.3 | 86.5 | 83.8 | 10.7 |
2) Performance comparison:
As can be seen from Tables 3, 4 and 5, the method of the present invention achieves state-of-the-art performance.
Table 3: Performance comparison on CTW1500
Table 4: Performance comparison on Total-Text
Table 5: Performance comparison on ArT
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. A person skilled in the art may modify the technical solutions of the present invention or substitute equivalents without departing from the scope of the present invention, and the scope of protection of the present invention shall be determined by the claims.
Claims (10)
1. A method for detecting arbitrary-shaped scene text with asymptotic boundary regression, comprising the following steps:
1) extracting visual features of an image to be detected, and performing multi-scale feature fusion on the visual features to obtain a feature expression F_e of the image to be detected;
2) generating a horizontal text candidate box B_h according to the feature expression F_e;
3) generating offset prediction values according to the feature expression F_e and a first boundary sampling point set obtained by sampling on the boundary of the horizontal text candidate box B_h, and generating the corner points of an oriented text proposal box B_o by combining the offset prediction values with the sampling points in the first boundary sampling point set, thereby obtaining the oriented text proposal box B_o;
4) using a second boundary sampling point set obtained by sampling on the boundary of the oriented text proposal box B_o, evolving the coordinate positions of the sampling points according to the feature expression F_e to generate new coordinates of the sampling points, obtaining the coordinates of the accurate boundary position from the new coordinates of the sampling points, and estimating the score s that the region enclosed by the boundary is text, thereby obtaining a scene text detection result.
2. The method of claim 1, wherein the visual features are extracted using a backbone network pre-trained on ImageNet, the backbone network comprising a DLA34 network or a ResNet50 network.
3. The method of claim 1, wherein the horizontal text candidate box B_h is obtained by the following steps:
1) applying convolution and linear rectification to the feature expression F_e, and inputting the rectified result into a first convolution layer to generate a text center response map;
2) applying convolution and linear rectification to the feature expression F_e, and inputting the rectified result into a second convolution layer to generate a scale estimation map of the bounding rectangles of the text, wherein the number of convolution kernels of the first convolution layer differs from that of the second convolution layer;
3) performing a max pooling operation on the text center response map, and filtering out low-scoring center points with a set threshold τ_c to obtain the filtered center points;
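For illustration only, the center-point filtering of step 3) can be sketched as follows. The sketch assumes a 3 × 3 pooling window and a strict comparison against the threshold, neither of which is specified in the claim; the threshold default of 0.35 follows the value of τ_c given in the description, and the function name is hypothetical.

```python
def find_centers(heatmap, threshold=0.35):
    """Return (row, col, score) for cells that are 3x3 local maxima and exceed threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    centers = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            neighbourhood = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            # Max pooling keeps a cell only if it equals the maximum of its window.
            if score == max(neighbourhood) and score > threshold:
                centers.append((r, c, score))
    return centers

heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.3],
    [0.0, 0.0, 0.1, 0.2],
]
print(find_centers(heatmap))  # [(1, 1, 0.9)]
```

The cell at row 2, column 3 is also a local maximum, but its score of 0.3 falls below the threshold and is discarded.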
4. The method of claim 1, wherein the offset prediction value is generated by:
1) uniformly sampling N_o points on the boundary of the horizontal text candidate box B_h to obtain the coordinates x of each sampling point in the first boundary sampling point set;
2) extracting, according to the feature expression F_e, the original boundary feature expression F_c of each sampling point in the first boundary sampling point set;
3) performing boundary information aggregation on each original boundary feature expression F_c to obtain an enhanced feature expression F_cia;
4) performing offset prediction according to the enhanced feature expression F_cia to obtain the offset prediction value o of each sampling point in the first boundary sampling point set.
5. The method of claim 4, wherein the original boundary feature expression F_c is obtained by:
1) extracting, according to the feature expression F_e, the semantic feature F_sem and the position feature F_loc of each sampling point in the first boundary sampling point set;
2) concatenating the semantic feature F_sem and the position feature F_loc to obtain the original boundary feature expression F_c.
6. The method of claim 4, wherein the enhanced feature expression F_cia is obtained by:
1) applying a one-dimensional circular convolution and linear rectification to the original boundary feature expression F_c, and inputting the rectified result into a BN layer to capture the cyclic geometric topology of the closed boundary;
2) obtaining t features from the geometric topology by boundary information aggregation units with t dilation rates r_t, and concatenating the geometric topology with each of the features to generate a multi-scale fused boundary feature;
3) reducing the dimension of the boundary feature by a one-dimensional convolution operation to obtain the feature
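As an illustrative aside, a one-dimensional circular convolution with a dilation rate treats the sampling points as a closed loop, so the window wraps around from the last point to the first. The sketch below is single-channel and uses a hand-picked kernel, both simplifications made for illustration; the function name is hypothetical.

```python
def circular_conv1d(signal, kernel, dilation=1):
    """Dilated 1-D convolution (cross-correlation form) with wrap-around indexing."""
    n, k = len(signal), len(kernel)
    half = k // 2
    return [
        sum(kernel[j] * signal[(i + (j - half) * dilation) % n] for j in range(k))
        for i in range(n)
    ]

contour = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(circular_conv1d(contour, [1.0, 0.0, -1.0]))              # [4.0, -2.0, -2.0, -2.0, -2.0, 4.0]
print(circular_conv1d(contour, [1.0, 0.0, -1.0], dilation=2))  # [2.0, 2.0, -4.0, -4.0, 2.0, 2.0]
```

A larger dilation rate lets each output position see points farther along the closed boundary without enlarging the kernel.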
1) passing the feature through dilated one-dimensional circular convolutions with 3 dilation rates r_t to produce three feature expressions;
2) based respectively on the feature expressions, collecting information along the text boundary with N_g collection nodes to capture the semantic relations between sampling points, and, in combination with the global boundary relation captured by an added virtual collection node, obtaining the features of the collection nodes;
3) calculating the relation between the boundary sampling points and the collection nodes, where i is the index of the collection node, j is the index of the virtual collection node, and D_u is the dimension of the feature expressions;
8. The method of claim 4, wherein the corner points of the oriented text proposal box B_o are generated by:
1) obtaining the updated sampling point coordinates x' = x + o from the coordinates x of each sampling point and the offset prediction value o;
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110280975.4A CN113139539B (en) | 2021-03-16 | 2021-03-16 | Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113139539A (en) | 2021-07-20 |
CN113139539B (en) | 2023-01-13 |
Family
ID=76811104
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110280975.4A Active CN113139539B (en) | 2021-03-16 | 2021-03-16 | Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139539B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113822278A (en) * | 2021-11-22 | 2021-12-21 | 松立控股集团股份有限公司 | License plate recognition method for unlimited scene |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004334913A (en) * | 2004-08-19 | 2004-11-25 | Matsushita Electric Ind Co Ltd | Document recognition device and document recognition method |
CN107346420A (en) * | 2017-06-19 | 2017-11-14 | 中国科学院信息工程研究所 | Text detection localization method under a kind of natural scene based on deep learning |
CN108960229A (en) * | 2018-04-23 | 2018-12-07 | 中国科学院信息工程研究所 | One kind is towards multidirectional character detecting method and device |
CN109117836A (en) * | 2018-07-05 | 2019-01-01 | 中国科学院信息工程研究所 | Text detection localization method and device under a kind of natural scene based on focal loss function |
CN110245545A (en) * | 2018-09-26 | 2019-09-17 | 浙江大华技术股份有限公司 | A kind of character recognition method and device |
CN110287960A (en) * | 2019-07-02 | 2019-09-27 | 中国科学院信息工程研究所 | The detection recognition method of curve text in natural scene image |
CN111738055A (en) * | 2020-04-24 | 2020-10-02 | 浙江大学城市学院 | Multi-class text detection system and bill form detection method based on same |
Non-Patent Citations (3)
Title |
---|
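PENGWEN DAI et al.: "Deep Multi-Scale Context Aware Feature Aggregation for Curved Scene Text Detection", IEEE TRANSACTIONS ON MULTIMEDIA * |
ZHU YINGYING et al.: "A candidate box extraction algorithm for text detection", Journal of Data Acquisition and Processing * |
CHEN ZEYING: "A text detection algorithm based on adaptive non-maximum suppression", Digital Technology & Application * |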
Also Published As
Publication number | Publication date |
---|---|
CN113139539B (en) | 2023-01-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||