CN106897732B - A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields - Google Patents
- Publication number: CN106897732B (granted from application CN201710010596.7A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/08: Neural networks; learning methods
- G06V10/225: Image preprocessing by selection of a specific region containing or referencing a pattern, based on a marking or identifier characterising the area
- G06V30/10: Character recognition
Description
Technical Field

The present invention belongs to the technical field of computer vision, and more specifically relates to a method for detecting multi-oriented text in natural images based on linked text fields.
Background Art

Reading text in natural images is a challenging and popular task with many practical applications in optical character recognition of photographs, geolocation, and image retrieval. In a text reading system, text detection, which locates text regions with bounding boxes at the word level or text-line level, is usually the critical first step. In a sense, text detection can also be regarded as a special case of object detection in which the targets are words, characters, or text lines.

Although existing techniques have achieved great success in applying object detection methods to text detection, object detection methods still have several obvious shortcomings when locating text regions. First, the aspect ratio of a word or text line is usually much larger than that of a general object, and previous methods have difficulty producing bounding boxes of such proportions. Second, some non-Latin scripts, such as Chinese, contain no spaces between adjacent words; existing techniques can only detect words and are unsuitable for such text, because text without spaces provides no visual cues for separating words. Third, in large natural scene images the text may appear in any orientation, yet the vast majority of existing techniques can only detect horizontal text. Text detection in natural scene images therefore remains one of the difficult problems in computer vision.
Summary of the Invention

The purpose of the present invention is to provide a method for detecting multi-oriented text in natural images based on linked text fields. The method detects text with high accuracy and speed, uses a simple model, is robust against complex image backgrounds, and can also detect long text in non-Latin scripts.

To achieve the above purpose, the present invention approaches scene text detection from a new perspective and provides a method for detecting multi-oriented text in natural images based on linked text fields, comprising the following steps:
(1) Train the text field and link detection network model, including the following sub-steps:

(1.1) Annotate the text content of all text images in the training image set at the phrase level, each label being the four corner-point coordinates of the phrase's rectangular initial bounding box, to obtain the training data set;

(1.2) Define a text field detection model that predicts output text fields and links from the phrase labels. The network model consists of a cascaded convolutional neural network and convolutional predictors. Compute the text field and link labels from the training data set, design a loss function, and train the network by back-propagation, combined with online data augmentation and online hard negative mining, to obtain the text field detection model. This includes the following sub-steps:
(1.2.1) Build the text field detection convolutional neural network model: the first feature-extraction convolutional units come from a pre-trained VGG-16 network, comprising convolution layer 1 through pooling layer 5; fully connected layers 6 and 7 are converted into convolution layers 6 and 7, followed by additional convolution layers 8, 9, and 10 that extract deeper features for detection, with convolution layer 11 as the final layer. The last six of these convolution layers output feature maps of different sizes, which makes it convenient to extract high-quality features at multiple scales; text fields and links are detected on these six feature maps. After each of these six convolution layers, a filter of size 3×3 is added as a convolutional predictor that jointly detects text fields and links;
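Below is a minimal PyTorch sketch of one such 3×3 convolutional predictor. It is a sketch under stated assumptions rather than the patented implementation: the 2 + 5 + 16 + 8 channel breakdown follows the quantities predicted in steps (1.2.5) and (1.2.6) below, and omitting the cross-layer channels on the first of the six layers is an assumption, since that layer has no preceding detection layer.

```python
import torch
import torch.nn as nn

class ConvPredictor(nn.Module):
    """3x3 convolutional predictor attached to one of the six feature maps.

    Per spatial location it predicts:
      2  channels: text field positive/negative scores (c_s)
      5  channels: geometric offsets (dx, dy, dw, dh, dtheta)
      16 channels: within-layer link scores (8 neighbors x 2 classes, c_l1)
      8  channels: cross-layer link scores (4 neighbors x 2 classes, c_l2)
    """

    def __init__(self, in_channels: int, with_cross_layer: bool = True):
        super().__init__()
        out_channels = 2 + 5 + 16 + (8 if with_cross_layer else 0)
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=3, padding=1)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # (N, C, H, W) feature map -> (N, 31 or 23, H, W) raw predictions
        return self.conv(feature_map)
```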
(1.2.2) Generate text field bounding box labels from the annotated word bounding boxes: for the original training image set Itr, denote the scaled training image set by Itr′, and let w_I and h_I be the width and height of the images in Itr′, which may be 384×384 or 512×512 pixels. The i-th image Itr_i′ serves as model input, and all word bounding boxes annotated on Itr_i′ are denoted W_i = [W_i1, ..., W_ip], where W_ij is the j-th word bounding box on the i-th image (word bounding boxes may be at the word level or the phrase level), j = 1, ..., p, and p is the total number of word bounding boxes on the i-th image. Denote by Itro_i′ = [Itro_i1′, ..., Itro_i6′] the set of feature maps output by the last six convolution layers, where Itro_il′ is the feature map output by the l-th of these layers, with width w_l and height h_l. A coordinate (x, y) on Itro_il′ corresponds to a horizontal initial bounding box B_ilq on Itr_i′ centered at (x_a, y_a), satisfying

x_a = (w_I / w_l)(x + 0.5), y_a = (h_I / h_l)(y + 0.5).

The width and height of the initial bounding box B_ilq are both set to a constant a_l that controls the scale of the output text fields, l = 1, ..., 6. Denote by B_il = [B_il1, ..., B_ilm] the set of initial bounding boxes corresponding to the feature map Itro_il′ output by layer l, q = 1, ..., m, where m is the number of initial bounding boxes on that feature map. Whenever the center of an initial bounding box B_ilq is contained inside some annotated word bounding box W_ij on Itr′ and the size a_l of B_ilq and the height h of that word bounding box satisfy

max(a_l / h, h / a_l) ≤ 1.5,

the initial bounding box B_ilq is labeled positive (label value 1) and matched to the word bounding box W_ij whose height is closest to it; otherwise, when B_ilq fails to satisfy these two conditions with every word bounding box in W_i, it is labeled negative (label value 0). Text fields are generated on the initial bounding boxes and share their label class; the proportionality constant 1.5 is an empirical value;
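A sketch of the coordinate mapping and the labeling rule just described; treating the word box as axis-aligned is a simplifying assumption here, since the patent also matches rotated word boxes:

```python
def initial_box_center(x, y, w_l, h_l, w_img, h_img):
    """Map feature-map coordinates (x, y) on the l-th feature map of size
    (w_l, h_l) to the center (x_a, y_a) of the corresponding horizontal
    initial bounding box on an input image of size (w_img, h_img)."""
    x_a = w_img / w_l * (x + 0.5)
    y_a = h_img / h_l * (y + 0.5)
    return x_a, y_a

def initial_box_label(x_a, y_a, a_l, word_box):
    """Return 1 (positive) if the initial box centered at (x_a, y_a) with
    size a_l matches the annotated word box, else 0 (negative).
    `word_box` is assumed axis-aligned: (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = word_box
    h = ymax - ymin
    center_inside = xmin <= x_a <= xmax and ymin <= y_a <= ymax
    size_matches = max(a_l / h, h / a_l) <= 1.5  # empirical constant 1.5
    return 1 if (center_inside and size_matches) else 0
```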
(1.2.3) Generate text fields on the labeled initial bounding boxes produced in step (1.2.2) and compute the offsets of the positive text fields: a negative text field bounding box s- is its negative initial bounding box B-; a positive text field bounding box s+ is obtained from its positive initial bounding box B+ through the following steps: a) let θ_s be the angle between the annotated word bounding box W matched to B+ and the horizontal direction, and rotate W clockwise by θ_s about the center of B+; b) crop W, removing the parts extending beyond the left and right sides of B+; c) rotate the cropped word bounding box W′ counterclockwise by θ_s about the center of B+, yielding the geometric parameters x_s, y_s, w_s, h_s, θ_s of the ground-truth label of the text field s+; d) compute the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) of s+ relative to B+ from the following formulas:
x_s = a_l · Δx_s + x_a
y_s = a_l · Δy_s + y_a
w_s = a_l · exp(Δw_s)
h_s = a_l · exp(Δh_s)
θ_s = Δθ_s
where x_s, y_s, w_s, h_s, θ_s are the center abscissa, center ordinate, width, height, and angle to the horizontal of the text field bounding box s+; x_a, y_a, w_a, h_a are the center abscissa, center ordinate, width, and height of the horizontal initial bounding box B+; and Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s are, respectively, the offsets of the center abscissa x_s and center ordinate y_s of s+ relative to the initial bounding box B+, the offset changes of its width w_s and height h_s, and the offset of its angle θ_s;
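The five formulas invert directly, giving both the encoding used to build regression targets and the decoding used at prediction time; a short sketch (function names are illustrative):

```python
import math

def decode_offsets(d, x_a, y_a, a_l):
    """Recover text field geometry (x_s, y_s, w_s, h_s, theta_s) from the
    offsets d = (dx, dy, dw, dh, dtheta) and the initial bounding box
    with center (x_a, y_a) and size a_l, using the formulas above."""
    dx, dy, dw, dh, dtheta = d
    return (a_l * dx + x_a,
            a_l * dy + y_a,
            a_l * math.exp(dw),
            a_l * math.exp(dh),
            dtheta)

def encode_offsets(s, x_a, y_a, a_l):
    """Inverse direction: regression targets for a ground-truth text
    field s = (x_s, y_s, w_s, h_s, theta_s)."""
    x_s, y_s, w_s, h_s, theta_s = s
    return ((x_s - x_a) / a_l,
            (y_s - y_a) / a_l,
            math.log(w_s / a_l),
            math.log(h_s / a_l),
            theta_s)
```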
(1.2.4) Compute link labels for the text field bounding boxes produced in step (1.2.3): text fields s are generated on the initial bounding boxes B, so the link labels between text fields are the same as the link labels between their corresponding initial bounding boxes B. For the feature map set Itro_i′ = [Itro_i1′, ..., Itro_i6′]: if two initial bounding boxes in the set B_il of the same feature map Itro_il′ are both labeled positive and matched to the same word, the within-layer link between them is labeled positive, and otherwise negative; if an initial bounding box in the set B_il of feature map Itro_il′ and an initial bounding box in the set B_i(l-1) of Itro_i(l-1)′ are both labeled positive and matched to the same word bounding box W_ij, the cross-layer link between them is labeled positive, and otherwise negative;
(1.2.5) Feed the scaled training image set Itr′ into the text field detection model and predict the text field outputs s: initialize the model weights and biases; the learning rate is set to 10^-3 for the first 60,000 training iterations, after which it decays to 10^-4. For the last six convolution layers, at a coordinate (x, y) on the feature map Itro_il′ of layer l, where (x, y) corresponds to the initial bounding box B_ilq on the input image Itr_i′ centered at (x_a, y_a) with size a_l, the 3×3 convolutional predictor predicts the scores c_s of B_ilq being classified positive or negative; c_s is a two-dimensional vector whose values are decimals between 0 and 1. It also predicts five numbers, collected into the predicted offset vector ŝ, as the geometric offsets for assignment to a positive text field s+, namely the offset of the predicted text field bounding box's center abscissa relative to the positive initial bounding box B+, the offset of its center ordinate relative to B+, the offset changes of its height and width, and its angle offset;
(1.2.6) Predict the within-layer link and cross-layer link outputs on the basis of the predicted text fields: for within-layer links, at a coordinate (x, y) on a feature map Itro_il′, take the neighboring points (x′, y′) in the range x-1 ≤ x′ ≤ x+1, y-1 ≤ y′ ≤ y+1. When these eight points are mapped to the input image Itr_i′, one obtains the within-layer neighbor text fields s_(x′,y′,l) linked to the reference text field s_(x,y,l) corresponding to (x, y); the eight within-layer neighbor text fields can be written as the set

N^w_(x,y,l) = { s_(x′,y′,l) : x-1 ≤ x′ ≤ x+1, y-1 ≤ y′ ≤ y+1, (x′,y′) ≠ (x,y) }.

The 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the links between s_(x,y,l) and the within-layer neighbor set; c_l1 is a 16-dimensional vector, and the superscript w denotes a within-layer link;
For cross-layer links, a cross-layer link connects the text fields corresponding to two points on the feature maps output by two consecutive convolution layers. Because the width and height of the feature map are halved after each convolution layer, the width w_l and height h_l of the output feature map Itro_il′ of layer l are half the width w_(l-1) and height h_(l-1) of the feature map Itro_i(l-1)′ of layer l-1, while the initial bounding box scale a_l of Itro_il′ is twice the scale a_(l-1) of Itro_i(l-1)′. For (x, y) on the output feature map Itro_il′ of layer l, take the four cross-layer neighbor points (x′, y′) in the range 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 on the feature map Itro_i(l-1)′; the initial bounding box on the input image Itr_i′ corresponding to (x, y) on Itro_il′ exactly coincides spatially with the four initial bounding boxes on Itr_i′ corresponding to the four cross-layer neighbor points on Itro_i(l-1)′. The four cross-layer neighbor text fields s_(x′,y′,l-1) can be written as the set

N^c_(x,y,l) = { s_(x′,y′,l-1) : 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 }.

The 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference text field s_(x,y,l) of layer l and the neighbor text field set on layer l-1; c_l2 is an 8-dimensional vector containing the positive and negative scores of the links between s_(x,y,l) and each of its four neighbor text fields, and the superscript c denotes a cross-layer link;
All within-layer links and all cross-layer links together constitute the link set N_s;
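Both neighborhoods can be enumerated directly from the index ranges above; a short sketch, with clipping at the feature-map borders (which the patent leaves implicit):

```python
def within_layer_neighbors(x, y, w_l, h_l):
    """The up-to-8 within-layer neighbors of (x, y) on a w_l x h_l
    feature map: all (x', y') with x-1 <= x' <= x+1, y-1 <= y' <= y+1
    except (x, y) itself."""
    return [(xp, yp)
            for xp in range(max(0, x - 1), min(w_l, x + 2))
            for yp in range(max(0, y - 1), min(h_l, y + 2))
            if (xp, yp) != (x, y)]

def cross_layer_neighbors(x, y, w_prev, h_prev):
    """The up-to-4 cross-layer neighbors of (x, y) on layer l, taken on
    the twice-as-large feature map of layer l-1:
    2x <= x' <= 2x+1, 2y <= y' <= 2y+1."""
    return [(xp, yp)
            for xp in range(2 * x, min(w_prev, 2 * x + 2))
            for yp in range(2 * y, min(h_prev, 2 * y + 2))]
```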
(1.2.7) Take the text field labels and link labels obtained in steps (1.2.3) and (1.2.4) and the true offsets of the positive text fields as the reference outputs, and take the text field classes and scores predicted in step (1.2.5), the predicted text field offsets, and the link scores predicted in step (1.2.6) as the predicted outputs. Design an objective loss function between the predicted outputs and the reference outputs, and train the text field and link detection model continuously by back-propagation to minimize the losses of text field classification, text field offset regression, and link classification. The objective loss function designed for the model is the weighted sum of three losses:

L(y_s, c_s, y_l, c_l, ŝ, s) = (1/n_s) L_conf(y_s, c_s) + λ1 (1/n_s) L_loc(ŝ, s) + λ2 (1/n_l) L_conf(y_l, c_l)
where y_s are the labels of all text fields, c_s are the predicted text field scores, y_l are the link labels, and c_l are the predicted link scores, composed of the within-layer link scores c_l1 and the cross-layer scores c_l2. If the i-th initial bounding box is labeled positive then y_s(i) = 1, and otherwise 0. L_conf(y_s, c_s) is the softmax loss of the predicted text field scores c_s; L_conf(y_l, c_l) is the softmax loss of the predicted link scores c_l; L_loc(ŝ, s) is the smooth L1 regression loss between the predicted text field geometric parameters ŝ and the ground-truth labels s. n_s is the number of positive initial bounding boxes, used to normalize the text field classification and regression losses; n_l is the total number of positive links, used to normalize the link classification loss; λ1 and λ2 are weight constants, set to 1 in practice;
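A PyTorch sketch of this objective, under the assumptions that the text field and link logits have been flattened to shape (N, 2) with integer labels in {0, 1}, and that the geometry tensors contain only the entries for positive initial bounding boxes:

```python
import torch
import torch.nn.functional as F

def seglink_loss(seg_logits, seg_labels, geo_pred, geo_gt,
                 link_logits, link_labels, lambda1=1.0, lambda2=1.0):
    """Weighted sum of the three losses above: softmax text field
    classification, smooth-L1 offset regression on positive boxes,
    and softmax link classification. Labels are int64 tensors."""
    n_s = (seg_labels == 1).sum().clamp(min=1).float()
    n_l = (link_labels == 1).sum().clamp(min=1).float()
    l_seg = F.cross_entropy(seg_logits, seg_labels, reduction='sum') / n_s
    l_loc = F.smooth_l1_loss(geo_pred, geo_gt, reduction='sum') / n_s
    l_link = F.cross_entropy(link_logits, link_labels, reduction='sum') / n_l
    return l_seg + lambda1 * l_loc + lambda2 * l_link
```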
(1.2.8) During the training process of step (1.2.7), augment the training data Itr online and use an online hard negative mining strategy to balance positive and negative samples. Before the training images Itr are scaled to the same size and loaded in batches, they are randomly cropped into image patches such that each patch has a minimum Jaccard overlap coefficient o with the ground-truth text bounding boxes. For multi-oriented text, the data augmentation is performed on the minimum enclosing rectangle of the multi-oriented text bounding box. The overlap coefficient o of each sample is chosen randomly from 0, 0.1, 0.3, 0.5, 0.7, and 0.9, and the size of each patch is between 0.1 and 1 times the original image size; training images are not flipped horizontally. In addition, because negative text field and link samples make up the bulk of the training samples, the online hard negative mining strategy is used to balance positive and negative samples; mining is carried out separately for text fields and links, keeping the ratio of negative to positive samples at most 3:1.
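A sketch of the hard negative mining step, assuming a per-sample classification loss has already been computed; it would be applied separately to the text field samples and the link samples, as described above:

```python
import torch

def hard_negative_mask(loss_per_sample, labels, neg_pos_ratio=3):
    """Keep all positives plus the highest-loss negatives, so that
    negatives : positives <= neg_pos_ratio. Returns a boolean mask
    over the samples selected for the loss."""
    pos_mask = labels == 1
    n_pos = int(pos_mask.sum())
    n_neg = min(int((~pos_mask).sum()), neg_pos_ratio * max(n_pos, 1))
    ranked = loss_per_sample.clone()
    ranked[pos_mask] = -1.0      # exclude positives from the ranking
    _, idx = ranked.topk(n_neg)  # hardest negatives
    keep = pos_mask.clone()
    keep[idx] = True
    return keep
```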
(2) Use the trained convolutional neural network to detect text fields and links in the text image to be detected, including the following sub-steps:

(2.1) Perform text field detection on the text image to be detected; the feature maps output by different convolution layers predict text fields of different scales, while the feature map output by one convolution layer predicts text fields of a single scale. The i-th text image Itst_i in the test image set Itst is scaled to a uniform size, which may be set manually according to the images to be detected; denote the scaled image by Itst_i′. Feed Itst_i′ into the text field and link detection model trained in step (1.2) to obtain the set Itsto_i′ = [Itsto_i1′, ..., Itsto_i6′] of feature maps output by the last six convolution layers, where Itsto_il′ is the feature map output by the l-th of these layers, l = 1, ..., 6. At each coordinate (x, y) on an output feature map Itsto_il′, the 3×3 convolutional predictor predicts the scores c_s of the initial bounding box B_ilq corresponding to (x, y) being a positive or negative text field, and also predicts five numbers as the geometric offsets for prediction as a positive text field s+;
(2.2) Perform link detection on the text fields detected on all feature layers, the links comprising within-layer links and cross-layer links: on the basis of the text fields predicted in (2.1), predict the within-layer and cross-layer links. For within-layer links, at a coordinate (x, y) on a feature map Itsto_il′, the 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the within-layer links between s_(x,y,l) and its eight neighbor text fields; for cross-layer links, the 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference text field s_(x,y,l) of layer l and the four neighbor text fields on layer l-1. Together, c_l1 and c_l2 constitute the predicted link scores c_l;

(2.3) Combine the detected text field confidence scores and link confidence scores, where a text field's confidence comprises its positive/negative class scores and its offset scores, with the convolutional predictors outputting softmax-normalized scores; the within-layer scores c_l1 and cross-layer scores c_l2 are obtained exactly as in (2.2) and constitute the predicted link scores c_l.
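A small sketch of the softmax normalization, assuming PyTorch tensors whose last dimension holds the two class logits:

```python
import torch.nn.functional as F

def normalized_confidences(seg_logits, link_logits):
    """Turn raw 2-class predictor outputs into softmax-normalized
    confidences in [0, 1] (probability of the positive class); these
    are the scores filtered against alpha and beta in step (3.1)."""
    seg_conf = F.softmax(seg_logits, dim=-1)[..., 1]
    link_conf = F.softmax(link_logits, dim=-1)[..., 1]
    return seg_conf, link_conf
```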
(3) Combine the text fields and links to obtain the output bounding boxes, including the following sub-steps:

(3.1) Using the normalized scores obtained in (2.3), filter the text fields and links output by the convolutional predictors, then build a link graph with the filtered text fields as nodes and the links as edges: the fixed number of text fields s and links N_s produced in step (2) by feeding the text image into the text field detection model are filtered by their scores, with separate filtering thresholds α for the text fields s and β for the links N_s. The thresholds may be set manually for different data; in practice one may take α = 0.9, β = 0.7 for multi-oriented text detection, α = 0.9, β = 0.5 for multilingual long text detection, and α = 0.6, β = 0.3 for horizontal text detection. Take the filtered text fields s′ as nodes and the filtered links N_s′ as edges, and construct a graph from them;
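A minimal sketch of the filtering and graph construction, assuming the text fields are indexed and each candidate link is given as a pair of text field indices (the exact data layout is not fixed by the patent):

```python
def build_link_graph(seg_conf, links, link_conf, alpha=0.9, beta=0.7):
    """Filter text fields and links by their confidence thresholds and
    build an undirected adjacency-list graph: nodes are kept text
    fields, edges are kept links whose two endpoints were both kept.
    `links` is a list of index pairs (i, j); `seg_conf` and `link_conf`
    are the softmax-normalized confidences from step (2.3)."""
    kept = {i for i, c in enumerate(seg_conf) if c >= alpha}
    graph = {i: [] for i in kept}
    for (i, j), c in zip(links, link_conf):
        if c >= beta and i in kept and j in kept:
            graph[i].append(j)
            graph[j].append(i)
    return graph
```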
(3.2) Perform a depth-first search on the graph to find the connected components; each component, denoted as a set B, contains the text fields joined together by links;
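An iterative depth-first search over the adjacency-list graph from step (3.1) suffices; a short sketch:

```python
def connected_components(graph):
    """Depth-first search over the link graph; each component is one
    set of mutually linked text fields to be combined into one word."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(graph[node])
        components.append(comp)
    return components
```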
(3.3) Combine the text field set S obtained by the depth-first search of step (3.2) into one complete word through the following steps (a code sketch of the whole procedure follows step (3.3.9)):
(3.3.1) Input: |S| is the number of text fields in the set S, where s^(i) is the i-th text field (i being a superscript index); x_s^(i) and y_s^(i) are the center abscissa and ordinate of the i-th text field bounding box s^(i); w_s^(i) and h_s^(i) are its width and height; θ_s^(i) is the angle between s^(i) and the horizontal direction;

(3.3.2) θ_b := (1/|S|) Σ_i θ_s^(i), where θ_b, the offset angle of the output bounding box, is the average of the offset angles θ_s^(i) of all text field bounding boxes in the set S;

(3.3.3) Find the intercept b of the line y = tan(θ_b)·x + b that minimizes the sum of the distances from the center points (x_s^(i), y_s^(i)) of all text fields in S to the line;

(3.3.4) Find the two endpoints (x_p, y_p) and (x_q, y_q) of the line segment, where p denotes the first endpoint and q the second; x_p, y_p are the abscissa and ordinate of the first endpoint, and x_q, y_q those of the second;

(3.3.5) x_b := (x_p + x_q)/2, y_b := (y_p + y_q)/2, where b denotes the output bounding box and x_b, y_b are the abscissa and ordinate of its center;

(3.3.6) w_b := sqrt((x_q - x_p)^2 + (y_q - y_p)^2) + (w_p + w_q)/2, where w_b is the width of the output bounding box, and w_p and w_q are the widths of the bounding boxes centered at the points p and q respectively;

(3.3.7) h_b := (1/|S|) Σ_i h_s^(i), where h_b, the height of the output bounding box, is the average of the heights of all text fields in the set S;

(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b); the output bounding box b is represented by its coordinate parameters, size parameters, and angle parameter;

(3.3.9) Output the combined bounding box b.
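The following NumPy sketch implements steps (3.3.1) through (3.3.9). Two details are simplifying assumptions, since the patent fixes only the formulas above: the intercept is found by least squares, and the endpoints are taken by ordering the centers along the line direction and dropping the extreme centers vertically onto the fitted line.

```python
import numpy as np

def combine_text_fields(segs):
    """Combine one connected component of text fields, each given as
    (x_s, y_s, w_s, h_s, theta_s), into a single oriented word box
    (x_b, y_b, w_b, h_b, theta_b)."""
    segs = np.asarray(segs, dtype=float)
    xs, ys, ws, hs, ts = segs.T
    theta_b = ts.mean()                    # (3.3.2) average angle
    slope = np.tan(theta_b)
    b = np.mean(ys - slope * xs)           # (3.3.3) least-squares intercept
    # (3.3.4) order centers along the line direction, take the extremes
    order = xs * np.cos(theta_b) + ys * np.sin(theta_b)
    p, q = int(np.argmin(order)), int(np.argmax(order))
    x_p, y_p = xs[p], slope * xs[p] + b
    x_q, y_q = xs[q], slope * xs[q] + b
    x_b = (x_p + x_q) / 2.0                # (3.3.5) center
    y_b = (y_p + y_q) / 2.0
    w_b = np.hypot(x_q - x_p, y_q - y_p) + (ws[p] + ws[q]) / 2.0  # (3.3.6)
    h_b = hs.mean()                        # (3.3.7) average height
    return x_b, y_b, w_b, h_b, theta_b     # (3.3.8)-(3.3.9)
```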
Compared with the prior art, the technical solution conceived above by the present invention has the following technical effects:

(1) It can detect multi-oriented text: text in natural scene images is often arbitrarily oriented or distorted. In the method of the present invention, a text region is described locally by text field bounding boxes, which can be set to any orientation and can therefore cover text of multiple orientations or distorted shapes.

(2) High flexibility: the method can detect text lines of any length, because the combination of text fields depends only on the predicted links; it can therefore detect both words and text lines;

(3) Strong robustness: the method uses text field bounding boxes for local description, and this local description can overcome complex natural image backgrounds and capture text features from the image;

(4) High efficiency: the text field detection model of the method is trained end to end and can process more than 20 images of size 512×512 per second, because the text fields and links are obtained in a single forward pass through the fully convolutional CNN model, with no need for offline scaling and rotation of the input image;

(5) Strong generality: some non-Latin scripts, such as Chinese, contain no spaces between adjacent words; existing techniques can only detect words and are unsuitable for such text, because text without spaces provides no visual cues for separating words. Besides Latin scripts, the present invention can also detect long text in non-Latin scripts, because the method does not rely on spaces to provide visual separation information.
Brief Description of the Drawings

Figure 1 is a flowchart of multi-oriented text detection in natural images based on linked text fields according to the present invention;

Figure 2 is a schematic diagram of computing the parameters of the ground-truth text field labels in the present invention;

Figure 3 is a schematic diagram of the output composition of the convolutional predictor of the present invention;

Figure 4 is the network connection diagram of the text field and link detection model of the present invention;

Figure 5 shows, for one embodiment of the present invention, the detected text fields and the output bounding boxes obtained by applying the trained text field and link detection network model to a text image.
Detailed Description of the Embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it. In addition, the technical features involved in the various embodiments of the present invention described below may be combined with one another as long as they do not conflict.
The technical terms of the present invention are first explained as follows:
Convolutional Neural Network (CNN): a neural network usable for tasks such as image classification and regression. The network usually consists of convolution layers, down-sampling layers, and fully connected layers. The convolution and down-sampling layers extract image features, while the fully connected layers perform classification or regression. The network parameters include the convolution kernels and the parameters and biases of the fully connected layers, and can be learned from data by the back-propagation algorithm;

VGG-16: VGGNet, the runner-up of ILSVRC 2014, contains 16 CONV/FC layers and has an appealingly uniform architecture that performs only 3×3 convolutions and 2×2 pooling from beginning to end, making it a classic convolutional neural network model. Its pre-trained models are available for plug-and-play use in Caffe. It demonstrated that network depth is a key component of good performance.

Depth-first search (DFS): an algorithm for traversing or searching a tree or graph. It traverses the nodes along the depth of the tree, searching each branch as deeply as possible. When all edges incident to a node v have been explored, the search backtracks to the start node of the edge from which v was discovered. This process continues until all nodes reachable from the source node have been found. If undiscovered nodes remain, one of them is chosen as a new source node and the process repeats until all nodes have been visited. DFS is a classic algorithm in graph theory; it can produce a topological ordering of the target graph, which conveniently solves many related graph problems, such as the maximum path problem.
As shown in Figure 1, the method of the present invention for detecting text in natural scenes based on linked text fields comprises the steps set out above: the training step (1) with sub-steps (1.1) through (1.2.8), the detection step (2) with sub-steps (2.1) through (2.3), and the combination step (3), carried out exactly as described in the Summary of the Invention.
(1)训练文字段连接检测网络模型,包括如下子步骤:(1) Training the text field connection detection network model, including the following sub-steps:
(1.1)以词条级别标记训练图像集中所有文本图像的文本内容,标签为词条的矩形初始包围盒的四个点坐标,得到训练数据集;(1.1) Mark the text content of all text images in the training image set with the entry level, and the label is the four point coordinates of the rectangular initial bounding box of the entry to obtain the training data set;
(1.2)定义用于根据词条标注可以预测输出文字段和连接的文字段检测模型,所述网络模型由级联卷积神经网络和卷积预测器组成,根据上述训练数据集计算得到文字段和连接的标签,设计损失函数,结合在线扩增和在线负样本难例挖掘技术手段,利用反向传导方法训练该网络,得到文字段检测模型,包括如下子步骤:(1.2) Define the text field detection model that can predict the output text field and connection according to the entry label, the network model is composed of a cascaded convolutional neural network and a convolution predictor, and the text field is calculated according to the above-mentioned training data set and connected tags, design a loss function, combine online amplification and online negative sample hard example mining techniques, use the reverse conduction method to train the network, and obtain a text field detection model, including the following sub-steps:
(1.2.1)构建文字段检测卷积神经网络模型:提取特征的前几层卷积单元来自预训练的VGG-16网络,前几层卷积单元为卷积层1到池化层5,全连接层6和全连接层7分别转换为卷积层6和卷积层7,连接在其后的是一些额外加入的卷积层,用于提取更深度的特征进行检测,包括卷积层8、卷积层9、卷积层10,最后一层是卷积层11;后6个不同的卷积层分别输出不同尺寸的特征图,便于提取出多种尺度的高质量特征,检测文字段和连接是在这六个不同尺寸的特征图上进行的;对于这6个卷积层,每一层之后都添加尺寸为3×3的滤波器作为卷积预测器,来共同检测文字段和连接;(1.2.1) Constructing a text field detection convolutional neural network model: the first few layers of convolutional units for extracting features come from the pre-trained VGG-16 network, and the first few layers of convolutional units are convolutional layer 1 to pooling layer 5, Fully connected layer 6 and fully connected layer 7 are converted to convolutional layer 6 and convolutional layer 7 respectively, followed by some additional convolutional layers for extracting deeper features for detection, including convolutional layers 8. Convolutional layer 9, convolutional layer 10, and the last layer is convolutional layer 11; the last 6 different convolutional layers output feature maps of different sizes, which is convenient for extracting high-quality features of multiple scales and detecting text Segments and connections are performed on these six feature maps of different sizes; for these six convolutional layers, a filter of size 3×3 is added after each layer as a convolutional predictor to jointly detect text fields and connect;
(1.2.2)从标注的词包围盒产生文字段包围盒标签:对于原始训练图像集Itr,记缩放后的训练图像集为Itr′,wI、hI分别为Itr′的宽度和高度,可以取384×384或512×512像素,第i张图片Itri′作为模型输入,Itri′上标注的所有词包围盒记作Wi=[Wi1,...,Wip],其中Wij为第i张图片上的第j个词包围盒,词包围盒可以是单词级别也可以是词条级别,j=1,...,p,p为第i张图片上词包围盒的总数量;记后6层卷积层分别输出的特征图构成集合Itroi′=[Itroi1′,...,Itroi6′],其中Itroil′为后6层卷积层中第l层输出的特征图,wl、hl分别为该特征图的宽度和高度,Itroil′上的坐标(x,y)对应Itri′上以(xa,ya)为中心点坐标的水平初始包围盒Bilq,它们满足下列公式:(1.2.2) Generate text field bounding box labels from marked word bounding boxes: For the original training image set Itr, record the scaled training image set as Itr′, w I , h I are the width and height of Itr′ respectively, Can take 384×384 or 512×512 pixels, the i-th picture Itr i ′ is used as the model input, and all word bounding boxes marked on Itr i ′ are recorded as W i =[W i1 ,...,W ip ], where W ij is the jth word bounding box on the i-th picture, the word bounding box can be word level or entry level, j=1,..., p, p is the word bounding box on the i-th picture The total number of ; Note that the feature maps output by the last 6 convolutional layers constitute a set Itro i ′=[Itro i1 ′,...,Itro i6 ′], where Itro il ′ is the lth in the last 6 convolutional layers The feature map output by the layer, w l and h l are the width and height of the feature map respectively, and the coordinates (x, y) on Itro il ′ correspond to the coordinates of the center point on Itr i ′ (x a , y a ) Horizontal initial bounding boxes B ilq , they satisfy the following formula:
初始包围盒Bilq的宽和高都被设置成一个常数al,用于控制输出文字段的比例,l=1,...,6;记第l层输出的特征图Itroil′对应的初始包围盒集合为Bil=[Bil1,...,Bilm],q=1,...,m,其中m为第l层输出的特征图上初始包围盒的数目;只要初始包围盒Bilq的中心被包含在Itr′上任一标注的词包围盒Wij内部,且Bilq的尺寸al和该标注的词包围盒Wij的高度h满足:那么这个初始包围盒Bilq被标记为正类,标签取值为1,并与高度最为接近的那个词包围盒Wij匹配;否则,当Bilq与所有词包围盒Wi都不满足上述两个条件时,Bilq就被标记为负类,标签取值为0;文字段在初始包围盒上产生,与初始包围盒标签类别相同;其中,比例常数1.5为经验值;The width and height of the initial bounding box B ilq are both set to a constant a l to control the proportion of the output text field, l=1,...,6; record the feature map Itro il 'corresponding to the l-th layer output The set of initial bounding boxes is B il =[B il1 ,...,B ilm ], q=1,...,m, where m is the number of initial bounding boxes on the feature map output by layer l; as long as the initial bounding boxes The center of the box B ilq is contained inside any marked word bounding box W ij on Itr′, and the size a l of B ilq and the height h of the marked word bounding box W ij satisfy: Then this initial bounding box B ilq is marked as a positive class, the value of the label is 1, and it matches the word bounding box W ij with the closest height; otherwise, when B ilq and all word bounding boxes W i do not satisfy the above two When a condition is met, B ilq is marked as a negative class, and the value of the label is 0; the text field is generated on the initial bounding box, which is the same as the label category of the initial bounding box; wherein, the proportional constant 1.5 is an empirical value;
(1.2.3)在所述步骤(1.2.2)产生的带标签的初始包围盒上产生文字段并计算正类文字段偏移量:负类文字段包围盒s-为负类初始包围盒B-;正类文字段包围盒s+由正类初始包围盒B+经过以下步骤得到:a)记正类初始包围盒B+匹配到的标注词包围盒W与水平方向夹角为θs,以B+的中心点为中心,将W顺时针旋转θs角;b)裁剪W,去除超出B+左边和右边的部分;c)以B+的中心点为中心,将裁剪后的词包围盒W′逆时针旋转θs角,得到文字段s+真实标签的几何参数xs、ys、ws、hs、θs;d)计算得到文s+相对于B+的偏移量(Δxs,Δys,Δws,Δhs,Δθs),计算公式如下:(1.2.3) Generate a text field on the labeled initial bounding box generated in the step (1.2.2) and calculate the offset of the positive text field: the negative text field bounding box s - is the negative initial bounding box B - ; the bounding box s of the positive text field + is obtained from the initial bounding box B + of the positive class through the following steps: a) Record the initial bounding box B of the positive class + the matched tag word bounding box W and the horizontal direction The angle is θ s , with the central point of B + as the center, rotate W clockwise by θ s angle; b) crop W, and remove the part beyond the left and right of B + ; c) center the center point of B + , the cropped word The bounding box W′ is rotated counterclockwise by the θ s angle, and the geometric parameters x s , y s , w s , h s , θ s of the text field s + the real label are obtained; d) Calculate the offset of the text s + relative to B + Quantity (Δx s , Δy s , Δw s , Δh s , Δθ s ), the calculation formula is as follows:
xs=alΔxs+xa x s =a l Δx s +x a
ys=alΔys+ya y s = a l Δy s + y a
ws=alexp(Δws)w s = a l exp(Δw s )
hs=alexp(Δhs)h s = a l exp(Δh s )
θs=Δθs θ s = Δθ s
其中,xs、ys、ws、hs、θs分别为文字段包围盒s+的中心点横坐标、中心点纵坐标、宽度、高度以及与水平方向之间的夹角;xa、ya、wa、ha分别为水平初始包围盒B+的中心点横坐标、中心点纵坐标、宽度、高度;Δxs、Δys、Δws、Δhs、Δθs分别为文字段包围盒s+中心点横坐标xs相对初始包围盒B+的偏移量、纵坐标ys相对初始包围盒的偏移量、宽度ws的偏移变化量、高度hs的偏移变化量、角度θs的偏移量;Among them, x s , y s , w s , h s , and θ s are respectively the abscissa of the center point, the ordinate of the center point, width, height, and the angle between the text field bounding box s + and the horizontal direction; x a , y a , w a , h a are the abscissa, ordinate, width, and height of the center point of the horizontal initial bounding box B + respectively; Δx s , Δy s , Δw s , Δh s , and Δθ s are text fields The offset of bounding box s + central point abscissa x s relative to initial bounding box B + , the offset of vertical coordinate y s relative to initial bounding box, the offset change of width w s , and the offset change of height h s amount, the offset of angle θ s ;
(1.2.4)对于步骤(1.2.3)产生的文字段包围盒计算连接标签:文字段s是在初始包围盒B上产生的,因此s之间的连接标签和它们对应的初始包围盒B之间的连接标签相同;对于特征图集合Itroi′=[Itroi1′,...,Itroi6′],如果在同一张特征图Itroil′的初始包围盒集合Bil里,两个初始包围盒的标签都是正类,且匹配到同一个词,那么之间的层内连接被标记为正类,否则标记为负类;如果在特征图Itroil′对应的初始包围盒集合Bil里的初始包围盒和Itroi(l-1)′对应的的初始包围盒集合Bi(l-1)里的初始包围盒的标签都是正类且匹配到同一个词包围盒Wij,那么之间的跨层连接被标记为正类,否则标记为负类;(1.2.4) Calculate the connection label for the bounding box of the text field generated in step (1.2.3): the text field s is generated on the initial bounding box B, so the connection labels between s and their corresponding initial bounding box B The connection labels between are the same; for the feature map set Itro i ′=[Itro i1 ′,...,Itro i6 ′], if in the initial bounding box set B il of the same feature map Itro il ′, two initial bounding box The labels of are all positive classes, and matches the same word, then The intra-layer connection between is marked as a positive class, otherwise it is marked as a negative class; if the initial bounding box in the initial bounding box set B il corresponding to the feature map Itro il ′ The initial bounding box in the initial bounding box set B i(l-1) corresponding to Itro i(l-1) ′ labels are all positive and match the same word bounding box W ij , then The cross-layer connections between are marked as positive class, otherwise marked as negative class;
(1.2.5)以缩放后的训练图像集Itr′作为文字段检测模型输入,预测文字段s输出:对模型初始化权重和偏置,前6万次训练迭代步骤学习率设置为10-3,之后学习率衰减到10-4;对于后6层卷积层,在第l层特征图Itroil′上的坐标(x,y)处,(x,y)对应到输入图像Itri′上以(xa,ya)为中心点坐标、以al为尺寸的初始包围盒Bilq,3×3的卷积预测器都会预测出Bilq被分别划分成正、负类的得分cs,cs为二维向量,取值范围为0-1之间的小数。同时也预测出5个数字作为被划分到正类文字段s+时的几何偏移量,其中分别为预测的文字段包围盒s+中心点横坐标相对正类初始包围盒B+的偏移量、纵坐标的相对正类初始包围盒B+的偏移量、高度的偏移变化量、宽度的偏移变化量、角度偏移量;(1.2.5) Take the scaled training image set Itr′ as the input of the text field detection model, predict the text field s output: initialize the weight and bias of the model, set the learning rate of the first 60,000 training iteration steps to 10 -3 , Then the learning rate decays to 10 -4 ; for the last 6 convolutional layers, at the coordinates (x, y) on the feature map Itro il ′ of the first layer, (x, y) corresponds to the input image Itr i ′ with (x a , y a ) is the center point coordinates, the initial bounding box B ilq with a l as the size, the 3×3 convolutional predictor will predict B ilq is divided into positive and negative scores c s , c s is a two-dimensional vector, and its value range is a decimal between 0-1. Also predicted 5 numbers As the geometric offset when divided into the positive text field s + , where Respectively, the predicted text field bounding box s + center point abscissa relative to the offset of the positive initial bounding box B + , the vertical coordinate relative to the offset of the positive initial bounding box B + , the offset change in height, Width offset variation, angle offset;
(1.2.6)在已预测的文字段基础上预测层内连接和跨层连接输出:对于层内连接,在同一张特征图Itroil′上坐标点(x,y)处,取x-1≤x′≤x+1、y-1≤y′≤y+1范围内近邻的点(x′,y′),这8个点对应到输入图像Itri′时,便获得了与(x,y)对应的基准文字段s(x,y,l)相连接的8个层内近邻文字段s(x′,y′,l),8个层内近邻文字段可表示为集合:(1.2.6) Predict the output of intra-layer connection and cross-layer connection based on the predicted text field: for intra-layer connection, at the coordinate point (x, y) on the same feature map Itro il ′, take x-1 ≤x′≤x+1, y-1≤y′≤y+1, the neighbor points (x′, y′) in the range of y-1≤y′≤y+1, when these 8 points correspond to the input image Itr i ′, the result of , y) The 8 intra-layer neighbor text fields s (x′, y′, l) connected with the reference text field s (x, y, l) corresponding to y), the 8 intra-layer neighbor text fields can be expressed as a set:
3×3卷积预测器会预测出s(x,y,l)与层内近邻集合的连接的正、负得分cl1,cl1为16维向量,其中,w为上标,表示层内连接;A 3×3 convolutional predictor predicts s (x, y, l) and the set of neighbors in the layer The positive and negative scores c l1 of the connection, c l1 is a 16-dimensional vector, where w is a superscript, indicating the connection within the layer;
对于跨层连接,一个跨层连接将两个连续卷积层输出的特征图上两个点处对应的文字段相连;由于每经过一层卷积层,特征图的的宽度和高度都会缩小一半,第l层输出特征图Itroil′的宽度wl和高度hl是第l-1层特征图Itroi(l-1)′的宽度wl-1和高度hl-1的一半,而Itroil′对应的初始包围盒尺度al是Itroi(l-1)′对应的初始包围盒尺度al-1的2倍,对于在第l层输出特征图Itroil′上的(x,y),在特征图Itroi(l-1)′上取2x≤x′≤2x+1、2y≤y′≤2y+1范围内的4个跨层近邻点(x′,y′),Itroil′上(x,y)对应到输入图像Itri′上的初始包围盒刚好与Itroi(l-1)′上4个跨层近邻点对应到输入图像Itri′上的4个初始包围盒空间位置重合,4个跨层近邻文字段s(x′,y′,l-1)可表示为集合:For cross-layer connection, a cross-layer connection connects the text fields corresponding to two points on the feature map output by two consecutive convolutional layers; because each layer of convolutional layer passes through, the width and height of the feature map will be reduced by half , the width w l and height h l of the l-th layer output feature map Itro il ′ are half of the width w l-1 and height h l-1 of the l-1 layer feature map Itro i(l-1) ′, and The initial bounding box scale a l corresponding to Itro il ′ is twice the initial bounding box scale a l-1 corresponding to Itro i(l-1) ′, for (x, y), take 4 cross-layer neighbor points (x', y') in the range of 2x≤x'≤2x+1, 2y≤y'≤2y+1 on the feature map Itro i(l-1) ', (x, y) on Itro il ′ corresponds to the initial bounding box on the input image Itr i ′, and the 4 cross-layer neighbor points on Itro i(l-1) ′ correspond to the 4 initial bounding boxes on the input image Itr i ′ The spatial positions of the bounding boxes coincide, and the four cross-layer neighbor text fields s (x′, y′, l-1) can be expressed as a set:
3×3卷积预测器会预测出第l层基准文字段s(x,y,l)与第l-1层上近邻文字段集合之间跨层连接的正、负得分cl2,cl2为8维向量:The 3×3 convolutional predictor will predict the base text field s (x, y, l) of layer l and the set of adjacent text fields on layer l-1 The positive and negative scores c l2 of the cross-layer connection, c l2 is an 8-dimensional vector:
其中,表示预测器预测出s(x,y,l)与其所有4个近邻文字段之间的连接的正、负得分,c为上标,表示跨层连接;in, Indicates that the predictor predicts the positive and negative scores of the connection between s (x, y, l) and all 4 adjacent text fields, and c is a superscript, indicating a cross-layer connection;
所有的层内连接和所有的跨层连接构成连接集合Ns;All intralayer connections and all cross-layer connections Constitute the connection set N s ;
(1.2.7)以步骤(1.2.3)和步骤(1.2.4)获得的文字段标签、正类文字段真实偏移量、连接标签作为输出基准,以步骤(1.2.5)预测的文字段类别及得分、预测的文字段偏移量、步骤(1.2.6)预测的连接得分为预测输出,设计预测输出与输出基准之间的目标损失函数,对文字段连接检测模型利用反向传导法进行不断地训练,来最小化文字段分类、文字段偏移回归和连接分类的损失,针对所述文字段连接检测模型设计目标损失函数,目标损失函数是三个损失的加权和:(1.2.7) Take the text field label obtained in step (1.2.3) and step (1.2.4), the real offset of the positive text field, and the connection label as the output reference, and the text predicted by step (1.2.5) The category and score of the segment, the offset of the predicted text field, and the connection score predicted in step (1.2.6) are the predicted output, and the target loss function between the predicted output and the output benchmark is designed, and the reverse conduction is used for the text field connection detection model The method is continuously trained to minimize the loss of text field classification, text field offset regression and connection classification, and the target loss function is designed for the text field connection detection model. The target loss function is the weighted sum of three losses:
其中ys是所有文字段的标签,cs是预测的文字段得分,yl是预测的连接标签,cl是预测的连接得分,由层内连接得分cl1和跨层得分cl2组成;如果第i个初始包围盒标记为正类,那么ys(i)=1,否则为0;Lconf(ys,cs)是预测的文字段得分cs的softmax损失,Lconf(ys,cl)是预测的连接得分cl的softmax损失,是预测的文字段几何参数s和真实标签之间的平滑L1回归损失;ns是正类初始包围盒的数量,用来对文字段分类和回归损失进行归一化;nl是正类连接总数,用来对连接分类损失进行归一化;λ1和λ2为权重常数,在实际中取1。where y s is the label of all text fields, c s is the predicted text field score, y l is the predicted connection label, c l is the predicted connection score, which is composed of intra-layer connection score c l1 and cross-layer score c l2 ; If the i-th initial bounding box is marked as a positive class, then y s (i) = 1, otherwise 0; L conf (y s , c s ) is the softmax loss of the predicted text field score c s , L conf (y s , c l ) is the softmax loss of the predicted connection score c l , is the predicted text field geometry parameter s and the ground truth label The smoothed L between 1 regression loss; n s is the number of positive initial bounding boxes, used to normalize text field classification and regression loss; n l is the total number of positive class connections, used to normalize the connection classification loss ; λ 1 and λ 2 are weight constants, which are 1 in practice.
(1.2.8)在步骤(1.2.7)的训练过程中,采用在线扩增方法对训练数据Itr进行在线扩增,并采用在线负样本难例挖掘策略来平衡正样本和负样本。在训练图片Itr被缩放到相同的尺寸并批量加载之前,它们被随机地裁剪成一个个图像块,每个图像块与文字段的真实包围盒的jaccard重叠系数o最小;对于多方向文字,数据扩增是在多方向文字包围盒的最小包围矩形上进行的,每个样本的重叠系数o从0、0.1、0.3、0.5、0.7和0.9中随机选择,图像块的大小为原始图片尺寸的0.1-1倍之间;训练图像不水平翻转;另外,文字段和连接负样本占据训练样本的大部分,采用在线负样本难例挖掘策略来平衡正样本和负样本,对文字段和连接分开进行挖掘,控制负样本与正样本之间的比例不超过3∶1。(1.2.8) During the training process of step (1.2.7), the online amplification method is used to amplify the training data I tr online, and the online negative sample hard case mining strategy is used to balance positive samples and negative samples. Before the training images I tr are scaled to the same size and loaded in batches, they are randomly cropped into image blocks, and the jaccard overlap coefficient o of each image block and the true bounding box of the text field is the smallest; for multi-directional text, The data augmentation is carried out on the minimum enclosing rectangle of the multi-directional text bounding box, the overlap coefficient o of each sample is randomly selected from 0, 0.1, 0.3, 0.5, 0.7 and 0.9, and the size of the image block is the size of the original image Between 0.1 and 1 times; the training image is not flipped horizontally; in addition, text fields and connection negative samples occupy most of the training samples, and the online negative sample hard case mining strategy is used to balance the positive samples and negative samples, and the text field and connection are separated For mining, control the ratio between negative samples and positive samples not to exceed 3:1.
(2) Use the trained convolutional neural network to perform text-field and connection detection on the text image to be detected, comprising the following sub-steps:
(2.1) Perform text-field detection on the text image to be detected. Feature maps output by different convolutional layers predict text fields of different scales, while feature maps output by the same convolutional layer predict text fields of the same scale. The i-th text image Itsti in the test set Itst is scaled to a uniform size, which can be set manually according to the images to be detected; denote the scaled image by Itsti′. Input Itsti′ into the text-field connection detection model trained in step (1.2) to obtain the set Itstoi′ = [Itstoi1′, ..., Itstoi6′] of feature maps output by the last six convolutional layers, where Itstoil′ is the feature map output by the l-th of those layers, l = 1, ..., 6. At each coordinate (x, y) of every output feature map Itstoil′, a 3×3 convolutional predictor predicts the scores cs that the initial bounding box Bilq corresponding to (x, y) is a positive or negative text field, and also predicts five numbers giving the geometric offsets when it is predicted to be a positive text field s+;
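The five predicted numbers can be decoded into oriented segment geometry; the sketch below uses the common SSD-style exponential parameterization, and the scale factor al and exact encoding are assumptions for illustration (the actual offset encoding mirrors the one used for the ground-truth offsets during training).

```python
import numpy as np

def decode_segment(xa, ya, al, offsets):
    """Decode five predicted offsets (dx, dy, dw, dh, dtheta) of an
    initial box centered at (xa, ya) with scale al into an oriented
    text field (x, y, w, h, theta). Illustrative parameterization."""
    dx, dy, dw, dh, dtheta = offsets
    x = xa + al * dx
    y = ya + al * dy
    w = al * np.exp(dw)          # exponential keeps width/height positive
    h = al * np.exp(dh)
    theta = dtheta               # rotation offset used directly
    return x, y, w, h, theta
```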
(2.2) Perform connection detection on the text fields detected on all feature layers; the connections comprise intra-layer connections and cross-layer connections, predicted on top of the text fields from (2.1). For intra-layer connections, at coordinate (x, y) of a feature map Itstoil′, the 3×3 convolutional predictor predicts the positive and negative scores cl1 of the intra-layer connections between s(x, y, l) and its 8 neighboring text fields. For cross-layer connections, the 3×3 convolutional predictor predicts the positive and negative cross-layer connection scores cl2 between the reference text field s(x, y, l) on layer l and its 4 neighboring text fields on layer l−1; cl1 and cl2 together constitute the predicted connection score cl;
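The two neighborhoods can be enumerated as below; the assumption that layer l−1 has twice the spatial resolution of layer l is illustrative, matching the usual layout of the last six convolutional feature maps.

```python
def intra_layer_neighbors(x, y):
    """8-connected neighborhood of the text field at (x, y) on the
    same feature map (intra-layer connections)."""
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def cross_layer_neighbors(x, y):
    """4 neighbors on layer l-1 for the reference field at (x, y) on
    layer l, assuming layer l-1 has twice the spatial resolution."""
    return [(2 * x + dx, 2 * y + dy) for dx in (0, 1) for dy in (0, 1)]
```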
(2.3) Combine the detected text-field confidence scores, which comprise the positive/negative class scores and the offset predictions, with the connection confidence scores; the convolutional predictor outputs softmax-normalized scores for both, computed from the intra-layer scores cl1 and cross-layer scores cl2 predicted in (2.2).
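A small helper showing the softmax normalization applied to the raw two-class predictor outputs; the array layout (negative score first) is an assumption.

```python
import numpy as np

def positive_scores(logits):
    """Softmax-normalize raw two-class predictor outputs and return the
    positive-class probability used for thresholding in step (3.1).
    logits: (..., 2) array of (negative, positive) scores."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return (e / e.sum(axis=-1, keepdims=True))[..., 1]
```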
(3) Combine the text fields and connections to obtain the output bounding boxes, comprising the following sub-steps:
(3.1) Using the normalized scores obtained in (2.3), filter the text fields and connections output by the convolutional predictor, then build a connection graph with the filtered text fields as nodes and the filtered connections as edges. The fixed number of text fields s and connections Ns produced in step (2) are filtered by their scores, with separate thresholds α for text fields and β for connections. The thresholds can be set manually for different data: in practice one may take α = 0.9, β = 0.7 for multi-oriented text detection, α = 0.9, β = 0.5 for multilingual long-text detection, and α = 0.6, β = 0.3 for horizontal text detection. The filtered text fields s′ serve as nodes and the filtered connections Ns′ as edges of the graph;
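A sketch of the filtering-and-graph-construction step with the suggested multi-oriented thresholds; representing each connection as an (i, j) pair of text-field indices is an assumption.

```python
def build_link_graph(segments, seg_scores, links, link_scores,
                     alpha=0.9, beta=0.7):
    """Keep text fields with score >= alpha and connections with
    score >= beta, then build an adjacency list with text fields as
    nodes and connections as edges."""
    keep = {i for i, sc in enumerate(seg_scores) if sc >= alpha}
    graph = {i: [] for i in keep}
    for (i, j), sc in zip(links, link_scores):
        if sc >= beta and i in keep and j in keep:
            graph[i].append(j)       # undirected edge between fields i, j
            graph[j].append(i)
    return graph
```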
(3.2) Perform a depth-first search on the graph to find connected components; each component is denoted as a set B containing the text fields linked together by connections;
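An iterative depth-first search over the link graph, matching step (3.2); names are illustrative.

```python
def connected_components(graph):
    """Depth-first search over the connection graph; each component is
    a set B of mutually connected text fields."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # iterative DFS
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(graph[node])
        components.append(comp)
    return components
```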
(3.3) Combine the text-field set S obtained by the depth-first search of step (3.2) into a complete word through the following steps (a consolidated code sketch follows the list):
(3.3.1) Input: |S| is the number of text fields in the set S, where s(i) is the i-th text field (i is a superscript index); x(i) and y(i) are the abscissa and ordinate of the center of the text-field bounding box s(i); w(i) and h(i) are its width and height; and θ(i) is the angle between the bounding box s(i) and the horizontal direction;
(3.3.2) θb := (1/|S|) Σi θ(i), where θb is the offset angle of the output bounding box and θ(i) is the offset angle of the i-th text-field bounding box in the set; θb is the average offset angle of all text fields in S;
(3.3.3) Find the intercept b of the straight line y = tan(θb)·x + b that minimizes the sum of distances from the center points (x(i), y(i)) of all text fields in S to the line;
(3.3.4) Find the two endpoints (xp, yp) and (xq, yq) of the line segment spanned by the text fields in S, where p denotes the first endpoint and q the second; xp and yp are the abscissa and ordinate of the first endpoint, and xq and yq those of the second;
(3.3.5) xb := (xp + xq)/2, yb := (yp + yq)/2, where b denotes the output bounding box and xb, yb are the abscissa and ordinate of its center, i.e., the midpoint of the two endpoints;
(3.3.6) wb := ‖(xp, yp) − (xq, yq)‖ + (wp + wq)/2, where wb is the width of the output bounding box and wp, wq are the widths of the bounding boxes centered at point p and point q, respectively;
(3.3.7) hb := (1/|S|) Σi h(i), where hb is the height of the output bounding box and h(i) is the height of the i-th text-field bounding box in the set; hb is the average height of all text fields in S;
(3.3.8) b := (xb, yb, wb, hb, θb): the output bounding box b is represented by its center coordinates, size parameters, and angle parameter;
(3.3.9) Output the combined bounding box b.
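Steps (3.3.1)–(3.3.9) can be sketched as below. Two simplifications are assumptions: the intercept minimizes squared vertical distance rather than the exact distance criterion, and the endpoints are taken as the extreme projections of the text-field centers onto the fitted line.

```python
import numpy as np

def combine_segments(S):
    """Combine a connected set S of text fields into one oriented word
    box, following steps (3.3.1)-(3.3.9). Each element of S is a tuple
    (x, y, w, h, theta); assumes |theta_b| < 90 degrees."""
    S = np.asarray(S, dtype=float)
    x, y, w, h, theta = S.T

    theta_b = theta.mean()                    # (3.3.2) average angle
    t = np.tan(theta_b)
    b_icpt = np.mean(y - t * x)               # (3.3.3) least-squares intercept

    # (3.3.4) project centers onto the line and take the extreme points
    d = np.array([1.0, t]) / np.hypot(1.0, t)   # unit direction of the line
    proj = x * d[0] + y * d[1]
    p, q = np.argmin(proj), np.argmax(proj)
    xp, yp = x[p], t * x[p] + b_icpt
    xq, yq = x[q], t * x[q] + b_icpt

    xb, yb = (xp + xq) / 2, (yp + yq) / 2                # (3.3.5) center
    wb = np.hypot(xq - xp, yq - yp) + (w[p] + w[q]) / 2  # (3.3.6) width
    hb = h.mean()                                        # (3.3.7) height
    return xb, yb, wb, hb, theta_b                       # (3.3.8)-(3.3.9)
```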
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710010596.7A CN106897732B (en) | 2017-01-06 | 2017-01-06 | A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106897732A CN106897732A (en) | 2017-06-27 |
CN106897732B true CN106897732B (en) | 2019-10-08 |
Family
ID=59197865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710010596.7A Active CN106897732B (en) | 2017-01-06 | 2017-01-06 | A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106897732B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11797774B2 (en) * | 2019-07-16 | 2023-10-24 | Ancestry.Com Operations Inc. | Extraction of genealogy data from obituaries |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304761A (en) * | 2017-09-25 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method for text detection, device, storage medium and computer equipment |
CN107766860A (en) * | 2017-10-31 | 2018-03-06 | 武汉大学 | Natural scene image Method for text detection based on concatenated convolutional neutral net |
CN107977620B (en) * | 2017-11-29 | 2020-05-19 | 华中科技大学 | Multi-direction scene text single detection method based on full convolution network |
CN107844785B (en) * | 2017-12-08 | 2019-09-24 | 浙江捷尚视觉科技股份有限公司 | A kind of method for detecting human face based on size estimation |
CN108304835B (en) * | 2018-01-30 | 2019-12-06 | 百度在线网络技术(北京)有限公司 | character detection method and device |
CN108427924B (en) * | 2018-03-09 | 2020-06-23 | 华中科技大学 | A Text Regression Detection Method Based on Rotation Sensitive Features |
CN108549893B (en) * | 2018-04-04 | 2020-03-31 | 华中科技大学 | An End-to-End Recognition Method for Scene Texts of Arbitrary Shapes |
CN109086663B (en) * | 2018-06-27 | 2021-11-05 | 大连理工大学 | Scale-adaptive natural scene text detection method based on convolutional neural network |
CN109583367A (en) * | 2018-11-28 | 2019-04-05 | 网易(杭州)网络有限公司 | Image text row detection method and device, storage medium and electronic equipment |
CN109685718B (en) * | 2018-12-17 | 2020-11-10 | 中国科学院自动化研究所 | Picture squaring zooming method, system and device |
CN109886286B (en) * | 2019-01-03 | 2021-07-23 | 武汉精测电子集团股份有限公司 | Target detection method, target detection model and system based on cascade detectors |
CN109886264A (en) * | 2019-01-08 | 2019-06-14 | 深圳禾思众成科技有限公司 | A kind of character detecting method, equipment and computer readable storage medium |
CN109977997B (en) * | 2019-02-13 | 2021-02-02 | 中国科学院自动化研究所 | Image target detection and segmentation method based on convolutional neural network rapid robustness |
CN110032969B (en) * | 2019-04-11 | 2021-11-05 | 北京百度网讯科技有限公司 | Method, apparatus, device, and medium for detecting text region in image |
CN110490232B (en) * | 2019-07-18 | 2021-08-13 | 北京捷通华声科技股份有限公司 | Method, device, equipment and medium for training character row direction prediction model |
CN113065544B (en) * | 2020-01-02 | 2024-05-10 | 阿里巴巴集团控股有限公司 | Character recognition method and device and electronic equipment |
CN111259764A (en) * | 2020-01-10 | 2020-06-09 | 中国科学技术大学 | Text detection method and device, electronic equipment and storage device |
CN111291759A (en) * | 2020-01-17 | 2020-06-16 | 北京三快在线科技有限公司 | Character detection method and device, electronic equipment and storage medium |
CN111444674B (en) * | 2020-03-09 | 2022-07-01 | 稿定(厦门)科技有限公司 | Character deformation method, medium and computer equipment |
CN113515920B (en) * | 2020-04-09 | 2024-06-21 | 北京庖丁科技有限公司 | Method, electronic device and computer readable medium for extracting formulas from tables |
CN111967463A (en) * | 2020-06-23 | 2020-11-20 | 南昌大学 | Method for detecting curve fitting of curved text in natural scene |
CN111914822B (en) * | 2020-07-23 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Text image labeling method, device, computer readable storage medium and equipment |
CN113888759A (en) * | 2021-10-13 | 2022-01-04 | 广东金赋科技股份有限公司 | Key-value pair extraction method and system based on deep learning model |
CN115620081B (en) * | 2022-09-27 | 2023-07-07 | 北京百度网讯科技有限公司 | Training method of target detection model and target detection method and device |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050471B (en) * | 2014-05-27 | 2017-02-01 | 华中科技大学 | Natural scene character detection method and system |
CN105989330A (en) * | 2015-02-03 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Picture detection method and apparatus |
CN106156711B (en) * | 2015-04-21 | 2020-06-30 | 华中科技大学 | Text line positioning method and device |
CN105184312B (en) * | 2015-08-24 | 2018-09-25 | 中国科学院自动化研究所 | A kind of character detecting method and device based on deep learning |
CN105469047B (en) * | 2015-11-23 | 2019-02-22 | 上海交通大学 | Chinese detection method and system based on unsupervised learning deep learning network |
CN105608456B (en) * | 2015-12-22 | 2017-07-18 | 华中科技大学 | A kind of multi-direction Method for text detection based on full convolutional network |
CN105574513B (en) * | 2015-12-22 | 2017-11-24 | 北京旷视科技有限公司 | Character detecting method and device |
- 2017-01-06: Application CN201710010596.7A filed; granted as CN106897732B (legal status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN106897732A (en) | 2017-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106897732B (en) | A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields | |
Lyu et al. | Multi-oriented scene text detection via corner localization and region segmentation | |
CN107977620B (en) | Multi-direction scene text single detection method based on full convolution network | |
Yuliang et al. | Detecting curve text in the wild: New dataset and new solution | |
Ma et al. | Rpt: Learning point set representation for siamese visual tracking | |
Barroso-Laguna et al. | Key. net: Keypoint detection by handcrafted and learned cnn filters revisited | |
CN106650725B (en) | Candidate text box generation and text detection method based on fully convolutional neural network | |
Donati et al. | Deep orientation-aware functional maps: Tackling symmetry issues in shape matching | |
CN108427924A (en) | A kind of text recurrence detection method based on rotational sensitive feature | |
CN111079739B (en) | Multi-scale attention feature detection method | |
Xia et al. | Loop closure detection for visual SLAM using PCANet features | |
Yu et al. | Robust thermal infrared object tracking with continuous correlation filters and adaptive feature fusion | |
Zhao et al. | Adversarial deep tracking | |
Han et al. | Research on remote sensing image target recognition based on deep convolution neural network | |
CN113011359B (en) | Method for simultaneously detecting plane structure and generating plane description based on image and application | |
Wang et al. | Adaptive temporal feature modeling for visual tracking via cross-channel learning | |
He et al. | Aggregating local context for accurate scene text detection | |
Sun et al. | Pseudo-LiDAR-based road detection | |
Wang et al. | A robust approach for scene text detection and tracking in video | |
Gu et al. | Linear time offline tracking and lower envelope algorithms | |
Dai et al. | RGB‐D SLAM with moving object tracking in dynamic environments | |
Pu et al. | Learning temporal regularized correlation filter tracker with spatial reliable constraint | |
Geng et al. | SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation | |
Li et al. | Learning spatial self‐attention information for visual tracking | |
Li et al. | Centroid-based graph matching networks for planar object tracking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||