
CN106897732B - A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields - Google Patents


Info

Publication number
CN106897732B
CN106897732B (application CN201710010596.7A; publication CN106897732A)
Authority
CN
China
Prior art keywords
text
bounding box
text field
connection
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710010596.7A
Other languages
Chinese (zh)
Other versions
CN106897732A (en)
Inventor
白翔 (Bai Xiang)
石葆光 (Shi Baoguang)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201710010596.7A
Publication of CN106897732A
Application granted
Publication of CN106897732B
Active legal status (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225: Image preprocessing by selection of a specific region, based on a marking or identifier characterising the area
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-oriented text detection method for natural images based on linked text fields. Text fields and links are the two key elements of the detection method, defined as follows: a text field is one of many small oriented bounding-box regions marked off on the image, each covering a part of a word or text line; a link joins two adjacent text fields, indicating that they belong to the same word or sentence. Text fields and links are detected jointly, at evenly spaced positions and at multiple scales, by a single fully convolutional neural network trained end to end. The final detection result is obtained by first joining linked text fields into new regions and then combining those regions. Compared with the prior art, the proposed method achieves excellent accuracy, speed, and model simplicity; it is efficient and robust, can cope with complex image backgrounds, and can also detect non-Latin text and long text in images.

Description

A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Technical Field

The invention belongs to the technical field of computer vision, and more specifically relates to a multi-oriented text detection method for natural images based on linked text fields.

Background

Reading text in natural images is a challenging and popular task with many practical applications in photo OCR, geolocation, and image retrieval. In a text reading system, text detection, which locates text regions with bounding boxes at the word or text-line level, is usually the critical first step. In a sense, text detection can also be viewed as a special case of object detection in which words, characters, or text lines are the detection targets.

Although existing techniques have had great success in applying object detection methods to text detection, object detection methods still have several obvious shortcomings when localizing text regions. First, the aspect ratio of a word or text line is usually much larger than that of ordinary objects, and previous methods struggle to produce bounding boxes of such proportions. Second, some non-Latin scripts, such as Chinese, contain no spaces between adjacent words; existing techniques can only detect words, and therefore fail on such text, since text without spaces provides no visual cue for separating words. Third, in large natural scene images text may appear at any orientation, yet the vast majority of existing techniques can only detect horizontal text. Text detection in natural scene images therefore remains one of the difficult problems in computer vision.

Summary of the Invention

The object of the present invention is to provide a multi-oriented text detection method for natural images based on linked text fields. The method detects text with high accuracy and speed, uses a simple model, is robust, copes with complex image backgrounds, and can also detect long non-Latin text.

To achieve the above object, the present invention approaches scene text detection from a new perspective and provides a multi-oriented text detection method for natural images based on linked text fields, comprising the following steps:

(1) Train the text-field and link detection network model, comprising the following sub-steps:

(1.1) Annotate the text content of all images in the training set at the word level; each label is the four corner coordinates of the word's initial rectangular bounding box, yielding the training data set;

(1.2) Define a text-field detection model that predicts text fields and links from the word labels. The network model consists of a cascaded convolutional neural network and convolutional predictors. Compute the text-field and link labels from the above training data set, design a loss function, and train the network by back-propagation, combined with online augmentation and online hard negative mining, to obtain the text-field detection model. This comprises the following sub-steps:

(1.2.1) Build the text-field detection convolutional neural network: the first convolutional units, which extract features, come from a pre-trained VGG-16 network (convolutional layer 1 through pooling layer 5); fully connected layers 6 and 7 are converted into convolutional layers 6 and 7, respectively. These are followed by additional convolutional layers that extract deeper features for detection, namely convolutional layers 8, 9, and 10, with convolutional layer 11 as the last layer. The last six of these convolutional layers output feature maps of different sizes, from which high-quality features at multiple scales are extracted; text fields and links are detected on these six feature maps. After each of these six layers, a filter of size 3×3 is added as a convolutional predictor that jointly detects text fields and links;

(1.2.2) Generate text-field bounding-box labels from the annotated word bounding boxes: for the original training image set Itr, denote by Itr′ the scaled training image set, with wI and hI the width and height of images in Itr′ (for example 384×384 or 512×512 pixels). The i-th image Itri′ is the model input, and all word bounding boxes annotated on Itri′ are denoted Wi=[Wi1,...,Wip], where Wij is the j-th word bounding box on the i-th image (the annotation may be at word or phrase level), j=1,...,p, and p is the total number of word bounding boxes on the image. The feature maps output by the last six convolutional layers form the set Itroi′=[Itroi1′,...,Itroi6′], where Itroil′ is the feature map output by the l-th of those layers, with width wl and height hl. The coordinate (x, y) on Itroil′ corresponds to a horizontal initial bounding box Bilq on Itri′ with center (xa, ya), satisfying the following formulas:

xa = (wI / wl)(x + 0.5), ya = (hI / hl)(y + 0.5)
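The mapping from a feature-map cell (x, y) to the center (xa, ya) of its initial bounding box on the input image can be sketched in Python (a minimal illustration assuming the standard mapping xa = (wI/wl)(x + 0.5), ya = (hI/hl)(y + 0.5); the function name is ours):

```python
def anchor_center(x, y, w_l, h_l, w_img, h_img):
    """Map feature-map cell (x, y) on a layer-l map of size (w_l, h_l)
    to the center of its initial (anchor) bounding box on the
    w_img x h_img input image."""
    x_a = (w_img / w_l) * (x + 0.5)
    y_a = (h_img / h_l) * (y + 0.5)
    return x_a, y_a

# A 512x512 input and a 64x64 feature map: cell (0, 0) maps to (4.0, 4.0).
print(anchor_center(0, 0, 64, 64, 512, 512))
```

Each cell thus owns the 8×8 pixel patch centered at (xa, ya), which is why the six feature maps of different sizes yield anchors at six different densities.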

The width and height of the initial bounding box Bilq are both set to a constant al, which controls the scale of the output text fields, l=1,...,6. The set of initial bounding boxes corresponding to the feature map Itroil′ of layer l is Bil=[Bil1,...,Bilm], q=1,...,m, where m is the number of initial bounding boxes on that feature map. An initial bounding box Bilq is labeled positive (label value 1) if its center is contained inside any annotated word bounding box Wij on Itr′ and its size al and the height h of that word bounding box satisfy

max(al / h, h / al) ≤ 1.5,

in which case it is matched to the word bounding box Wij whose height is closest to al. Otherwise, when Bilq satisfies neither of the above conditions for any word bounding box in Wi, it is labeled negative (label value 0). Text fields are generated on the initial bounding boxes and take the same label as their initial bounding box. The proportionality constant 1.5 is an empirical value;
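The labeling rule above can be sketched as follows (a minimal illustration; the tuple layout of the word boxes and the simplification of the center-inside test to axis-aligned boxes are ours):

```python
def label_anchor(center, size_a, word_boxes, ratio=1.5):
    """Label an initial bounding box positive (1) or negative (0).
    center: (x_a, y_a) of the anchor; size_a: the layer constant a_l;
    word_boxes: list of (x_min, y_min, x_max, y_max, height) tuples,
    simplified here to axis-aligned boxes. Returns (label, index of the
    matched word box whose height is closest to size_a, or None)."""
    x, y = center
    best = None
    for idx, (x0, y0, x1, y1, h) in enumerate(word_boxes):
        inside = x0 <= x <= x1 and y0 <= y <= y1
        if inside and max(size_a / h, h / size_a) <= ratio:
            if best is None or abs(h - size_a) < abs(word_boxes[best][4] - size_a):
                best = idx
    return (1, best) if best is not None else (0, None)
```

For a word box of height 20, an anchor of size 18 inside it is positive (ratio ≈ 1.11), while an anchor of size 50 is negative (ratio 2.5 > 1.5).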

(1.2.3) Generate text fields on the labeled initial bounding boxes produced in step (1.2.2) and compute the offsets of the positive text fields: a negative text-field bounding box s− is simply its negative initial bounding box B−. A positive text-field bounding box s+ is obtained from its positive initial bounding box B+ as follows: a) let θs be the angle between the annotated word bounding box W matched to B+ and the horizontal direction, and rotate W clockwise by θs around the center of B+; b) crop W, removing the parts extending beyond the left and right sides of B+; c) rotate the cropped word bounding box W′ counterclockwise by θs around the center of B+, obtaining the geometric parameters xs, ys, ws, hs, θs of the ground-truth text field s+; d) compute the offsets (Δxs, Δys, Δws, Δhs, Δθs) of s+ relative to B+ from the following formulas:

xs = al·Δxs + xa

ys = al·Δys + ya

ws = al·exp(Δws)

hs = al·exp(Δhs)

θs = Δθs

where xs, ys, ws, hs, θs are, respectively, the center abscissa, center ordinate, width, height, and angle to the horizontal of the text-field bounding box s+; xa, ya, wa, ha are the center abscissa, center ordinate, width, and height of the horizontal initial bounding box B+; and Δxs, Δys, Δws, Δhs, Δθs are, respectively, the offset of the center abscissa xs of s+ relative to the initial bounding box B+, the offset of the ordinate ys relative to the initial bounding box, the change in width ws, the change in height hs, and the offset of the angle θs;
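The formulas above can be read as a decoding step that recovers the text-field geometry from the offsets (a minimal sketch; the function name is ours):

```python
import math

def decode_segment(offsets, anchor_center, a_l):
    """Recover the text-field geometry (x_s, y_s, w_s, h_s, theta_s)
    from the offsets (dx, dy, dw, dh, dtheta), the anchor center
    (x_a, y_a), and the layer's anchor size a_l, per the formulas above."""
    dx, dy, dw, dh, dtheta = offsets
    x_a, y_a = anchor_center
    x_s = a_l * dx + x_a
    y_s = a_l * dy + y_a
    w_s = a_l * math.exp(dw)     # exp keeps width and height positive
    h_s = a_l * math.exp(dh)
    theta_s = dtheta
    return x_s, y_s, w_s, h_s, theta_s
```

Zero offsets decode to an axis-aligned square of size al at the anchor center, which is the reference shape the network regresses from.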

(1.2.4) Compute link labels for the text-field bounding boxes produced in step (1.2.3): text fields s are generated on the initial bounding boxes B, so the link labels between text fields equal the link labels between their corresponding initial bounding boxes B. For the feature map set Itroi′=[Itroi1′,...,Itroi6′]: if, within the initial bounding box set Bil of the same feature map Itroil′, two initial bounding boxes are both labeled positive and are matched to the same word, the within-layer link between them is labeled positive; otherwise it is labeled negative. If an initial bounding box in the set Bil of feature map Itroil′ and an initial bounding box in the set Bi(l-1) of feature map Itroi(l-1)′ are both labeled positive and matched to the same word bounding box Wij, the cross-layer link between them is labeled positive; otherwise it is labeled negative;

(1.2.5) Feed the scaled training image set Itr′ to the text-field detection model and predict the text-field output s: initialize the model weights and biases, set the learning rate to 10^-3 for the first 60,000 training iterations, and decay it to 10^-4 thereafter. For each of the last six convolutional layers, the coordinate (x, y) on the layer-l feature map Itroil′ corresponds to the initial bounding box Bilq on the input image Itri′ with center (xa, ya) and size al; the 3×3 convolutional predictor outputs the scores cs with which Bilq is classified as positive or negative, where cs is a two-dimensional vector whose entries are decimals between 0 and 1. It also predicts five numbers as the geometric offsets for the case in which Bilq is classified as a positive text field s+, namely the predicted offsets of the center abscissa and ordinate of the text-field bounding box s+ relative to the positive initial bounding box B+, the change in height, the change in width, and the angle offset;

(1.2.6) Predict the within-layer and cross-layer link outputs on the basis of the predicted text fields. For within-layer links, at coordinate point (x, y) on a feature map Itroil′, take the neighboring points (x′, y′) in the range x-1 ≤ x′ ≤ x+1, y-1 ≤ y′ ≤ y+1; mapping these eight points back to the input image Itri′ gives the eight within-layer neighbor text fields s(x′, y′, l) linked to the reference text field s(x, y, l) at (x, y), which form the set:

{ s(x′, y′, l) : x-1 ≤ x′ ≤ x+1, y-1 ≤ y′ ≤ y+1, (x′, y′) ≠ (x, y) }

The 3×3 convolutional predictor predicts the positive and negative scores cl1 of the links between s(x, y, l) and its eight within-layer neighbors; cl1 is a 16-dimensional vector (two scores per link), and the superscript w denotes a within-layer link;

For cross-layer links, a cross-layer link connects the text fields at corresponding points on the feature maps output by two consecutive convolutional layers. Because each convolutional layer halves the width and height of the feature map, the width wl and height hl of the layer-l output feature map Itroil′ are half the width wl-1 and height hl-1 of the layer-(l-1) feature map Itroi(l-1)′, and the initial-bounding-box scale al of Itroil′ is twice the scale al-1 of Itroi(l-1)′. For (x, y) on the layer-l output feature map Itroil′, take the four cross-layer neighbor points (x′, y′) on Itroi(l-1)′ in the range 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1; the initial bounding box that (x, y) on Itroil′ maps to on the input image Itri′ exactly coincides in spatial position with the four initial bounding boxes that the four cross-layer neighbor points on Itroi(l-1)′ map to. The four cross-layer neighbor text fields s(x′, y′, l-1) form the set:

{ s(x′, y′, l-1) : 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 }

The 3×3 convolutional predictor predicts the positive and negative scores cl2 of the cross-layer links between the layer-l reference text field s(x, y, l) and its four neighbor text fields on layer l-1; cl2 is an 8-dimensional vector containing the positive and negative scores of the links between s(x, y, l) and each of its four cross-layer neighbors, and the superscript c denotes a cross-layer link;

All within-layer links and all cross-layer links together form the link set Ns;
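The two neighborhoods defined above, eight within-layer neighbors and four cross-layer neighbors, can be enumerated as follows (a minimal sketch; the bounds checking against the feature-map size is our addition):

```python
def within_layer_neighbors(x, y, w_l, h_l):
    """The up-to-8 within-layer neighbor coordinates of (x, y) on a
    layer-l feature map of size (w_l, h_l)."""
    return [(xp, yp)
            for xp in range(x - 1, x + 2)
            for yp in range(y - 1, y + 2)
            if (xp, yp) != (x, y) and 0 <= xp < w_l and 0 <= yp < h_l]

def cross_layer_neighbors(x, y):
    """The 4 cross-layer neighbor coordinates of (x, y) on the
    layer-(l-1) feature map, which is twice as large in each dimension."""
    return [(xp, yp)
            for xp in range(2 * x, 2 * x + 2)
            for yp in range(2 * y, 2 * y + 2)]
```

An interior cell has exactly 8 within-layer neighbors; a corner cell has 3. Every cell has exactly 4 cross-layer neighbors, since the coarser map's cell covers a 2×2 block of the finer map.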

(1.2.7) Take the text-field labels and ground-truth positive text-field offsets obtained in steps (1.2.3) and (1.2.4), together with the link labels, as the reference output; take the text-field classes and scores predicted in step (1.2.5), the predicted text-field offsets, and the link scores predicted in step (1.2.6) as the predicted output. Design an objective loss function between the predicted output and the reference output, and train the text-field and link detection model continuously by back-propagation to minimize the losses of text-field classification, text-field offset regression, and link classification. The objective loss function is the weighted sum of three losses:

L(ys, cs, yl, cl, ŝ, s) = (1/ns)·Lconf(ys, cs) + λ1·(1/ns)·Lloc(ŝ, s) + λ2·(1/nl)·Lconf(yl, cl)

where ys is the vector of all text-field labels, cs the predicted text-field scores, yl the link labels, and cl the predicted link scores, composed of the within-layer link scores cl1 and the cross-layer scores cl2. If the i-th initial bounding box is labeled positive then ys(i) = 1, otherwise 0. Lconf(ys, cs) is the softmax loss on the predicted text-field scores cs; Lconf(yl, cl) is the softmax loss on the predicted link scores cl; Lloc(ŝ, s) is the smooth L1 regression loss between the predicted text-field geometric parameters ŝ and the ground truth s. ns is the number of positive initial bounding boxes, used to normalize the text-field classification and regression losses; nl is the total number of positive links, used to normalize the link classification loss; λ1 and λ2 are weight constants, set to 1 in practice;
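The weighted three-term objective can be sketched numerically (a minimal scalar illustration, assuming the three component losses have already been summed over their samples; the function name is ours):

```python
def weighted_detection_loss(l_seg_cls, l_loc, l_link_cls, n_s, n_l,
                            lambda1=1.0, lambda2=1.0):
    """Total objective: (1/n_s)*L_conf(seg) + lambda1*(1/n_s)*L_loc
    + lambda2*(1/n_l)*L_conf(link), where n_s is the number of positive
    initial bounding boxes and n_l the number of positive links."""
    return (l_seg_cls / n_s
            + lambda1 * l_loc / n_s
            + lambda2 * l_link_cls / n_l)
```

Normalizing the first two terms by ns and the third by nl keeps the three terms on comparable scales regardless of how many positives an image contains.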

(1.2.8) During the training of step (1.2.7), augment the training data Itr online and balance positive and negative samples with online hard negative mining. Before the training images Itr are scaled to a common size and loaded in batches, they are randomly cropped into patches such that each patch has at least a minimum Jaccard overlap o with the ground-truth bounding box of a text field. For multi-oriented text, augmentation is performed on the minimum enclosing rectangle of the oriented text bounding box. The overlap o of each sample is chosen at random from 0, 0.1, 0.3, 0.5, 0.7, and 0.9, and the patch size is between 0.1 and 1 times the original image size; training images are not flipped horizontally. In addition, because negative text-field and link samples make up the majority of the training samples, online hard negative mining is applied, separately for text fields and links, to keep the ratio of negative to positive samples at no more than 3:1.
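Online hard negative mining with the 3:1 cap can be sketched as follows (a minimal illustration operating on per-sample loss values; the function name is ours):

```python
def hard_negative_mine(pos_losses, neg_losses, ratio=3):
    """Keep all positive samples and only the hardest negatives,
    i.e. those with the largest loss, capping the number of negatives
    at ratio * number-of-positives."""
    k = min(len(neg_losses), ratio * max(len(pos_losses), 1))
    hardest = sorted(neg_losses, reverse=True)[:k]
    return pos_losses, hardest
```

Selecting negatives by descending loss focuses training on the background regions the network currently confuses with text, instead of the overwhelming number of easy negatives.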

(2) Use the trained convolutional neural network to detect text fields and links in the text image under test, comprising the following sub-steps:

(2.1) Detect text fields in the image under test; feature maps output by different convolutional layers predict text fields of different scales, while the feature map output by one convolutional layer predicts text fields of one scale. Scale the i-th test image Itsti in the test set Itst to a uniform size, which may be set manually to suit the images under test; denote the scaled image Itsti′. Feed Itsti′ to the text-field and link detection model trained in step (1.2) to obtain the set Itstoi′=[Itstoi1′,...,Itstoi6′] of feature maps output by the last six convolutional layers, where Itstoil′ is the feature map output by the l-th of those layers, l=1,...,6. At each coordinate (x, y) of each output feature map Itstoil′, the 3×3 convolutional predictor predicts the scores cs with which the corresponding initial bounding box Bilq is classified as a positive or negative text field, together with five numbers as the geometric offsets for the case in which it is predicted as a positive text field s+;

(2.2) Detect links between the text fields detected on all feature layers; links comprise within-layer links and cross-layer links. On the basis of the text fields predicted in (2.1), predict the within-layer and cross-layer links: for within-layer links, at coordinate point (x, y) on a feature map Itstoil′, the 3×3 convolutional predictor predicts the positive and negative scores cl1 of the within-layer links between s(x, y, l) and its eight neighbor text fields; for cross-layer links, the 3×3 convolutional predictor predicts the positive and negative cross-layer link scores cl2 between the layer-l reference text field s(x, y, l) and its four neighbor text fields on layer l-1; cl1 and cl2 form the predicted link score cl;

(2.3) Combine the detected text-field confidence scores and link confidence scores, where the text-field confidence scores comprise the positive/negative class scores and the offset predictions; the convolutional predictors output softmax-normalized scores for both the text fields and the links.

(3) Combine the text fields and links to obtain the output bounding boxes, comprising the following sub-steps:

(3.1) Using the normalized scores obtained in (2.3), filter the text fields and links output by the convolutional predictors, then build a link graph with the filtered text fields as nodes and the links as edges: the fixed number of text fields s and links Ns produced by feeding the test image to the detection model in step (2) are filtered by their scores, with separate thresholds α for text fields and β for links. The thresholds may be set manually for different data; in practice one may take α=0.9, β=0.7 for multi-oriented text images, α=0.9, β=0.5 for multilingual long-text images, and α=0.6, β=0.3 for horizontal text. Take the filtered text fields s′ as nodes and the filtered links Ns′ as edges, and build a graph from them;

(3.2) Perform a depth-first search on the graph to find the connected components; each component, denoted as a set B, contains the text fields joined together by links;
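Steps (3.1) and (3.2) amount to thresholding, building a graph, and extracting connected components by depth-first search, which can be sketched as (the function name and the plain edge-list representation are ours):

```python
def connected_components(n_nodes, edges):
    """Depth-first search over an undirected graph given as an edge
    list; returns the connected components as sets of node indices."""
    adj = {i: [] for i in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj[u])
        components.append(comp)
    return components
```

Each returned component is one set B of linked text fields, which step (3.3) then merges into a single word box.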

(3.3) Combine the text fields in each set S obtained by the depth-first search of step (3.2) into a complete word, through the following steps:

(3.3.1) Input: |S| is the number of segments in the set S, s^(i) is the i-th segment (i is a superscript); x_s^(i) and y_s^(i) are the center abscissa and ordinate of the i-th segment bounding box s^(i); w_s^(i) and h_s^(i) are its width and height; θ_s^(i) is the angle between the segment bounding box s^(i) and the horizontal direction;

(3.3.2) θ_b = (1/|S|)·Σ_i θ_s^(i), where θ_b is the offset angle of the output bounding box and θ_s^(i) is the offset angle of the i-th segment bounding box in the set; θ_b is obtained as the average offset angle of all segments in S;

(3.3.3) Find the intercept b of the straight line y = tan(θ_b)·x + b such that the sum of the distances from the center points (x_s^(i), y_s^(i)) of all segments in S to the line is minimized;

(3.3.4) Find the two endpoints (x_p, y_p) and (x_q, y_q) of the line segment, where p denotes the first endpoint and q the second; x_p and y_p are the abscissa and ordinate of the first endpoint, and x_q and y_q those of the second;

(3.3.5) x_b = (x_p + x_q)/2, y_b = (y_p + y_q)/2, where b denotes the output bounding box and x_b and y_b are the abscissa and ordinate of its center;

(3.3.6) w_b = ||(x_p, y_p) − (x_q, y_q)|| + (w_p + w_q)/2, where w_b is the width of the output bounding box, and w_p and w_q are the widths of the bounding boxes centered at point p and at point q respectively;

(3.3.7) h_b = (1/|S|)·Σ_i h_s^(i), where h_b is the height of the output bounding box and h_s^(i) is the height of the i-th segment bounding box in the set; h_b is obtained as the average height of all segments in S;

(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b); the output bounding box b is represented by its coordinate, size and angle parameters;

(3.3.9) Output the combined bounding box b.
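The combination procedure of steps (3.3.1) to (3.3.9) can be sketched as follows. This is an illustrative sketch, not the claimed implementation; all function and variable names are assumptions, and the intercept of step (3.3.3) is computed here by ordinary least squares on the segment centers, which is one reading of "minimizing the sum of the distances":

```python
import math

def combine_segments(segments):
    # segments: list of (x, y, w, h, theta) tuples for one connected component
    n = len(segments)
    # (3.3.2) average angle of all segments in the set
    theta_b = sum(s[4] for s in segments) / n
    # (3.3.3) intercept of the line y = tan(theta_b)*x + b fitted to the centers
    t = math.tan(theta_b)
    b = sum(s[1] - t * s[0] for s in segments) / n
    # (3.3.4) project each center onto the line; the extreme projections
    # give the two endpoints (x_p, y_p) and (x_q, y_q)
    def project(s):
        x_proj = (s[0] + t * (s[1] - b)) / (1 + t * t)
        return x_proj, t * x_proj + b
    proj = [project(s) for s in segments]
    (x_p, y_p), s_p = min(zip(proj, segments), key=lambda z: z[0][0])
    (x_q, y_q), s_q = max(zip(proj, segments), key=lambda z: z[0][0])
    # (3.3.5) box center is the midpoint of the two endpoints
    x_b, y_b = (x_p + x_q) / 2, (y_p + y_q) / 2
    # (3.3.6) width: endpoint distance plus half-widths of the end segments
    w_b = math.hypot(x_q - x_p, y_q - y_p) + (s_p[2] + s_q[2]) / 2
    # (3.3.7) height: average height of all segments
    h_b = sum(s[3] for s in segments) / n
    # (3.3.8) output box parameters
    return (x_b, y_b, w_b, h_b, theta_b)
```

For two horizontal segments the sketch reduces to the midpoint of their centers, the endpoint distance plus their half-widths, and their average height.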

Compared with the prior art, the above technical solutions conceived by the present invention provide the following technical effects:

(1) Multi-oriented text can be detected: text in natural scene images is often arbitrarily oriented or distorted. In the method of the present invention, a text region is described locally by segment bounding boxes, and a segment bounding box can be set to any orientation, so text of multiple orientations or distorted shapes can be covered.

(2) High flexibility: the method can also detect text lines of arbitrary length, because combining segments depends only on the predicted links; both words and text lines can therefore be detected;

(3) Strong robustness: the method describes text locally with segment bounding boxes; this local description can cope with complex natural image backgrounds and capture text features from the image;

(4) High efficiency: the segment detection model of the method is trained end to end and can process more than 20 images of size 512×512 per second, because the segments and links are obtained in a single forward pass of the fully convolutional CNN model, without offline scaling or rotation of the input image;

(5) Strong generality: some non-Latin scripts, such as Chinese, contain no spaces between adjacent words. Existing techniques can only detect words and are therefore inapplicable to such text, because text without spaces provides no visual cues for separating words. Besides Latin script, the present invention can also detect long non-Latin text, because the method does not rely on spaces to provide visual separation information.

Description of Drawings

Fig. 1 is a flow chart of the multi-oriented text detection in natural images based on linked text segments according to the present invention;

Fig. 2 is a schematic diagram of computing the parameters of the segment ground-truth labels in the present invention;

Fig. 3 is a schematic diagram of the composition of the convolutional predictor outputs in the present invention;

Fig. 4 is the network connection diagram of the segment-and-link detection model of the present invention;

Fig. 5 shows, for one embodiment of the present invention, the result of detecting segments and links and producing output bounding boxes for a text image to be detected, using the trained segment-and-link detection network model.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.

The technical terms of the present invention are first explained below:

Convolutional Neural Network (CNN): a neural network that can be used for tasks such as image classification and regression. Such a network usually consists of convolutional layers, downsampling layers and fully connected layers. The convolutional and downsampling layers extract image features, while the fully connected layers perform classification or regression. The network parameters include the convolution kernels and the weights and biases of the fully connected layers, and they can be learned from data by the back-propagation algorithm;

VGG16: the runner-up of ILSVRC 2014 was VGGNet, a classic convolutional neural network model containing 16 CONV/FC layers, with a very uniform and appealing architecture that performs only 3×3 convolutions and 2×2 pooling from beginning to end. Its pre-trained models are available for plug-and-play use with Caffe. It demonstrated that network depth is a key component of good performance.

Depth-First Search (DFS): an algorithm for traversing or searching a tree or graph. The nodes are traversed along the depth of the tree, exploring each branch as deeply as possible. When all edges at a node v have been explored, the search backtracks to the start node of the edge on which v was discovered. This process continues until all nodes reachable from the source node have been found. If undiscovered nodes remain, one of them is selected as a new source node and the process is repeated until every node has been visited. DFS is a classic algorithm in graph theory; it can produce a topological ordering of the target graph, which in turn conveniently solves many related graph problems, such as the maximum-path problem.
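Step (3.2) of the method uses exactly this idea to group the filtered segments into connected components. A minimal iterative sketch of DFS-based connected components, with illustrative names:

```python
def connected_components(nodes, edges):
    # Build an adjacency list from undirected edges (links between segments).
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for source in nodes:
        if source in seen:
            continue
        stack, comp = [source], []
        seen.add(source)
        while stack:                 # iterative depth-first search
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)      # one set B of linked segments
    return components
```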

As shown in Fig. 1, the method of the present invention for detecting multi-oriented text in natural scenes based on linked text segments comprises the following steps:

(1) Train the segment-and-link detection network model, comprising the following sub-steps:

(1.1) Annotate the text content of all text images in the training image set at word level, the label being the four corner coordinates of each word's initial rectangular bounding box, to obtain the training data set;

(1.2) Define a segment detection model that predicts output segments and links from the word annotations. The network model consists of a cascaded convolutional neural network and convolutional predictors. Compute the segment and link labels from the above training data set, design a loss function, and train the network by back-propagation, combined with online augmentation and online hard negative mining, to obtain the segment detection model, comprising the following sub-steps:

(1.2.1) Build the segment-detection convolutional neural network model: the first feature-extraction units come from the pre-trained VGG-16 network, namely convolutional layer 1 through pooling layer 5; fully connected layers 6 and 7 are converted into convolutional layers 6 and 7, followed by additional convolutional layers used to extract deeper features for detection, namely convolutional layers 8, 9 and 10, with convolutional layer 11 as the last layer. The last 6 of these convolutional layers output feature maps of different sizes, allowing high-quality features at multiple scales to be extracted; segments and links are detected on these six feature maps of different sizes. After each of these 6 convolutional layers, a filter of size 3×3 is added as a convolutional predictor to jointly detect segments and links;
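The per-location output of each 3×3 predictor can be tallied from the score dimensions given later in the text: 2 segment class scores c_s, 5 geometric offsets, a 16-dimensional within-layer link score c_l1, and an 8-dimensional cross-layer link score c_l2. A small sketch, assuming (as the cross-layer definition implies) that the first of the six prediction maps has no shallower map to link to; names are illustrative:

```python
def predictor_channels(layer_index):
    # Per-location outputs of one 3x3 convolutional predictor (a sketch).
    segment_cls = 2          # positive/negative segment scores c_s
    segment_geo = 5          # (dx, dy, dw, dh, dtheta) offsets
    within_links = 8 * 2     # c_l1: 8 within-layer neighbors, pos/neg each
    # c_l2: 4 cross-layer neighbors, pos/neg each; the first of the six
    # prediction maps has no layer l-1 to link to
    cross_links = 0 if layer_index == 1 else 4 * 2
    return segment_cls + segment_geo + within_links + cross_links
```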

(1.2.2) Generate segment bounding-box labels from the annotated word bounding boxes: for the original training image set Itr, denote the scaled training image set by Itr′, with w_I and h_I the width and height of Itr′, which can be 384×384 or 512×512 pixels. The i-th image Itr_i′ is the model input, and all word bounding boxes annotated on Itr_i′ are denoted W_i = [W_i1, ..., W_ip], where W_ij is the j-th word bounding box on the i-th image (a word bounding box may be at word level or at line level), j = 1, ..., p, and p is the total number of word bounding boxes on the i-th image. Denote the feature maps output by the last 6 convolutional layers by the set Itro_i′ = [Itro_i1′, ..., Itro_i6′], where Itro_il′ is the feature map output by the l-th of those layers, with width w_l and height h_l. A coordinate (x, y) on Itro_il′ corresponds to a horizontal initial bounding box B_ilq on Itr_i′ centered at (x_a, y_a), satisfying the following formulas: x_a = (w_I / w_l)·(x + 0.5), y_a = (h_I / h_l)·(y + 0.5).

The width and height of the initial bounding box B_ilq are both set to a constant a_l that controls the scale of the output segments, l = 1, ..., 6. Denote the set of initial bounding boxes corresponding to the feature map Itro_il′ output by the l-th layer as B_il = [B_il1, ..., B_ilm], q = 1, ..., m, where m is the number of initial bounding boxes on that feature map. An initial bounding box B_ilq is labelled positive (label value 1) and matched to the word bounding box W_ij of closest height as long as its center lies inside some annotated word bounding box W_ij on Itr′ and its size a_l and the height h of that word bounding box satisfy max(a_l/h, h/a_l) ≤ 1.5; otherwise, when B_ilq satisfies these two conditions for no word bounding box in W_i, it is labelled negative (label value 0). Segments are generated on the initial bounding boxes and share their label class; the proportionality constant 1.5 is an empirical value;
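The positive/negative labelling rule for an initial bounding box can be sketched as follows for an axis-aligned word annotation; rotated word boxes, which the patent handles via the rotation and cropping of step (1.2.3), are omitted for brevity, and the names are illustrative assumptions:

```python
def anchor_matches_word(cx, cy, a_l, word_box):
    # word_box: (xmin, ymin, xmax, ymax) of an axis-aligned word annotation.
    # Rule from the text: the anchor center (cx, cy) must lie inside the word
    # box, and max(a_l / h, h / a_l) <= 1.5, where h is the word height.
    xmin, ymin, xmax, ymax = word_box
    center_inside = xmin <= cx <= xmax and ymin <= cy <= ymax
    h = ymax - ymin
    size_ok = max(a_l / h, h / a_l) <= 1.5
    return center_inside and size_ok
```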

(1.2.3) Generate segments on the labelled initial bounding boxes produced in step (1.2.2) and compute the offsets of positive segments: a negative segment bounding box s− is its negative initial bounding box B−; a positive segment bounding box s+ is obtained from its positive initial bounding box B+ through the following steps: a) let θ_s be the angle between the horizontal direction and the annotated word bounding box W matched to B+, and rotate W clockwise by θ_s about the center of B+; b) crop W, removing the parts extending beyond the left and right sides of B+; c) rotate the cropped word bounding box W′ counterclockwise by θ_s about the center of B+, obtaining the geometric parameters x_s, y_s, w_s, h_s, θ_s of the ground-truth label of segment s+; d) compute the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) of s+ relative to B+ from the following formulas:

x_s = a_l·Δx_s + x_a

y_s = a_l·Δy_s + y_a

w_s = a_l·exp(Δw_s)

h_s = a_l·exp(Δh_s)

θ_s = Δθ_s

where x_s, y_s, w_s, h_s and θ_s are respectively the center abscissa, center ordinate, width, height and horizontal angle of the segment bounding box s+; x_a, y_a, w_a and h_a are respectively the center abscissa, center ordinate, width and height of the horizontal initial bounding box B+; and Δx_s, Δy_s, Δw_s, Δh_s and Δθ_s are respectively the offsets of the center abscissa x_s and ordinate y_s of s+ relative to B+, the offset changes of its width w_s and height h_s, and the offset of its angle θ_s;
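The five formulas above define a one-to-one mapping between segment geometry and offsets. A sketch of both directions, with illustrative names: decoding turns predicted offsets into segment geometry at inference time, and encoding builds the regression targets in training:

```python
import math

def decode_segment(offsets, x_a, y_a, a_l):
    # Apply the patent's formulas: anchor (x_a, y_a) of size a_l plus offsets
    # -> segment geometry (x_s, y_s, w_s, h_s, theta_s).
    dx, dy, dw, dh, dtheta = offsets
    return (a_l * dx + x_a,
            a_l * dy + y_a,
            a_l * math.exp(dw),
            a_l * math.exp(dh),
            dtheta)

def encode_segment(segment, x_a, y_a, a_l):
    # The inverse mapping, used to compute ground-truth offsets.
    x_s, y_s, w_s, h_s, theta_s = segment
    return ((x_s - x_a) / a_l,
            (y_s - y_a) / a_l,
            math.log(w_s / a_l),
            math.log(h_s / a_l),
            theta_s)
```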

(1.2.4) Compute link labels for the segment bounding boxes produced in step (1.2.3): segments s are generated on the initial bounding boxes B, so the link labels between segments equal the link labels between their corresponding initial bounding boxes B. For the feature map set Itro_i′ = [Itro_i1′, ..., Itro_i6′]: if, within the initial bounding box set B_il of one feature map Itro_il′, two initial bounding boxes are both labelled positive and matched to the same word, then the within-layer link between them is labelled positive, otherwise negative; if an initial bounding box in the set B_il corresponding to Itro_il′ and an initial bounding box in the set B_i(l−1) corresponding to Itro_i(l−1)′ are both labelled positive and matched to the same word bounding box W_ij, then the cross-layer link between them is labelled positive, otherwise negative;

(1.2.5) Take the scaled training image set Itr′ as the input of the segment detection model and predict the segment outputs s: initialize the model weights and biases; the learning rate is set to 10⁻³ for the first 60,000 training iterations and then decays to 10⁻⁴. For the last 6 convolutional layers, at coordinate (x, y) on the feature map Itro_il′ of the l-th layer, (x, y) corresponds on the input image Itr_i′ to the initial bounding box B_ilq centered at (x_a, y_a) with size a_l; the 3×3 convolutional predictor predicts the scores c_s of B_ilq being classified as positive and negative, where c_s is a two-dimensional vector taking decimal values between 0 and 1. It also predicts 5 numbers as the geometric offsets when the box is classified as a positive segment s+, namely the predicted offsets of the center abscissa and ordinate of the segment bounding box s+ relative to the positive initial bounding box B+, the offset changes of its height and width, and its angle offset;

(1.2.6) Predict the within-layer and cross-layer link outputs on top of the predicted segments: for within-layer links, at a coordinate point (x, y) on a feature map Itro_il′, take the neighboring points (x′, y′) in the range x−1 ≤ x′ ≤ x+1, y−1 ≤ y′ ≤ y+1. Mapping these 8 points onto the input image Itr_i′ yields the 8 within-layer neighboring segments s(x′, y′, l) linked to the reference segment s(x, y, l) corresponding to (x, y); these 8 within-layer neighbors can be written as the set N^w_(x,y,l) = { s(x′, y′, l) : x−1 ≤ x′ ≤ x+1, y−1 ≤ y′ ≤ y+1, (x′, y′) ≠ (x, y) }.

The 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the links between s(x, y, l) and the within-layer neighbor set; c_l1 is a 16-dimensional vector, and the superscript w denotes a within-layer link;

For cross-layer links, a cross-layer link joins the segments corresponding to two points on the feature maps output by two consecutive convolutional layers. Since each convolutional layer halves the width and height of the feature map, the width w_l and height h_l of the l-th output feature map Itro_il′ are half the width w_{l−1} and height h_{l−1} of the (l−1)-th feature map Itro_i(l−1)′, and the initial-bounding-box scale a_l of Itro_il′ is twice the scale a_{l−1} of Itro_i(l−1)′. For (x, y) on the l-th output feature map Itro_il′, take the 4 cross-layer neighboring points (x′, y′) in the range 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 on the feature map Itro_i(l−1)′; the initial bounding box on the input image Itr_i′ corresponding to (x, y) on Itro_il′ exactly coincides spatially with the 4 initial bounding boxes on Itr_i′ corresponding to the 4 cross-layer neighboring points on Itro_i(l−1)′. The 4 cross-layer neighboring segments s(x′, y′, l−1) can be written as the set N^c_(x,y,l) = { s(x′, y′, l−1) : 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 }.

The 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference segment s(x, y, l) on layer l and the neighboring segment set on layer l−1; c_l2 is an 8-dimensional vector:

where the entries are the predictor's positive and negative scores for the links between s(x, y, l) and each of its 4 neighboring segments, and the superscript c denotes a cross-layer link;

All the within-layer links and all the cross-layer links together constitute the link set N_s;
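The neighborhood definitions of step (1.2.6) can be enumerated directly. A sketch with illustrative names, returning grid coordinates of the 8 within-layer and 4 cross-layer neighbors:

```python
def within_layer_neighbors(x, y):
    # The 8 points (x', y') with x-1 <= x' <= x+1 and y-1 <= y' <= y+1,
    # excluding the reference point (x, y) itself.
    return [(xp, yp)
            for xp in (x - 1, x, x + 1)
            for yp in (y - 1, y, y + 1)
            if (xp, yp) != (x, y)]

def cross_layer_neighbors(x, y):
    # The 4 points on the previous (twice-as-large) feature map with
    # 2x <= x' <= 2x+1 and 2y <= y' <= 2y+1.
    return [(xp, yp)
            for xp in (2 * x, 2 * x + 1)
            for yp in (2 * y, 2 * y + 1)]
```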

(1.2.7) Take the segment labels, positive-segment ground-truth offsets and link labels obtained in steps (1.2.3) and (1.2.4) as the reference outputs, and the segment classes and scores predicted in step (1.2.5), the predicted segment offsets, and the link scores predicted in step (1.2.6) as the predicted outputs. Design an objective loss function between the predicted outputs and the reference outputs, and train the segment-and-link detection model iteratively by back-propagation to minimize the losses of segment classification, segment offset regression and link classification. The objective loss function designed for the segment-and-link detection model is a weighted sum of three losses:

L(y_s, c_s, y_l, c_l, s, ŝ) = (1/n_s)·L_conf(y_s, c_s) + λ_1·(1/n_s)·L_loc(s, ŝ) + λ_2·(1/n_l)·L_conf(y_l, c_l), where y_s are the labels of all segments, c_s are the predicted segment scores, y_l are the link labels, and c_l are the predicted link scores, composed of the within-layer link scores c_l1 and the cross-layer scores c_l2. If the i-th initial bounding box is labelled positive, then y_s(i) = 1, otherwise 0. L_conf(y_s, c_s) is the softmax loss of the predicted segment scores c_s, L_conf(y_l, c_l) is the softmax loss of the predicted link scores c_l, and L_loc(s, ŝ) is the smooth L1 regression loss between the predicted segment geometric parameters s and the ground-truth label ŝ. n_s is the number of positive initial bounding boxes, used to normalize the segment classification and regression losses; n_l is the total number of positive links, used to normalize the link classification loss; λ_1 and λ_2 are weight constants, set to 1 in practice.
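A sketch of this loss combination, with the smooth L1 term written out and the two softmax confidence losses passed in as precomputed scalars; names are illustrative, and λ_1 = λ_2 = 1 as in the text:

```python
def smooth_l1(pred, target):
    # Smooth L1 regression loss, summed over the 5 geometry offsets:
    # 0.5 * d^2 for |d| < 1, |d| - 0.5 otherwise.
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def total_loss(seg_conf_loss, loc_loss, link_conf_loss, n_s, n_l,
               lambda1=1.0, lambda2=1.0):
    # Weighted sum of the three losses: segment classification and offset
    # regression normalized by n_s, link classification normalized by n_l.
    return (seg_conf_loss / n_s
            + lambda1 * loc_loss / n_s
            + lambda2 * link_conf_loss / n_l)
```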

(1.2.8) During the training of step (1.2.7), apply online augmentation to the training data Itr and use an online hard negative mining strategy to balance positive and negative samples. Before the training images Itr are scaled to the same size and loaded in batches, they are randomly cropped into image patches, each having a minimum Jaccard overlap o with the ground-truth bounding boxes of the segments. For multi-oriented text, augmentation is performed on the minimum enclosing rectangle of the oriented text bounding box. The overlap o of each sample is chosen at random from 0, 0.1, 0.3, 0.5, 0.7 and 0.9, and the patch size is between 0.1 and 1 times the original image size; training images are not flipped horizontally. In addition, since negative segment and link samples make up the majority of the training samples, online hard negative mining is applied, separately for segments and for links, to keep the ratio of negative to positive samples at no more than 3:1.
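The hard negative mining strategy (keep all positives, keep only the hardest negatives up to a 3:1 ratio) can be sketched as follows, assuming each negative sample carries its predicted positive-class score as its hardness; all names are illustrative:

```python
def hard_negative_mining(pos_indices, neg_scores, ratio=3):
    # Keep all positive samples; keep at most ratio * len(positives) negative
    # samples, choosing the "hardest" ones, i.e. those the model currently
    # scores highest as positive.
    k = min(len(neg_scores), ratio * len(pos_indices))
    hardest = sorted(range(len(neg_scores)),
                     key=lambda i: neg_scores[i], reverse=True)[:k]
    return list(pos_indices), hardest
```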

(2) Detect segments and links in the text image to be detected using the trained convolutional neural network above, comprising the following sub-steps:

(2.1) Detect segments in the text image to be detected: feature maps output by different convolutional layers predict segments of different scales, and feature maps output by the same convolutional layer predict segments of the same scale. The i-th text image Itst_i in the test set Itst is scaled to a uniform size, which can be set manually according to the images to be detected; denote the scaled image by Itst_i′. Feed Itst_i′ into the segment-and-link detection model trained in step (1.2) to obtain the set of feature maps output by the last 6 convolutional layers, Itsto_i′ = [Itsto_i1′, ..., Itsto_i6′], where Itsto_il′ is the feature map output by the l-th of those layers, l = 1, ..., 6. At each coordinate (x, y) on each output feature map Itsto_il′, the 3×3 convolutional predictor predicts the scores c_s of the initial bounding box B_ilq corresponding to (x, y) being a positive or negative segment, together with 5 numbers as the geometric offsets when it is predicted to be a positive segment s+;

(2.2) Detect links among the segments detected on all feature layers of the text image, the links comprising within-layer links and cross-layer links: within-layer and cross-layer links are predicted on top of the segments predicted in (2.1). For within-layer links, at a coordinate (x, y) on a feature map Itsto_il′, the 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the within-layer links between s(x, y, l) and its 8 neighboring segments; for cross-layer links, the 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference segment s(x, y, l) on layer l and its 4 neighboring segments on layer l−1; c_l1 and c_l2 constitute the predicted link score c_l;

(2.3) Combine the detected segment confidence scores and link confidence scores, where a segment's confidence consists of its positive/negative class scores and its offset scores, and the convolutional predictors output softmax-normalized scores: within-layer and cross-layer links are predicted on top of the segments predicted in (2.1). For within-layer links, at a coordinate (x, y) on a feature map Itsto_il′, the 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the within-layer links between s(x, y, l) and its 8 neighboring segments; for cross-layer links, the 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference segment s(x, y, l) on layer l and its 4 neighboring segments on layer l−1; c_l1 and c_l2 constitute the predicted link score c_l.

(3) Combine the segments and links to obtain the output bounding boxes, comprising the following sub-steps:

(3.1) Using the normalized scores obtained in (2.3), filter the segments and links output by the convolutional predictors, then build a connection graph with the filtered segments as nodes and the filtered links as edges: the fixed number of segments s and links N_s produced by feeding the image to be detected into the segment detection model in step (2) are filtered by their scores, with separate thresholds α for segments and β for links. The thresholds can be set manually for different data; in practice α = 0.9 and β = 0.7 can be used for multi-oriented text detection, α = 0.9 and β = 0.5 for multilingual long-text detection, and α = 0.6 and β = 0.3 for horizontal text detection. The filtered segments s′ are taken as nodes and the filtered links N_s′ as edges, and a graph is built from them;
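The filtering and graph construction of step (3.1) can be sketched as follows, using the multi-oriented defaults α = 0.9 and β = 0.7; in this sketch (names are illustrative) a link becomes an edge only if its score passes β and both of its endpoint segments survive the segment filter:

```python
def filter_and_build_graph(seg_scores, links, link_scores,
                           alpha=0.9, beta=0.7):
    # seg_scores: per-segment positive-class scores; links: (u, v) index
    # pairs; link_scores: per-link positive-class scores.
    nodes = {i for i, sc in enumerate(seg_scores) if sc >= alpha}
    edges = [(u, v) for (u, v), sc in zip(links, link_scores)
             if sc >= beta and u in nodes and v in nodes]
    return nodes, edges
```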

(3.2) Perform a depth-first search on the graph to find connected components; each component is denoted as a set S containing the text fields joined by connections;
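The depth-first search of (3.2) is a standard connected-components pass; a minimal sketch (names and data layout are illustrative, not from the patent):

```python
def connected_components(nodes, edges):
    """Find connected components (the sets S) via iterative depth-first search.

    nodes: iterable of node ids (filtered text fields);
    edges: iterable of (u, v) pairs (filtered connections, undirected).
    """
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                 # explicit stack = DFS without recursion
            n = stack.pop()
            comp.append(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        components.append(comp)
    return components
```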

(3.3) The text field set S obtained by the depth-first search of step (3.2) is combined into a complete word through the following steps:

(3.3.1) Input: |S| is the number of text fields in set S; s^(i) is the i-th text field, with i a superscript; x_s^(i) and y_s^(i) are the center abscissa and ordinate of the i-th text field bounding box s^(i); w_s^(i) and h_s^(i) are its width and height; θ_s^(i) is the angle between s^(i) and the horizontal direction;

(3.3.2) θ_b := (1/|S|) Σ_i θ_s^(i), where θ_b is the offset angle of the output bounding box and θ_s^(i) is the offset angle of the i-th text field bounding box in the set; θ_b is the average offset angle of all text fields in S;

(3.3.3) Find the intercept b of the straight line y = tan(θ_b)·x + b that minimizes the sum of the distances from the center points (x_s^(i), y_s^(i)) of all text fields in S to the line;

(3.3.4) Find the two endpoints (x_p, y_p) and (x_q, y_q) of the line segment, where p denotes the first endpoint and q the second; x_p, y_p are the abscissa and ordinate of the first endpoint, and x_q, y_q those of the second;

(3.3.5) x_b := (x_p + x_q)/2, y_b := (y_p + y_q)/2, where b denotes the output bounding box and x_b, y_b are the abscissa and ordinate of its center;

(3.3.6) w_b := √((x_p − x_q)² + (y_p − y_q)²) + (w_p + w_q)/2, where w_b is the width of the output bounding box, and w_p and w_q are the widths of the bounding boxes centered at points p and q, respectively;

(3.3.7) h_b := (1/|S|) Σ_i h_s^(i), where h_b is the height of the output bounding box and h_s^(i) is the height of the i-th text field bounding box in the set; h_b is the average height of all text fields in S;

(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b): the output bounding box b is represented by its coordinate, size, and angle parameters;

(3.3.9) Output the combined bounding box b.
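Steps (3.3.1)-(3.3.9) can be sketched as one function. This is an illustrative sketch, not the patent's implementation; in particular, locating the endpoints by projecting the segment centers onto the fitted line and taking the extremes is an assumption about how (3.3.4) is realized:

```python
import math

def combine_segments(segments):
    """Combine one connected component of text fields into a word bounding box.

    Each segment is a dict with keys x, y (center), w, h (size), theta (angle).
    Returns (x_b, y_b, w_b, h_b, theta_b) as in step (3.3.8).
    """
    n = len(segments)
    # (3.3.2) average offset angle over the set S
    theta_b = sum(s["theta"] for s in segments) / n
    # (3.3.3) intercept of y = tan(theta_b)*x + b minimizing the summed
    # vertical offsets of the centers (least-squares solution)
    t = math.tan(theta_b)
    b = sum(s["y"] - t * s["x"] for s in segments) / n

    def project(s):
        # foot of the perpendicular from the center (x, y) onto the line
        x0 = (s["x"] + t * (s["y"] - b)) / (1 + t * t)
        return (x0, t * x0 + b)

    # (3.3.4) endpoints: extreme projections along the line
    pts = sorted((project(s), s["w"]) for s in segments)
    (xp, yp), wp = pts[0]
    (xq, yq), wq = pts[-1]
    # (3.3.5) box center is the midpoint of the endpoints
    xb, yb = (xp + xq) / 2.0, (yp + yq) / 2.0
    # (3.3.6) width: endpoint distance plus half the two end widths
    wb = math.hypot(xq - xp, yq - yp) + (wp + wq) / 2.0
    # (3.3.7) height: average segment height
    hb = sum(s["h"] for s in segments) / n
    # (3.3.8)-(3.3.9) output the combined box
    return (xb, yb, wb, hb, theta_b)
```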

Claims (7)

1. A method for detecting multi-oriented text in natural images based on linked text fields, characterized in that the method comprises the following steps:
(1) Train the text field connection detection network model, comprising the sub-steps:
(1.1) Annotate the text content of all text images in the training image set at the word (term) level; each label is the four corner coordinates of the word's rectangular initial bounding box, yielding the training data set;
(1.2) Define a text field connection detection network model that predicts output text fields and connections from the word labels; the model consists of a cascaded convolutional neural network and convolutional predictors; compute the text field and connection labels from the training data set, design a loss function, and, combining online augmentation with online hard negative mining, train the network by back-propagation to obtain the text field connection detection network model;
(2) Use the trained text field connection detection network model to detect text fields and connections on the text image to be detected, comprising the sub-steps:
(2.1) Detect text fields on the image: feature maps output by different convolutional layers predict text fields of different scales, and feature maps output by the same convolutional layer predict text fields of the same scale;
(2.2) Detect connections among the text fields found on all feature layers, the connections comprising within-layer connections and cross-layer connections;
(2.3) Combine the detected text field confidence scores and connection confidence scores, where a text field confidence score comprises the field's positive/negative class score and offset score, and output softmax-normalized scores with the convolutional predictor;
(3) Combine text fields and connections to obtain the output bounding boxes, comprising the sub-steps:
(3.1) Filter the text fields and connections output by the convolutional predictor according to the normalized scores obtained in (2.3), and construct a connection graph with the filtered text fields as nodes and the filtered connections as edges;
(3.2) Perform a depth-first search on the graph to find connected components; each component is denoted as a set S containing the text fields joined by connections;
(3.3) Combine the text fields in a set into a complete word, compute the complete word's bounding box, and output it.
2. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1, characterized in that step (1.2) is specifically:
(1.2.1) Construct the text field detection convolutional neural network model: the feature-extracting convolutional units of the first layers come from a pre-trained VGG-16 network (convolutional layer 1 through pooling layer 5); fully connected layers 6 and 7 are converted into convolutional layers 6 and 7; after them, additional convolutional layers 8, 9, and 10 and a final convolutional layer 11 are appended to extract deeper features for detection; the last 6 of these convolutional layers output feature maps of different sizes, so that high-quality features at multiple scales can be extracted, and text fields and connections are detected on these six feature maps; after each of the 6 layers, a filter of size 3×3 is added as a convolutional predictor to jointly detect text fields and connections;
(1.2.2) Generate text field bounding box labels from the annotated word bounding boxes: for the original training image set Itr, denote the scaled training image set Itr', with width w_I and height h_I; taking the i-th image Itr_i' as model input, all word bounding boxes annotated on Itr_i' are written W_i = [W_i1, ..., W_ip], where W_ij is the j-th word bounding box (word level or term level) on the i-th image, j = 1, ..., p, and p is the total number of word bounding boxes on Itr_i'; the feature maps output by the last 6 convolutional layers form the set Itro_i' = [Itro_i1', ..., Itro_i6'], where Itro_il' is the feature map output by layer l of the last 6 convolutional layers, with width w_l and height h_l; a coordinate (x, y) on Itro_il' corresponds on Itr_i' to a horizontal initial bounding box B_ilq centered at (x_a, y_a), satisfying:
x_a = (w_I / w_l)(x + 0.5), y_a = (h_I / h_l)(y + 0.5)
The width and height of the initial bounding box B_ilq are both set to a constant a_l that controls the scale of the output text fields, l = 1, ..., 6; the set of initial bounding boxes corresponding to the layer-l feature map Itro_il' is B_il = [B_il1, ..., B_ilm], q = 1, ..., m, where m is the number of initial bounding boxes on that feature map; if the center of an initial bounding box B_ilq lies inside any annotated word bounding box W_ij on Itr', and the size a_l of B_ilq and the height h of that word bounding box satisfy max(a_l/h, h/a_l) ≤ 1.5, then B_ilq is marked positive with label 1 and matched to the word bounding box W_ij whose height is closest; otherwise, when B_ilq fails these two conditions for all word bounding boxes W_i, it is marked negative with label 0; text fields are generated on the initial bounding boxes and share the same label class;
(1.2.3) Generate text fields on the labeled initial bounding boxes produced in step (1.2.2) and compute the positive text field offsets: a negative text field bounding box s− is its negative initial bounding box B−; a positive text field bounding box s+ is obtained from its positive initial bounding box B+ as follows: a) let θ_s be the angle between the matched annotated word bounding box W and the horizontal direction, and rotate W clockwise by θ_s around the center of B+; b) crop W, removing the parts beyond the left and right sides of B+; c) rotate the cropped word bounding box W' counterclockwise by θ_s around the center of B+, obtaining the ground-truth geometric parameters x_s, y_s, w_s, h_s, θ_s of the text field s+; d) compute the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) of s+ relative to B+ from:
x_s = a_l·Δx_s + x_a
y_s = a_l·Δy_s + y_a
w_s = a_l·exp(Δw_s)
h_s = a_l·exp(Δh_s)
θ_s = Δθ_s
where x_s, y_s, w_s, h_s, θ_s are the center abscissa, center ordinate, width, height, and angle to the horizontal of the text field bounding box s+; x_a, y_a, w_a, h_a are the center abscissa, center ordinate, width, and height of the horizontal initial bounding box B+; and Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s are, respectively, the offsets of the center abscissa x_s and ordinate y_s of s+ relative to B+, the change in width w_s, the change in height h_s, and the offset of the angle θ_s;
(1.2.4) Compute connection labels for the text field bounding boxes produced in step (1.2.3): text fields s are generated on the initial bounding boxes B, so the connection labels between text fields equal the connection labels between their corresponding initial bounding boxes; for the feature map set Itro_i' = [Itro_i1', ..., Itro_i6'], if two initial bounding boxes in the set B_il of the same feature map Itro_il' are both labeled positive and matched to the same word, the within-layer connection between them is labeled positive, otherwise negative; if an initial bounding box in B_il (corresponding to Itro_il') and an initial bounding box in B_i(l−1) (corresponding to Itro_i(l−1)') are both labeled positive and matched to the same word bounding box W_ij, the cross-layer connection between them is labeled positive, otherwise negative;
(1.2.5) Take the scaled training image set Itr' as input to the text field detection model and predict the text field outputs s: initialize the model weights and biases; the learning rate is set to 10^−3 for the first 60,000 training iterations and then decays to 10^−4; for the last 6 convolutional layers, at coordinate (x, y) on the layer-l feature map Itro_il', where (x, y) corresponds on the input image Itr_i' to the initial bounding box B_ilq centered at (x_a, y_a) with size a_l, the 3×3 convolutional predictor predicts the scores c_s with which B_ilq is assigned to the positive and negative classes, where c_s is a two-dimensional vector of values between 0 and 1; it also predicts 5 numbers (Δx̂_s, Δŷ_s, Δŵ_s, Δĥ_s, Δθ̂_s) as the geometric offsets when B_ilq is assigned to the positive text field s+, namely the predicted offsets of the center abscissa and ordinate of s+ relative to the positive initial bounding box B+, and the changes in height, width, and angle;
(1.2.6) Predict the within-layer and cross-layer connection outputs on top of the predicted text fields: for within-layer connections, at coordinate (x, y) on a feature map Itro_il', take the neighboring points (x', y') in the range x−1 ≤ x' ≤ x+1, y−1 ≤ y' ≤ y+1; mapping these 8 points back to the input image Itr_i' gives the within-layer neighboring text fields s(x', y', l) connected to the base text field s(x, y, l); the 8 within-layer neighbors form a set, and the 3×3 convolutional predictor predicts the positive/negative scores c_l1 of the connections between s(x, y, l) and this neighbor set, where c_l1 is a 16-dimensional vector and the superscript w denotes a within-layer connection;
for cross-layer connections, a cross-layer connection joins the text fields at corresponding points on the feature maps output by two consecutive convolutional layers; since each convolutional layer halves the width and height of the feature map, the width w_l and height h_l of the layer-l feature map Itro_il' are half the width w_l−1 and height h_l−1 of the layer-(l−1) feature map Itro_i(l−1)', while the initial bounding box scale a_l of Itro_il' is twice the scale a_l−1 of Itro_i(l−1)'; for (x, y) on the layer-l output feature map Itro_il', take the 4 cross-layer neighboring points (x', y') on Itro_i(l−1)' in the range 2x ≤ x' ≤ 2x+1, 2y ≤ y' ≤ 2y+1; the initial bounding box that (x, y) maps to on Itr_i' exactly coincides spatially with the 4 initial bounding boxes that these 4 points map to, and the 4 cross-layer neighboring text fields s(x', y', l−1) form a set; the 3×3 convolutional predictor predicts the positive/negative scores c_l2 of the cross-layer connections between the layer-l base text field s(x, y, l) and this neighbor set, where c_l2 is an 8-dimensional vector containing the positive and negative scores of the connections between s(x, y, l) and all 4 of its neighbors, and the superscript c denotes a cross-layer connection; all within-layer connections and all cross-layer connections form the connection set N_s;
(1.2.7) With the text field labels, positive text field ground-truth offsets, and connection labels from steps (1.2.3) and (1.2.4) as the output reference, and the predicted text field classes and scores, predicted text field offsets from step (1.2.5), and predicted connection scores from step (1.2.6) as the predicted output, design the objective loss function between the predicted output and the output reference, and train the text field connection detection model continuously by back-propagation to minimize the losses of text field classification, text field offset regression, and connection classification; the objective loss function is the weighted sum of three losses:
L(y_s, c_s, y_l, c_l, ŝ, s) = (1/n_s)·L_conf(y_s, c_s) + λ1·(1/n_s)·L_loc(ŝ, s) + λ2·(1/n_l)·L_conf(y_l, c_l)
where y_s is the label of all text fields, c_s is the predicted text field score, y_l is the connection label, and c_l is the predicted connection score, composed of the within-layer score c_l1 and the cross-layer score c_l2; if the i-th initial bounding box is marked positive, then y_s(i) = 1, otherwise 0; L_conf(y_s, c_s) is the softmax loss of the predicted text field score c_s, L_conf(y_l, c_l) is the softmax loss of the predicted connection score c_l, and L_loc(ŝ, s) is the smooth-L1 regression loss between the predicted text field geometric parameters ŝ and the ground truth s; n_s is the number of positive initial bounding boxes, used to normalize the text field classification and regression losses; n_l is the total number of positive connections, used to normalize the connection classification loss; λ1 and λ2 are weight constants;
(1.2.8) During the training of step (1.2.7), augment the training data Itr online and use an online hard negative mining strategy to balance positive and negative samples: before the training images Itr are scaled to the same size and loaded in batches, they are randomly cropped into image patches such that each patch has a minimum Jaccard overlap coefficient o with the ground-truth bounding boxes of the text fields; for multi-oriented text, augmentation is performed on the minimum enclosing rectangle of the multi-oriented text bounding box; the overlap coefficient o of each sample is chosen at random from 0, 0.1, 0.3, 0.5, 0.7, and 0.9, and the patch size is between 0.1 and 1 times the original image size; training images are not flipped horizontally; furthermore, since negative text field and connection samples make up most of the training samples, online hard negative mining is applied separately to text fields and connections, keeping the ratio of negative to positive samples at most 3:1.
3. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (2.1) is specifically:
Scale the i-th text image Itst_i of the image set Itst to be detected to a uniform size, which may be set manually according to the images, and denote the scaled image Itst_i'; input Itst_i' into the text field connection detection model trained in step (1.2) to obtain the set Itsto_i' = [Itsto_i1', ..., Itsto_i6'] of feature maps output by the last 6 convolutional layers, where Itsto_il' is the feature map output by layer l, l = 1, ..., 6; at every coordinate (x, y) on each output feature map Itsto_il', the 3×3 convolutional predictor predicts the scores c_s with which the initial bounding box B_ilq corresponding to (x, y) is classified as a positive or negative text field, together with 5 numbers (Δx̂_s, Δŷ_s, Δŵ_s, Δĥ_s, Δθ̂_s) as the geometric offsets when it is predicted as a positive text field s+.
4. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (2.2) is specifically:
Predict within-layer and cross-layer connections on top of the text fields predicted in (2.1): for within-layer connections, at coordinate (x, y) on the same feature map Itsto_il', the 3×3 convolutional predictor predicts the positive/negative scores c_l1 of the within-layer connections between s(x, y, l) and its 8 neighboring text fields; for cross-layer connections, it predicts the positive/negative scores c_l2 of the cross-layer connections between the layer-l base text field s(x, y, l) and the 4 neighboring text fields on layer l−1; c_l1 and c_l2 form the predicted connection score c_l.
5. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (2.3) is specifically:
Following the results of steps (2.1) and (2.2), at each coordinate (x, y) on every feature map Itsto_il', concatenate the predicted text field score c_s, the text field offsets, the within-layer connection score c_l1, and the cross-layer connection score c_l2 into a 33-dimensional vector, and append an extra softmax layer after the output channels of the convolutional predictor to normalize the text field scores and connection scores separately.
6. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (3.1) is specifically:
Filter the fixed number of text fields s and connections N_s produced by feeding the text image to be detected into the text field detection model in step (2) by their scores, with separate filtering thresholds α for text fields and β for connections; take the filtered text fields s' as nodes and the filtered connections N_s' as edges, and build a graph from them.
7. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (3.3) is specifically: the text field set S obtained by the depth-first search of step (3.2) is combined into a complete word through the following steps:
(3.3.1) Input: |S| is the number of text fields in set S; s^(i) is the i-th text field, with i a superscript; x_s^(i) and y_s^(i) are the center abscissa and ordinate of the i-th text field bounding box s^(i); w_s^(i) and h_s^(i) are its width and height; θ_s^(i) is the angle between s^(i) and the horizontal direction;
(3.3.2) θ_b := (1/|S|) Σ_i θ_s^(i), where θ_b, the offset angle of the output bounding box, is the average offset angle of all text fields in S, and θ_s^(i) is the offset angle of the i-th text field bounding box in the set;
(3.3.3) Find the intercept b of the straight line y = tan(θ_b)·x + b that minimizes the sum of the distances from the center points (x_s^(i), y_s^(i)) of all text fields in S to the line;
(3.3.4) Find the two endpoints (x_p, y_p) and (x_q, y_q) of the line segment, where p denotes the first endpoint and q the second; x_p, y_p are the abscissa and ordinate of the first endpoint, and x_q, y_q those of the second;
(3.3.5) x_b := (x_p + x_q)/2, y_b := (y_p + y_q)/2, where b denotes the output bounding box and x_b, y_b are the abscissa and ordinate of its center;
(3.3.6) w_b := √((x_p − x_q)² + (y_p − y_q)²) + (w_p + w_q)/2, where w_b is the width of the output bounding box, and w_p and w_q are the widths of the bounding boxes centered at points p and q, respectively;
(3.3.7) h_b := (1/|S|) Σ_i h_s^(i), where h_b, the height of the output bounding box, is the average height of all text fields in S, and h_s^(i) is the height of the i-th text field bounding box in the set;
(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b): the output bounding box b is represented by its coordinate, size, and angle parameters;
(3.3.9) Output the combined bounding box b.
CN201710010596.7A 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields Active CN106897732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710010596.7A CN106897732B (en) 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710010596.7A CN106897732B (en) 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Publications (2)

Publication Number Publication Date
CN106897732A CN106897732A (en) 2017-06-27
CN106897732B true CN106897732B (en) 2019-10-08

Family

ID=59197865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710010596.7A Active CN106897732B (en) 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Country Status (1)

Country Link
CN (1) CN106897732B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797774B2 (en) * 2019-07-16 2023-10-24 Ancestry.Com Operations Inc. Extraction of genealogy data from obituaries

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment
CN107766860A (en) * 2017-10-31 2018-03-06 武汉大学 Natural scene image Method for text detection based on concatenated convolutional neutral net
CN107977620B (en) * 2017-11-29 2020-05-19 华中科技大学 Multi-direction scene text single detection method based on full convolution network
CN107844785B (en) * 2017-12-08 2019-09-24 浙江捷尚视觉科技股份有限公司 A kind of method for detecting human face based on size estimation
CN108304835B (en) * 2018-01-30 2019-12-06 百度在线网络技术(北京)有限公司 character detection method and device
CN108427924B (en) * 2018-03-09 2020-06-23 华中科技大学 A Text Regression Detection Method Based on Rotation Sensitive Features
CN108549893B (en) * 2018-04-04 2020-03-31 华中科技大学 An End-to-End Recognition Method for Scene Texts of Arbitrary Shapes
CN109086663B (en) * 2018-06-27 2021-11-05 大连理工大学 Scale-adaptive natural scene text detection method based on convolutional neural network
CN109583367A (en) * 2018-11-28 2019-04-05 网易(杭州)网络有限公司 Image text line detection method and device, storage medium, and electronic device
CN109685718B (en) * 2018-12-17 2020-11-10 中国科学院自动化研究所 Image squaring and scaling method, system and device
CN109886286B (en) * 2019-01-03 2021-07-23 武汉精测电子集团股份有限公司 Target detection method, target detection model and system based on cascade detectors
CN109886264A (en) * 2019-01-08 2019-06-14 深圳禾思众成科技有限公司 A text detection method, device, and computer-readable storage medium
CN109977997B (en) * 2019-02-13 2021-02-02 中国科学院自动化研究所 Fast and robust image object detection and segmentation method based on convolutional neural networks
CN110032969B (en) * 2019-04-11 2021-11-05 北京百度网讯科技有限公司 Method, apparatus, device, and medium for detecting text region in image
CN110490232B (en) * 2019-07-18 2021-08-13 北京捷通华声科技股份有限公司 Method, device, equipment and medium for training a text line direction prediction model
CN113065544B (en) * 2020-01-02 2024-05-10 阿里巴巴集团控股有限公司 Character recognition method, device, and electronic equipment
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111291759A (en) * 2020-01-17 2020-06-16 北京三快在线科技有限公司 Character detection method and device, electronic equipment and storage medium
CN111444674B (en) * 2020-03-09 2022-07-01 稿定(厦门)科技有限公司 Character deformation method, medium and computer equipment
CN113515920B (en) * 2020-04-09 2024-06-21 北京庖丁科技有限公司 Method, electronic device and computer readable medium for extracting formulas from tables
CN111967463A (en) * 2020-06-23 2020-11-20 南昌大学 Curve-fitting detection method for curved text in natural scenes
CN111914822B (en) * 2020-07-23 2023-11-17 腾讯科技(深圳)有限公司 Text image labeling method, device, computer readable storage medium and equipment
CN113888759A (en) * 2021-10-13 2022-01-04 广东金赋科技股份有限公司 Key-value pair extraction method and system based on deep learning model
CN115620081B (en) * 2022-09-27 2023-07-07 北京百度网讯科技有限公司 Training method of target detection model and target detection method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050471B (en) * 2014-05-27 2017-02-01 华中科技大学 Natural scene character detection method and system
CN105989330A (en) * 2015-02-03 2016-10-05 阿里巴巴集团控股有限公司 Picture detection method and apparatus
CN106156711B (en) * 2015-04-21 2020-06-30 华中科技大学 Text line positioning method and device
CN105184312B (en) * 2015-08-24 2018-09-25 中国科学院自动化研究所 A text detection method and device based on deep learning
CN105469047B (en) * 2015-11-23 2019-02-22 上海交通大学 Chinese text detection method and system based on unsupervised deep learning networks
CN105608456B (en) * 2015-12-22 2017-07-18 华中科技大学 A multi-oriented text detection method based on fully convolutional networks
CN105574513B (en) * 2015-12-22 2017-11-24 北京旷视科技有限公司 Text detection method and device

Also Published As

Publication number Publication date
CN106897732A (en) 2017-06-27

Similar Documents

Publication Publication Date Title
CN106897732B (en) A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields
Lyu et al. Multi-oriented scene text detection via corner localization and region segmentation
CN107977620B (en) Multi-oriented scene text single-shot detection method based on fully convolutional networks
Yuliang et al. Detecting curve text in the wild: New dataset and new solution
Ma et al. Rpt: Learning point set representation for siamese visual tracking
Barroso-Laguna et al. Key.Net: Keypoint detection by handcrafted and learned CNN filters revisited
CN106650725B (en) Candidate text box generation and text detection method based on fully convolutional neural network
Donati et al. Deep orientation-aware functional maps: Tackling symmetry issues in shape matching
CN108427924A (en) A text regression detection method based on rotation-sensitive features
CN111079739B (en) Multi-scale attention feature detection method
Xia et al. Loop closure detection for visual SLAM using PCANet features
Yu et al. Robust thermal infrared object tracking with continuous correlation filters and adaptive feature fusion
Zhao et al. Adversarial deep tracking
Han et al. Research on remote sensing image target recognition based on deep convolution neural network
CN113011359B (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
Wang et al. Adaptive temporal feature modeling for visual tracking via cross-channel learning
He et al. Aggregating local context for accurate scene text detection
Sun et al. Pseudo-LiDAR-based road detection
Wang et al. A robust approach for scene text detection and tracking in video
Gu et al. Linear time offline tracking and lower envelope algorithms
Dai et al. RGB-D SLAM with moving object tracking in dynamic environments
Pu et al. Learning temporal regularized correlation filter tracker with spatial reliable constraint
Geng et al. SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation
Li et al. Learning spatial self-attention information for visual tracking
Li et al. Centroid-based graph matching networks for planar object tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant