
CN106897732B - A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields - Google Patents


Info

Publication number
CN106897732B
CN106897732B (application CN201710010596.7A; publication CN106897732A)
Authority
CN
China
Prior art keywords
text
bounding box
text field
connection
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710010596.7A
Other languages
Chinese (zh)
Other versions
CN106897732A (en)
Inventor
白翔 (Bai Xiang)
石葆光 (Shi Baoguang)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201710010596.7A
Publication of CN106897732A
Application granted
Publication of CN106897732B
Active legal status (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225: Image preprocessing by selection of a specific region, based on a marking or identifier characterising the area
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-oriented text detection method for natural images based on linked text fields. Text fields and links are the two key elements of the detection method, defined as follows: a text field is one of many small oriented bounding-box regions marked off on the image, each covering a part of a word or text line; a link joins two adjacent text fields, indicating that they belong to the same word or sentence. Text fields and links are detected jointly, at evenly spaced positions and at multiple scales, by a single fully convolutional neural network trained end to end. The final detection result is obtained by first joining linked text fields into new regions and then combining those regions. Compared with the prior art, the proposed method achieves excellent accuracy, speed, and model simplicity; it is efficient and robust, can cope with complex image backgrounds, and can also detect non-Latin text and long text in images.

Description

A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Technical Field

The invention belongs to the technical field of computer vision, and more specifically relates to a multi-oriented text detection method for natural images based on linked text fields.

Background

Reading text in natural images is a challenging and popular task with many practical applications in photo OCR, geolocation, and image retrieval. In a text reading system, text detection, which locates text regions with bounding boxes at the word or text-line level, is usually the critical first step. In a sense, text detection can also be viewed as a special case of object detection in which words, characters, or text lines are the detection targets.

Although existing techniques have had great success in applying object detection methods to text detection, object detection methods still have several obvious shortcomings when localizing text regions. First, the aspect ratio of a word or text line is usually much larger than that of ordinary objects, and previous methods struggle to produce bounding boxes of such proportions. Second, some non-Latin scripts, such as Chinese, contain no spaces between adjacent words; existing techniques can only detect words, and therefore fail on such text, since text without spaces provides no visual cue for separating words. Third, in large natural scene images text may appear at any orientation, yet the vast majority of existing techniques can only detect horizontal text. Text detection in natural scene images therefore remains one of the difficult problems in computer vision.

Summary of the Invention

The object of the present invention is to provide a multi-oriented text detection method for natural images based on linked text fields. The method detects text with high accuracy and speed, uses a simple model, is robust, copes with complex image backgrounds, and can also detect long non-Latin text.

To achieve the above object, the present invention approaches scene text detection from a new perspective and provides a multi-oriented text detection method for natural images based on linked text fields, comprising the following steps:

(1) Train the text-field and link detection network model, comprising the following sub-steps:

(1.1) Annotate the text content of all images in the training set at the word level; each label is the four corner coordinates of the word's initial rectangular bounding box, yielding the training data set;

(1.2) Define a text-field detection model that predicts text fields and links from the word labels. The network model consists of a cascaded convolutional neural network and convolutional predictors. Compute the text-field and link labels from the above training data set, design a loss function, and train the network by back-propagation, combined with online augmentation and online hard negative mining, to obtain the text-field detection model. This comprises the following sub-steps:

(1.2.1) Build the text-field detection convolutional neural network: the first convolutional units, which extract features, come from a pre-trained VGG-16 network (convolutional layer 1 through pooling layer 5); fully connected layers 6 and 7 are converted into convolutional layers 6 and 7, respectively. These are followed by additional convolutional layers that extract deeper features for detection, namely convolutional layers 8, 9, and 10, with convolutional layer 11 as the last layer. The last six of these convolutional layers output feature maps of different sizes, from which high-quality features at multiple scales are extracted; text fields and links are detected on these six feature maps. After each of these six layers, a filter of size 3×3 is added as a convolutional predictor that jointly detects text fields and links;

(1.2.2) Generate text-field bounding-box labels from the annotated word bounding boxes: for the original training image set Itr, denote by Itr′ the scaled training image set, with wI and hI the width and height of images in Itr′ (for example 384×384 or 512×512 pixels). The i-th image Itri′ is the model input, and all word bounding boxes annotated on Itri′ are denoted Wi=[Wi1,...,Wip], where Wij is the j-th word bounding box on the i-th image (the annotation may be at word or phrase level), j=1,...,p, and p is the total number of word bounding boxes on the image. The feature maps output by the last six convolutional layers form the set Itroi′=[Itroi1′,...,Itroi6′], where Itroil′ is the feature map output by the l-th of those layers, with width wl and height hl. The coordinate (x, y) on Itroil′ corresponds to a horizontal initial bounding box Bilq on Itri′ with center (xa, ya), satisfying the following formulas:

xa = (wI / wl)(x + 0.5), ya = (hI / hl)(y + 0.5)
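The mapping from a feature-map cell (x, y) to the center (xa, ya) of its initial bounding box on the input image can be sketched in Python (a minimal illustration assuming the standard mapping xa = (wI/wl)(x + 0.5), ya = (hI/hl)(y + 0.5); the function name is ours):

```python
def anchor_center(x, y, w_l, h_l, w_img, h_img):
    """Map feature-map cell (x, y) on a layer-l map of size (w_l, h_l)
    to the center of its initial (anchor) bounding box on the
    w_img x h_img input image."""
    x_a = (w_img / w_l) * (x + 0.5)
    y_a = (h_img / h_l) * (y + 0.5)
    return x_a, y_a

# A 512x512 input and a 64x64 feature map: cell (0, 0) maps to (4.0, 4.0).
print(anchor_center(0, 0, 64, 64, 512, 512))
```

Each cell thus owns the 8×8 pixel patch centered at (xa, ya), which is why the six feature maps of different sizes yield anchors at six different densities.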

The width and height of the initial bounding box Bilq are both set to a constant al, which controls the scale of the output text fields, l=1,...,6. The set of initial bounding boxes corresponding to the feature map Itroil′ of layer l is Bil=[Bil1,...,Bilm], q=1,...,m, where m is the number of initial bounding boxes on that feature map. An initial bounding box Bilq is labeled positive (label value 1) if its center is contained inside any annotated word bounding box Wij on Itr′ and its size al and the height h of that word bounding box satisfy

max(al / h, h / al) ≤ 1.5,

in which case it is matched to the word bounding box Wij whose height is closest to al. Otherwise, when Bilq satisfies neither of the above conditions for any word bounding box in Wi, it is labeled negative (label value 0). Text fields are generated on the initial bounding boxes and take the same label as their initial bounding box. The proportionality constant 1.5 is an empirical value;
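The labeling rule above can be sketched as follows (a minimal illustration; the tuple layout of the word boxes and the simplification of the center-inside test to axis-aligned boxes are ours):

```python
def label_anchor(center, size_a, word_boxes, ratio=1.5):
    """Label an initial bounding box positive (1) or negative (0).
    center: (x_a, y_a) of the anchor; size_a: the layer constant a_l;
    word_boxes: list of (x_min, y_min, x_max, y_max, height) tuples,
    simplified here to axis-aligned boxes. Returns (label, index of the
    matched word box whose height is closest to size_a, or None)."""
    x, y = center
    best = None
    for idx, (x0, y0, x1, y1, h) in enumerate(word_boxes):
        inside = x0 <= x <= x1 and y0 <= y <= y1
        if inside and max(size_a / h, h / size_a) <= ratio:
            if best is None or abs(h - size_a) < abs(word_boxes[best][4] - size_a):
                best = idx
    return (1, best) if best is not None else (0, None)
```

For a word box of height 20, an anchor of size 18 inside it is positive (ratio ≈ 1.11), while an anchor of size 50 is negative (ratio 2.5 > 1.5).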

(1.2.3) Generate text fields on the labeled initial bounding boxes produced in step (1.2.2) and compute the offsets of the positive text fields: a negative text-field bounding box s− is simply its negative initial bounding box B−. A positive text-field bounding box s+ is obtained from its positive initial bounding box B+ as follows: a) let θs be the angle between the annotated word bounding box W matched to B+ and the horizontal direction, and rotate W clockwise by θs around the center of B+; b) crop W, removing the parts extending beyond the left and right sides of B+; c) rotate the cropped word bounding box W′ counterclockwise by θs around the center of B+, obtaining the geometric parameters xs, ys, ws, hs, θs of the ground-truth text field s+; d) compute the offsets (Δxs, Δys, Δws, Δhs, Δθs) of s+ relative to B+ from the following formulas:

xs = al·Δxs + xa

ys = al·Δys + ya

ws = al·exp(Δws)

hs = al·exp(Δhs)

θs = Δθs

where xs, ys, ws, hs, θs are, respectively, the center abscissa, center ordinate, width, height, and angle to the horizontal of the text-field bounding box s+; xa, ya, wa, ha are the center abscissa, center ordinate, width, and height of the horizontal initial bounding box B+; and Δxs, Δys, Δws, Δhs, Δθs are, respectively, the offset of the center abscissa xs of s+ relative to the initial bounding box B+, the offset of the ordinate ys relative to the initial bounding box, the change in width ws, the change in height hs, and the offset of the angle θs;
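The formulas above can be read as a decoding step that recovers the text-field geometry from the offsets (a minimal sketch; the function name is ours):

```python
import math

def decode_segment(offsets, anchor_center, a_l):
    """Recover the text-field geometry (x_s, y_s, w_s, h_s, theta_s)
    from the offsets (dx, dy, dw, dh, dtheta), the anchor center
    (x_a, y_a), and the layer's anchor size a_l, per the formulas above."""
    dx, dy, dw, dh, dtheta = offsets
    x_a, y_a = anchor_center
    x_s = a_l * dx + x_a
    y_s = a_l * dy + y_a
    w_s = a_l * math.exp(dw)     # exp keeps width and height positive
    h_s = a_l * math.exp(dh)
    theta_s = dtheta
    return x_s, y_s, w_s, h_s, theta_s
```

Zero offsets decode to an axis-aligned square of size al at the anchor center, which is the reference shape the network regresses from.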

(1.2.4) Compute link labels for the text-field bounding boxes produced in step (1.2.3): text fields s are generated on the initial bounding boxes B, so the link labels between text fields equal the link labels between their corresponding initial bounding boxes B. For the feature map set Itroi′=[Itroi1′,...,Itroi6′]: if, within the initial bounding box set Bil of the same feature map Itroil′, two initial bounding boxes are both labeled positive and are matched to the same word, the within-layer link between them is labeled positive; otherwise it is labeled negative. If an initial bounding box in the set Bil of feature map Itroil′ and an initial bounding box in the set Bi(l-1) of feature map Itroi(l-1)′ are both labeled positive and matched to the same word bounding box Wij, the cross-layer link between them is labeled positive; otherwise it is labeled negative;

(1.2.5) Feed the scaled training image set Itr′ to the text-field detection model and predict the text-field output s: initialize the model weights and biases, set the learning rate to 10^-3 for the first 60,000 training iterations, and decay it to 10^-4 thereafter. For each of the last six convolutional layers, the coordinate (x, y) on the layer-l feature map Itroil′ corresponds to the initial bounding box Bilq on the input image Itri′ with center (xa, ya) and size al; the 3×3 convolutional predictor outputs the scores cs with which Bilq is classified as positive or negative, where cs is a two-dimensional vector whose entries are decimals between 0 and 1. It also predicts five numbers as the geometric offsets for the case in which Bilq is classified as a positive text field s+, namely the predicted offsets of the center abscissa and ordinate of the text-field bounding box s+ relative to the positive initial bounding box B+, the change in height, the change in width, and the angle offset;

(1.2.6) Predict the within-layer and cross-layer link outputs on the basis of the predicted text fields. For within-layer links, at coordinate point (x, y) on a feature map Itroil′, take the neighboring points (x′, y′) in the range x-1 ≤ x′ ≤ x+1, y-1 ≤ y′ ≤ y+1; mapping these eight points back to the input image Itri′ gives the eight within-layer neighbor text fields s(x′, y′, l) linked to the reference text field s(x, y, l) at (x, y), which form the set:

{ s(x′, y′, l) : x-1 ≤ x′ ≤ x+1, y-1 ≤ y′ ≤ y+1, (x′, y′) ≠ (x, y) }

The 3×3 convolutional predictor predicts the positive and negative scores cl1 of the links between s(x, y, l) and its eight within-layer neighbors; cl1 is a 16-dimensional vector (two scores per link), and the superscript w denotes a within-layer link;

For cross-layer links, a cross-layer link connects the text fields at corresponding points on the feature maps output by two consecutive convolutional layers. Because each convolutional layer halves the width and height of the feature map, the width wl and height hl of the layer-l output feature map Itroil′ are half the width wl-1 and height hl-1 of the layer-(l-1) feature map Itroi(l-1)′, and the initial-bounding-box scale al of Itroil′ is twice the scale al-1 of Itroi(l-1)′. For (x, y) on the layer-l output feature map Itroil′, take the four cross-layer neighbor points (x′, y′) on Itroi(l-1)′ in the range 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1; the initial bounding box that (x, y) on Itroil′ maps to on the input image Itri′ exactly coincides in spatial position with the four initial bounding boxes that the four cross-layer neighbor points on Itroi(l-1)′ map to. The four cross-layer neighbor text fields s(x′, y′, l-1) form the set:

{ s(x′, y′, l-1) : 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 }

The 3×3 convolutional predictor predicts the positive and negative scores cl2 of the cross-layer links between the layer-l reference text field s(x, y, l) and its four neighbor text fields on layer l-1; cl2 is an 8-dimensional vector containing the positive and negative scores of the links between s(x, y, l) and each of its four cross-layer neighbors, and the superscript c denotes a cross-layer link;

All within-layer links and all cross-layer links together form the link set Ns;
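The two neighborhoods defined above, eight within-layer neighbors and four cross-layer neighbors, can be enumerated as follows (a minimal sketch; the bounds checking against the feature-map size is our addition):

```python
def within_layer_neighbors(x, y, w_l, h_l):
    """The up-to-8 within-layer neighbor coordinates of (x, y) on a
    layer-l feature map of size (w_l, h_l)."""
    return [(xp, yp)
            for xp in range(x - 1, x + 2)
            for yp in range(y - 1, y + 2)
            if (xp, yp) != (x, y) and 0 <= xp < w_l and 0 <= yp < h_l]

def cross_layer_neighbors(x, y):
    """The 4 cross-layer neighbor coordinates of (x, y) on the
    layer-(l-1) feature map, which is twice as large in each dimension."""
    return [(xp, yp)
            for xp in range(2 * x, 2 * x + 2)
            for yp in range(2 * y, 2 * y + 2)]
```

An interior cell has exactly 8 within-layer neighbors; a corner cell has 3. Every cell has exactly 4 cross-layer neighbors, since the coarser map's cell covers a 2×2 block of the finer map.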

(1.2.7) Take the text-field labels and ground-truth positive text-field offsets obtained in steps (1.2.3) and (1.2.4), together with the link labels, as the reference output; take the text-field classes and scores predicted in step (1.2.5), the predicted text-field offsets, and the link scores predicted in step (1.2.6) as the predicted output. Design an objective loss function between the predicted output and the reference output, and train the text-field and link detection model continuously by back-propagation to minimize the losses of text-field classification, text-field offset regression, and link classification. The objective loss function is the weighted sum of three losses:

L(ys, cs, yl, cl, ŝ, s) = (1/ns)·Lconf(ys, cs) + λ1·(1/ns)·Lloc(ŝ, s) + λ2·(1/nl)·Lconf(yl, cl)

where ys is the vector of all text-field labels, cs the predicted text-field scores, yl the link labels, and cl the predicted link scores, composed of the within-layer link scores cl1 and the cross-layer scores cl2. If the i-th initial bounding box is labeled positive then ys(i) = 1, otherwise 0. Lconf(ys, cs) is the softmax loss on the predicted text-field scores cs; Lconf(yl, cl) is the softmax loss on the predicted link scores cl; Lloc(ŝ, s) is the smooth L1 regression loss between the predicted text-field geometric parameters ŝ and the ground truth s. ns is the number of positive initial bounding boxes, used to normalize the text-field classification and regression losses; nl is the total number of positive links, used to normalize the link classification loss; λ1 and λ2 are weight constants, set to 1 in practice;
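The weighted three-term objective can be sketched numerically (a minimal scalar illustration, assuming the three component losses have already been summed over their samples; the function name is ours):

```python
def weighted_detection_loss(l_seg_cls, l_loc, l_link_cls, n_s, n_l,
                            lambda1=1.0, lambda2=1.0):
    """Total objective: (1/n_s)*L_conf(seg) + lambda1*(1/n_s)*L_loc
    + lambda2*(1/n_l)*L_conf(link), where n_s is the number of positive
    initial bounding boxes and n_l the number of positive links."""
    return (l_seg_cls / n_s
            + lambda1 * l_loc / n_s
            + lambda2 * l_link_cls / n_l)
```

Normalizing the first two terms by ns and the third by nl keeps the three terms on comparable scales regardless of how many positives an image contains.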

(1.2.8) During the training of step (1.2.7), augment the training data Itr online and balance positive and negative samples with online hard negative mining. Before the training images Itr are scaled to a common size and loaded in batches, they are randomly cropped into patches such that each patch has at least a minimum Jaccard overlap o with the ground-truth bounding box of a text field. For multi-oriented text, augmentation is performed on the minimum enclosing rectangle of the oriented text bounding box. The overlap o of each sample is chosen at random from 0, 0.1, 0.3, 0.5, 0.7, and 0.9, and the patch size is between 0.1 and 1 times the original image size; training images are not flipped horizontally. In addition, because negative text-field and link samples make up the majority of the training samples, online hard negative mining is applied, separately for text fields and links, to keep the ratio of negative to positive samples at no more than 3:1.
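Online hard negative mining with the 3:1 cap can be sketched as follows (a minimal illustration operating on per-sample loss values; the function name is ours):

```python
def hard_negative_mine(pos_losses, neg_losses, ratio=3):
    """Keep all positive samples and only the hardest negatives,
    i.e. those with the largest loss, capping the number of negatives
    at ratio * number-of-positives."""
    k = min(len(neg_losses), ratio * max(len(pos_losses), 1))
    hardest = sorted(neg_losses, reverse=True)[:k]
    return pos_losses, hardest
```

Selecting negatives by descending loss focuses training on the background regions the network currently confuses with text, instead of the overwhelming number of easy negatives.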

(2) Use the trained convolutional neural network to detect text fields and links in the text image under test, comprising the following sub-steps:

(2.1) Detect text fields in the image under test; feature maps output by different convolutional layers predict text fields of different scales, while the feature map output by one convolutional layer predicts text fields of one scale. Scale the i-th test image Itsti in the test set Itst to a uniform size, which may be set manually to suit the images under test; denote the scaled image Itsti′. Feed Itsti′ to the text-field and link detection model trained in step (1.2) to obtain the set Itstoi′=[Itstoi1′,...,Itstoi6′] of feature maps output by the last six convolutional layers, where Itstoil′ is the feature map output by the l-th of those layers, l=1,...,6. At each coordinate (x, y) of each output feature map Itstoil′, the 3×3 convolutional predictor predicts the scores cs with which the corresponding initial bounding box Bilq is classified as a positive or negative text field, together with five numbers as the geometric offsets for the case in which it is predicted as a positive text field s+;

(2.2) Detect links between the text fields detected on all feature layers; links comprise within-layer links and cross-layer links. On the basis of the text fields predicted in (2.1), predict the within-layer and cross-layer links: for within-layer links, at coordinate point (x, y) on a feature map Itstoil′, the 3×3 convolutional predictor predicts the positive and negative scores cl1 of the within-layer links between s(x, y, l) and its eight neighbor text fields; for cross-layer links, the 3×3 convolutional predictor predicts the positive and negative cross-layer link scores cl2 between the layer-l reference text field s(x, y, l) and its four neighbor text fields on layer l-1; cl1 and cl2 form the predicted link score cl;

(2.3) Combine the detected text-field confidence scores and link confidence scores, where the text-field confidence scores comprise the positive/negative class scores and the offset predictions; the convolutional predictors output softmax-normalized scores for both the text fields and the links.

(3) Combine the text fields and links to obtain the output bounding boxes, comprising the following sub-steps:

(3.1) Using the normalized scores obtained in (2.3), filter the text fields and links output by the convolutional predictors, then build a link graph with the filtered text fields as nodes and the links as edges: the fixed number of text fields s and links Ns produced by feeding the test image to the detection model in step (2) are filtered by their scores, with separate thresholds α for text fields and β for links. The thresholds may be set manually for different data; in practice one may take α=0.9, β=0.7 for multi-oriented text images, α=0.9, β=0.5 for multilingual long-text images, and α=0.6, β=0.3 for horizontal text. Take the filtered text fields s′ as nodes and the filtered links Ns′ as edges, and build a graph from them;

(3.2) Perform a depth-first search on the graph to find the connected components; each component, denoted as a set B, contains the text fields joined together by links;
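Steps (3.1) and (3.2) amount to thresholding, building a graph, and extracting connected components by depth-first search, which can be sketched as (the function name and the plain edge-list representation are ours):

```python
def connected_components(n_nodes, edges):
    """Depth-first search over an undirected graph given as an edge
    list; returns the connected components as sets of node indices."""
    adj = {i: [] for i in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj[u])
        components.append(comp)
    return components
```

Each returned component is one set B of linked text fields, which step (3.3) then merges into a single word box.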

(3.3) Combine the text fields in each set S obtained by the depth-first search of step (3.2) into a complete word, through the following steps:

(3.3.1) Input: |S| is the number of segments in the set S, s^(i) is the i-th segment (i is a superscript); x_s^(i) and y_s^(i) are the center abscissa and ordinate of the i-th segment bounding box s^(i); w_s^(i) and h_s^(i) are its width and height; θ_s^(i) is the angle between the segment bounding box s^(i) and the horizontal direction;

(3.3.2) θ_b = (1/|S|)·Σ_i θ_s^(i), where θ_b is the offset angle of the output bounding box and θ_s^(i) is the offset angle of the i-th segment bounding box in the set; θ_b is obtained as the average offset angle of all segments in S;

(3.3.3) Find the intercept b of the straight line y = tan(θ_b)·x + b such that the sum of the distances from the center points (x_s^(i), y_s^(i)) of all segments in S to the line is minimized;

(3.3.4) Find the two endpoints (x_p, y_p) and (x_q, y_q) of the line segment, where p denotes the first endpoint and q the second; x_p and y_p are the abscissa and ordinate of the first endpoint, and x_q and y_q those of the second;

(3.3.5) x_b = (x_p + x_q)/2, y_b = (y_p + y_q)/2, where b denotes the output bounding box and x_b and y_b are the abscissa and ordinate of its center;

(3.3.6) w_b = ||(x_p, y_p) − (x_q, y_q)|| + (w_p + w_q)/2, where w_b is the width of the output bounding box, and w_p and w_q are the widths of the bounding boxes centered at point p and at point q respectively;

(3.3.7) h_b = (1/|S|)·Σ_i h_s^(i), where h_b is the height of the output bounding box and h_s^(i) is the height of the i-th segment bounding box in the set; h_b is obtained as the average height of all segments in S;

(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b); the output bounding box b is represented by its coordinate, size and angle parameters;

(3.3.9) Output the combined bounding box b.
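The combination procedure of steps (3.3.1) to (3.3.9) can be sketched as follows. This is an illustrative sketch, not the claimed implementation; all function and variable names are assumptions, and the intercept of step (3.3.3) is computed here by ordinary least squares on the segment centers, which is one reading of "minimizing the sum of the distances":

```python
import math

def combine_segments(segments):
    # segments: list of (x, y, w, h, theta) tuples for one connected component
    n = len(segments)
    # (3.3.2) average angle of all segments in the set
    theta_b = sum(s[4] for s in segments) / n
    # (3.3.3) intercept of the line y = tan(theta_b)*x + b fitted to the centers
    t = math.tan(theta_b)
    b = sum(s[1] - t * s[0] for s in segments) / n
    # (3.3.4) project each center onto the line; the extreme projections
    # give the two endpoints (x_p, y_p) and (x_q, y_q)
    def project(s):
        x_proj = (s[0] + t * (s[1] - b)) / (1 + t * t)
        return x_proj, t * x_proj + b
    proj = [project(s) for s in segments]
    (x_p, y_p), s_p = min(zip(proj, segments), key=lambda z: z[0][0])
    (x_q, y_q), s_q = max(zip(proj, segments), key=lambda z: z[0][0])
    # (3.3.5) box center is the midpoint of the two endpoints
    x_b, y_b = (x_p + x_q) / 2, (y_p + y_q) / 2
    # (3.3.6) width: endpoint distance plus half-widths of the end segments
    w_b = math.hypot(x_q - x_p, y_q - y_p) + (s_p[2] + s_q[2]) / 2
    # (3.3.7) height: average height of all segments
    h_b = sum(s[3] for s in segments) / n
    # (3.3.8) output box parameters
    return (x_b, y_b, w_b, h_b, theta_b)
```

For two horizontal segments the sketch reduces to the midpoint of their centers, the endpoint distance plus their half-widths, and their average height.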

Compared with the prior art, the above technical solutions conceived by the present invention provide the following technical effects:

(1) Multi-oriented text can be detected: text in natural scene images is often arbitrarily oriented or distorted. In the method of the present invention, a text region is described locally by segment bounding boxes, and a segment bounding box can be set to any orientation, so text of multiple orientations or distorted shapes can be covered.

(2) High flexibility: the method can also detect text lines of arbitrary length, because combining segments depends only on the predicted links; both words and text lines can therefore be detected;

(3) Strong robustness: the method describes text locally with segment bounding boxes; this local description can cope with complex natural image backgrounds and capture text features from the image;

(4) High efficiency: the segment detection model of the method is trained end to end and can process more than 20 images of size 512×512 per second, because the segments and links are obtained in a single forward pass of the fully convolutional CNN model, without offline scaling or rotation of the input image;

(5) Strong generality: some non-Latin scripts, such as Chinese, contain no spaces between adjacent words. Existing techniques can only detect words and are therefore inapplicable to such text, because text without spaces provides no visual cues for separating words. Besides Latin script, the present invention can also detect long non-Latin text, because the method does not rely on spaces to provide visual separation information.

Description of Drawings

Fig. 1 is a flow chart of the multi-oriented text detection in natural images based on linked text segments according to the present invention;

Fig. 2 is a schematic diagram of computing the parameters of the segment ground-truth labels in the present invention;

Fig. 3 is a schematic diagram of the composition of the convolutional predictor outputs in the present invention;

Fig. 4 is the network connection diagram of the segment-and-link detection model of the present invention;

Fig. 5 shows, for one embodiment of the present invention, the result of detecting segments and links and producing output bounding boxes for a text image to be detected, using the trained segment-and-link detection network model.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.

The technical terms of the present invention are first explained below:

Convolutional Neural Network (CNN): a neural network that can be used for tasks such as image classification and regression. Such a network usually consists of convolutional layers, downsampling layers and fully connected layers. The convolutional and downsampling layers extract image features, while the fully connected layers perform classification or regression. The network parameters include the convolution kernels and the weights and biases of the fully connected layers, and they can be learned from data by the back-propagation algorithm;

VGG16: the runner-up of ILSVRC 2014 was VGGNet, a classic convolutional neural network model containing 16 CONV/FC layers, with a very uniform and appealing architecture that performs only 3×3 convolutions and 2×2 pooling from beginning to end. Its pre-trained models are available for plug-and-play use with Caffe. It demonstrated that network depth is a key component of good performance.

Depth-First Search (DFS): an algorithm for traversing or searching a tree or graph. The nodes are traversed along the depth of the tree, exploring each branch as deeply as possible. When all edges at a node v have been explored, the search backtracks to the start node of the edge on which v was discovered. This process continues until all nodes reachable from the source node have been found. If undiscovered nodes remain, one of them is selected as a new source node and the process is repeated until every node has been visited. DFS is a classic algorithm in graph theory; it can produce a topological ordering of the target graph, which in turn conveniently solves many related graph problems, such as the maximum-path problem.
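Step (3.2) of the method uses exactly this idea to group the filtered segments into connected components. A minimal iterative sketch of DFS-based connected components, with illustrative names:

```python
def connected_components(nodes, edges):
    # Build an adjacency list from undirected edges (links between segments).
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for source in nodes:
        if source in seen:
            continue
        stack, comp = [source], []
        seen.add(source)
        while stack:                 # iterative depth-first search
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)      # one set B of linked segments
    return components
```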

As shown in Fig. 1, the method of the present invention for detecting multi-oriented text in natural scenes based on linked text segments comprises the following steps:

(1) Train the segment-and-link detection network model, comprising the following sub-steps:

(1.1) Annotate the text content of all text images in the training image set at word level, the label being the four corner coordinates of each word's initial rectangular bounding box, to obtain the training data set;

(1.2) Define a segment detection model that predicts output segments and links from the word annotations. The network model consists of a cascaded convolutional neural network and convolutional predictors. Compute the segment and link labels from the above training data set, design a loss function, and train the network by back-propagation, combined with online augmentation and online hard negative mining, to obtain the segment detection model, comprising the following sub-steps:

(1.2.1) Build the segment-detection convolutional neural network model: the first feature-extraction units come from the pre-trained VGG-16 network, namely convolutional layer 1 through pooling layer 5; fully connected layers 6 and 7 are converted into convolutional layers 6 and 7, followed by additional convolutional layers used to extract deeper features for detection, namely convolutional layers 8, 9 and 10, with convolutional layer 11 as the last layer. The last 6 of these convolutional layers output feature maps of different sizes, allowing high-quality features at multiple scales to be extracted; segments and links are detected on these six feature maps of different sizes. After each of these 6 convolutional layers, a filter of size 3×3 is added as a convolutional predictor to jointly detect segments and links;
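The per-location output of each 3×3 predictor can be tallied from the score dimensions given later in the text: 2 segment class scores c_s, 5 geometric offsets, a 16-dimensional within-layer link score c_l1, and an 8-dimensional cross-layer link score c_l2. A small sketch, assuming (as the cross-layer definition implies) that the first of the six prediction maps has no shallower map to link to; names are illustrative:

```python
def predictor_channels(layer_index):
    # Per-location outputs of one 3x3 convolutional predictor (a sketch).
    segment_cls = 2          # positive/negative segment scores c_s
    segment_geo = 5          # (dx, dy, dw, dh, dtheta) offsets
    within_links = 8 * 2     # c_l1: 8 within-layer neighbors, pos/neg each
    # c_l2: 4 cross-layer neighbors, pos/neg each; the first of the six
    # prediction maps has no layer l-1 to link to
    cross_links = 0 if layer_index == 1 else 4 * 2
    return segment_cls + segment_geo + within_links + cross_links
```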

(1.2.2) Generate segment bounding-box labels from the annotated word bounding boxes: for the original training image set Itr, denote the scaled training image set by Itr′, with w_I and h_I the width and height of Itr′, which can be 384×384 or 512×512 pixels. The i-th image Itr_i′ is the model input, and all word bounding boxes annotated on Itr_i′ are denoted W_i = [W_i1, ..., W_ip], where W_ij is the j-th word bounding box on the i-th image (a word bounding box may be at word level or at line level), j = 1, ..., p, and p is the total number of word bounding boxes on the i-th image. Denote the feature maps output by the last 6 convolutional layers by the set Itro_i′ = [Itro_i1′, ..., Itro_i6′], where Itro_il′ is the feature map output by the l-th of those layers, with width w_l and height h_l. A coordinate (x, y) on Itro_il′ corresponds to a horizontal initial bounding box B_ilq on Itr_i′ centered at (x_a, y_a), satisfying the following formulas: x_a = (w_I / w_l)·(x + 0.5), y_a = (h_I / h_l)·(y + 0.5).

The width and height of the initial bounding box B_ilq are both set to a constant a_l that controls the scale of the output segments, l = 1, ..., 6. Denote the set of initial bounding boxes corresponding to the feature map Itro_il′ output by the l-th layer as B_il = [B_il1, ..., B_ilm], q = 1, ..., m, where m is the number of initial bounding boxes on that feature map. An initial bounding box B_ilq is labelled positive (label value 1) and matched to the word bounding box W_ij of closest height as long as its center lies inside some annotated word bounding box W_ij on Itr′ and its size a_l and the height h of that word bounding box satisfy max(a_l/h, h/a_l) ≤ 1.5; otherwise, when B_ilq satisfies these two conditions for no word bounding box in W_i, it is labelled negative (label value 0). Segments are generated on the initial bounding boxes and share their label class; the proportionality constant 1.5 is an empirical value;
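The positive/negative labelling rule for an initial bounding box can be sketched as follows for an axis-aligned word annotation; rotated word boxes, which the patent handles via the rotation and cropping of step (1.2.3), are omitted for brevity, and the names are illustrative assumptions:

```python
def anchor_matches_word(cx, cy, a_l, word_box):
    # word_box: (xmin, ymin, xmax, ymax) of an axis-aligned word annotation.
    # Rule from the text: the anchor center (cx, cy) must lie inside the word
    # box, and max(a_l / h, h / a_l) <= 1.5, where h is the word height.
    xmin, ymin, xmax, ymax = word_box
    center_inside = xmin <= cx <= xmax and ymin <= cy <= ymax
    h = ymax - ymin
    size_ok = max(a_l / h, h / a_l) <= 1.5
    return center_inside and size_ok
```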

(1.2.3) Generate segments on the labelled initial bounding boxes produced in step (1.2.2) and compute the offsets of positive segments: a negative segment bounding box s− is its negative initial bounding box B−; a positive segment bounding box s+ is obtained from its positive initial bounding box B+ through the following steps: a) let θ_s be the angle between the horizontal direction and the annotated word bounding box W matched to B+, and rotate W clockwise by θ_s about the center of B+; b) crop W, removing the parts extending beyond the left and right sides of B+; c) rotate the cropped word bounding box W′ counterclockwise by θ_s about the center of B+, obtaining the geometric parameters x_s, y_s, w_s, h_s, θ_s of the ground-truth label of segment s+; d) compute the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) of s+ relative to B+ from the following formulas:

x_s = a_l·Δx_s + x_a

y_s = a_l·Δy_s + y_a

w_s = a_l·exp(Δw_s)

h_s = a_l·exp(Δh_s)

θ_s = Δθ_s

where x_s, y_s, w_s, h_s and θ_s are respectively the center abscissa, center ordinate, width, height and horizontal angle of the segment bounding box s+; x_a, y_a, w_a and h_a are respectively the center abscissa, center ordinate, width and height of the horizontal initial bounding box B+; and Δx_s, Δy_s, Δw_s, Δh_s and Δθ_s are respectively the offsets of the center abscissa x_s and ordinate y_s of s+ relative to B+, the offset changes of its width w_s and height h_s, and the offset of its angle θ_s;
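The five formulas above define a one-to-one mapping between segment geometry and offsets. A sketch of both directions, with illustrative names: decoding turns predicted offsets into segment geometry at inference time, and encoding builds the regression targets in training:

```python
import math

def decode_segment(offsets, x_a, y_a, a_l):
    # Apply the patent's formulas: anchor (x_a, y_a) of size a_l plus offsets
    # -> segment geometry (x_s, y_s, w_s, h_s, theta_s).
    dx, dy, dw, dh, dtheta = offsets
    return (a_l * dx + x_a,
            a_l * dy + y_a,
            a_l * math.exp(dw),
            a_l * math.exp(dh),
            dtheta)

def encode_segment(segment, x_a, y_a, a_l):
    # The inverse mapping, used to compute ground-truth offsets.
    x_s, y_s, w_s, h_s, theta_s = segment
    return ((x_s - x_a) / a_l,
            (y_s - y_a) / a_l,
            math.log(w_s / a_l),
            math.log(h_s / a_l),
            theta_s)
```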

(1.2.4) Compute link labels for the segment bounding boxes produced in step (1.2.3): segments s are generated on the initial bounding boxes B, so the link labels between segments equal the link labels between their corresponding initial bounding boxes B. For the feature map set Itro_i′ = [Itro_i1′, ..., Itro_i6′]: if, within the initial bounding box set B_il of one feature map Itro_il′, two initial bounding boxes are both labelled positive and matched to the same word, then the within-layer link between them is labelled positive, otherwise negative; if an initial bounding box in the set B_il corresponding to Itro_il′ and an initial bounding box in the set B_i(l−1) corresponding to Itro_i(l−1)′ are both labelled positive and matched to the same word bounding box W_ij, then the cross-layer link between them is labelled positive, otherwise negative;

(1.2.5) Take the scaled training image set Itr′ as the input of the segment detection model and predict the segment outputs s: initialize the model weights and biases; the learning rate is set to 10⁻³ for the first 60,000 training iterations and then decays to 10⁻⁴. For the last 6 convolutional layers, at coordinate (x, y) on the feature map Itro_il′ of the l-th layer, (x, y) corresponds on the input image Itr_i′ to the initial bounding box B_ilq centered at (x_a, y_a) with size a_l; the 3×3 convolutional predictor predicts the scores c_s of B_ilq being classified as positive and negative, where c_s is a two-dimensional vector taking decimal values between 0 and 1. It also predicts 5 numbers as the geometric offsets when the box is classified as a positive segment s+, namely the predicted offsets of the center abscissa and ordinate of the segment bounding box s+ relative to the positive initial bounding box B+, the offset changes of its height and width, and its angle offset;

(1.2.6) Predict the within-layer and cross-layer link outputs on top of the predicted segments: for within-layer links, at a coordinate point (x, y) on a feature map Itro_il′, take the neighboring points (x′, y′) in the range x−1 ≤ x′ ≤ x+1, y−1 ≤ y′ ≤ y+1. Mapping these 8 points onto the input image Itr_i′ yields the 8 within-layer neighboring segments s(x′, y′, l) linked to the reference segment s(x, y, l) corresponding to (x, y); these 8 within-layer neighbors can be written as the set N^w_(x,y,l) = { s(x′, y′, l) : x−1 ≤ x′ ≤ x+1, y−1 ≤ y′ ≤ y+1, (x′, y′) ≠ (x, y) }.

The 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the links between s(x, y, l) and the within-layer neighbor set; c_l1 is a 16-dimensional vector, and the superscript w denotes a within-layer link;

For cross-layer links, a cross-layer link joins the segments corresponding to two points on the feature maps output by two consecutive convolutional layers. Since each convolutional layer halves the width and height of the feature map, the width w_l and height h_l of the l-th output feature map Itro_il′ are half the width w_{l−1} and height h_{l−1} of the (l−1)-th feature map Itro_i(l−1)′, and the initial-bounding-box scale a_l of Itro_il′ is twice the scale a_{l−1} of Itro_i(l−1)′. For (x, y) on the l-th output feature map Itro_il′, take the 4 cross-layer neighboring points (x′, y′) in the range 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 on the feature map Itro_i(l−1)′; the initial bounding box on the input image Itr_i′ corresponding to (x, y) on Itro_il′ exactly coincides spatially with the 4 initial bounding boxes on Itr_i′ corresponding to the 4 cross-layer neighboring points on Itro_i(l−1)′. The 4 cross-layer neighboring segments s(x′, y′, l−1) can be written as the set N^c_(x,y,l) = { s(x′, y′, l−1) : 2x ≤ x′ ≤ 2x+1, 2y ≤ y′ ≤ 2y+1 }.

The 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference segment s(x, y, l) on layer l and the neighboring segment set on layer l−1; c_l2 is an 8-dimensional vector:

where the entries are the predictor's positive and negative scores for the links between s(x, y, l) and each of its 4 neighboring segments, and the superscript c denotes a cross-layer link;

All the within-layer links and all the cross-layer links together constitute the link set N_s;
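The neighborhood definitions of step (1.2.6) can be enumerated directly. A sketch with illustrative names, returning grid coordinates of the 8 within-layer and 4 cross-layer neighbors:

```python
def within_layer_neighbors(x, y):
    # The 8 points (x', y') with x-1 <= x' <= x+1 and y-1 <= y' <= y+1,
    # excluding the reference point (x, y) itself.
    return [(xp, yp)
            for xp in (x - 1, x, x + 1)
            for yp in (y - 1, y, y + 1)
            if (xp, yp) != (x, y)]

def cross_layer_neighbors(x, y):
    # The 4 points on the previous (twice-as-large) feature map with
    # 2x <= x' <= 2x+1 and 2y <= y' <= 2y+1.
    return [(xp, yp)
            for xp in (2 * x, 2 * x + 1)
            for yp in (2 * y, 2 * y + 1)]
```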

(1.2.7) Take the segment labels, positive-segment ground-truth offsets and link labels obtained in steps (1.2.3) and (1.2.4) as the reference outputs, and the segment classes and scores predicted in step (1.2.5), the predicted segment offsets, and the link scores predicted in step (1.2.6) as the predicted outputs. Design an objective loss function between the predicted outputs and the reference outputs, and train the segment-and-link detection model iteratively by back-propagation to minimize the losses of segment classification, segment offset regression and link classification. The objective loss function designed for the segment-and-link detection model is a weighted sum of three losses:

L(y_s, c_s, y_l, c_l, s, ŝ) = (1/n_s)·L_conf(y_s, c_s) + λ_1·(1/n_s)·L_loc(s, ŝ) + λ_2·(1/n_l)·L_conf(y_l, c_l), where y_s are the labels of all segments, c_s are the predicted segment scores, y_l are the link labels, and c_l are the predicted link scores, composed of the within-layer link scores c_l1 and the cross-layer scores c_l2. If the i-th initial bounding box is labelled positive, then y_s(i) = 1, otherwise 0. L_conf(y_s, c_s) is the softmax loss of the predicted segment scores c_s, L_conf(y_l, c_l) is the softmax loss of the predicted link scores c_l, and L_loc(s, ŝ) is the smooth L1 regression loss between the predicted segment geometric parameters s and the ground-truth label ŝ. n_s is the number of positive initial bounding boxes, used to normalize the segment classification and regression losses; n_l is the total number of positive links, used to normalize the link classification loss; λ_1 and λ_2 are weight constants, set to 1 in practice.
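A sketch of this loss combination, with the smooth L1 term written out and the two softmax confidence losses passed in as precomputed scalars; names are illustrative, and λ_1 = λ_2 = 1 as in the text:

```python
def smooth_l1(pred, target):
    # Smooth L1 regression loss, summed over the 5 geometry offsets:
    # 0.5 * d^2 for |d| < 1, |d| - 0.5 otherwise.
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def total_loss(seg_conf_loss, loc_loss, link_conf_loss, n_s, n_l,
               lambda1=1.0, lambda2=1.0):
    # Weighted sum of the three losses: segment classification and offset
    # regression normalized by n_s, link classification normalized by n_l.
    return (seg_conf_loss / n_s
            + lambda1 * loc_loss / n_s
            + lambda2 * link_conf_loss / n_l)
```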

(1.2.8) During the training of step (1.2.7), apply online augmentation to the training data Itr and use an online hard negative mining strategy to balance positive and negative samples. Before the training images Itr are scaled to the same size and loaded in batches, they are randomly cropped into image patches, each having a minimum Jaccard overlap o with the ground-truth bounding boxes of the segments. For multi-oriented text, augmentation is performed on the minimum enclosing rectangle of the oriented text bounding box. The overlap o of each sample is chosen at random from 0, 0.1, 0.3, 0.5, 0.7 and 0.9, and the patch size is between 0.1 and 1 times the original image size; training images are not flipped horizontally. In addition, since negative segment and link samples make up the majority of the training samples, online hard negative mining is applied, separately for segments and for links, to keep the ratio of negative to positive samples at no more than 3:1.
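The hard negative mining strategy (keep all positives, keep only the hardest negatives up to a 3:1 ratio) can be sketched as follows, assuming each negative sample carries its predicted positive-class score as its hardness; all names are illustrative:

```python
def hard_negative_mining(pos_indices, neg_scores, ratio=3):
    # Keep all positive samples; keep at most ratio * len(positives) negative
    # samples, choosing the "hardest" ones, i.e. those the model currently
    # scores highest as positive.
    k = min(len(neg_scores), ratio * len(pos_indices))
    hardest = sorted(range(len(neg_scores)),
                     key=lambda i: neg_scores[i], reverse=True)[:k]
    return list(pos_indices), hardest
```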

(2) Detect segments and links in the text image to be detected using the trained convolutional neural network above, comprising the following sub-steps:

(2.1) Detect segments in the text image to be detected: feature maps output by different convolutional layers predict segments of different scales, and feature maps output by the same convolutional layer predict segments of the same scale. The i-th text image Itst_i in the test set Itst is scaled to a uniform size, which can be set manually according to the images to be detected; denote the scaled image by Itst_i′. Feed Itst_i′ into the segment-and-link detection model trained in step (1.2) to obtain the set of feature maps output by the last 6 convolutional layers, Itsto_i′ = [Itsto_i1′, ..., Itsto_i6′], where Itsto_il′ is the feature map output by the l-th of those layers, l = 1, ..., 6. At each coordinate (x, y) on each output feature map Itsto_il′, the 3×3 convolutional predictor predicts the scores c_s of the initial bounding box B_ilq corresponding to (x, y) being a positive or negative segment, together with 5 numbers as the geometric offsets when it is predicted to be a positive segment s+;

(2.2) Detect links among the segments detected on all feature layers of the text image, the links comprising within-layer links and cross-layer links: within-layer and cross-layer links are predicted on top of the segments predicted in (2.1). For within-layer links, at a coordinate (x, y) on a feature map Itsto_il′, the 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the within-layer links between s(x, y, l) and its 8 neighboring segments; for cross-layer links, the 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference segment s(x, y, l) on layer l and its 4 neighboring segments on layer l−1; c_l1 and c_l2 constitute the predicted link score c_l;

(2.3) Combine the detected segment confidence scores and link confidence scores, where a segment's confidence consists of its positive/negative class scores and its offset scores, and the convolutional predictors output softmax-normalized scores: within-layer and cross-layer links are predicted on top of the segments predicted in (2.1). For within-layer links, at a coordinate (x, y) on a feature map Itsto_il′, the 3×3 convolutional predictor predicts the positive and negative scores c_l1 of the within-layer links between s(x, y, l) and its 8 neighboring segments; for cross-layer links, the 3×3 convolutional predictor predicts the positive and negative scores c_l2 of the cross-layer links between the reference segment s(x, y, l) on layer l and its 4 neighboring segments on layer l−1; c_l1 and c_l2 constitute the predicted link score c_l.

(3) Combine the segments and links to obtain the output bounding boxes, comprising the following sub-steps:

(3.1) Using the normalized scores obtained in (2.3), filter the segments and links output by the convolutional predictors, then build a connection graph with the filtered segments as nodes and the filtered links as edges: the fixed number of segments s and links N_s produced by feeding the image to be detected into the segment detection model in step (2) are filtered by their scores, with separate thresholds α for segments and β for links. The thresholds can be set manually for different data; in practice α = 0.9 and β = 0.7 can be used for multi-oriented text detection, α = 0.9 and β = 0.5 for multilingual long-text detection, and α = 0.6 and β = 0.3 for horizontal text detection. The filtered segments s′ are taken as nodes and the filtered links N_s′ as edges, and a graph is built from them;
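The filtering and graph construction of step (3.1) can be sketched as follows, using the multi-oriented defaults α = 0.9 and β = 0.7; in this sketch (names are illustrative) a link becomes an edge only if its score passes β and both of its endpoint segments survive the segment filter:

```python
def filter_and_build_graph(seg_scores, links, link_scores,
                           alpha=0.9, beta=0.7):
    # seg_scores: per-segment positive-class scores; links: (u, v) index
    # pairs; link_scores: per-link positive-class scores.
    nodes = {i for i, sc in enumerate(seg_scores) if sc >= alpha}
    edges = [(u, v) for (u, v), sc in zip(links, link_scores)
             if sc >= beta and u in nodes and v in nodes]
    return nodes, edges
```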

(3.2) Perform a depth-first search on the graph to find connected components; each component is denoted as a set S containing the text fields joined by connections;
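The depth-first search of (3.2) is a standard connected-components pass; a minimal sketch (names and data layout are illustrative, not from the patent):

```python
def connected_components(nodes, edges):
    """Find connected components (the sets S) via iterative depth-first search.

    nodes: iterable of node ids (filtered text fields);
    edges: iterable of (u, v) pairs (filtered connections, undirected).
    """
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                 # explicit stack = DFS without recursion
            n = stack.pop()
            comp.append(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        components.append(comp)
    return components
```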

(3.3) The text field set S obtained by the depth-first search of step (3.2) is combined into a complete word through the following steps:

(3.3.1) Input: |S| is the number of text fields in set S; s^(i) is the i-th text field, with i a superscript; x_s^(i) and y_s^(i) are the center abscissa and ordinate of the i-th text field bounding box s^(i); w_s^(i) and h_s^(i) are its width and height; θ_s^(i) is the angle between s^(i) and the horizontal direction;

(3.3.2) θ_b := (1/|S|) Σ_i θ_s^(i), where θ_b is the offset angle of the output bounding box and θ_s^(i) is the offset angle of the i-th text field bounding box in the set; θ_b is the average offset angle of all text fields in S;

(3.3.3) Find the intercept b of the straight line y = tan(θ_b)·x + b that minimizes the sum of the distances from the center points (x_s^(i), y_s^(i)) of all text fields in S to the line;

(3.3.4) Find the two endpoints (x_p, y_p) and (x_q, y_q) of the line segment, where p denotes the first endpoint and q the second; x_p, y_p are the abscissa and ordinate of the first endpoint, and x_q, y_q those of the second;

(3.3.5) x_b := (x_p + x_q)/2, y_b := (y_p + y_q)/2, where b denotes the output bounding box and x_b, y_b are the abscissa and ordinate of its center;

(3.3.6) w_b := √((x_p − x_q)² + (y_p − y_q)²) + (w_p + w_q)/2, where w_b is the width of the output bounding box, and w_p and w_q are the widths of the bounding boxes centered at points p and q, respectively;

(3.3.7) h_b := (1/|S|) Σ_i h_s^(i), where h_b is the height of the output bounding box and h_s^(i) is the height of the i-th text field bounding box in the set; h_b is the average height of all text fields in S;

(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b): the output bounding box b is represented by its coordinate, size, and angle parameters;

(3.3.9) Output the combined bounding box b.
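Steps (3.3.1)-(3.3.9) can be sketched as one function. This is an illustrative sketch, not the patent's implementation; in particular, locating the endpoints by projecting the segment centers onto the fitted line and taking the extremes is an assumption about how (3.3.4) is realized:

```python
import math

def combine_segments(segments):
    """Combine one connected component of text fields into a word bounding box.

    Each segment is a dict with keys x, y (center), w, h (size), theta (angle).
    Returns (x_b, y_b, w_b, h_b, theta_b) as in step (3.3.8).
    """
    n = len(segments)
    # (3.3.2) average offset angle over the set S
    theta_b = sum(s["theta"] for s in segments) / n
    # (3.3.3) intercept of y = tan(theta_b)*x + b minimizing the summed
    # vertical offsets of the centers (least-squares solution)
    t = math.tan(theta_b)
    b = sum(s["y"] - t * s["x"] for s in segments) / n

    def project(s):
        # foot of the perpendicular from the center (x, y) onto the line
        x0 = (s["x"] + t * (s["y"] - b)) / (1 + t * t)
        return (x0, t * x0 + b)

    # (3.3.4) endpoints: extreme projections along the line
    pts = sorted((project(s), s["w"]) for s in segments)
    (xp, yp), wp = pts[0]
    (xq, yq), wq = pts[-1]
    # (3.3.5) box center is the midpoint of the endpoints
    xb, yb = (xp + xq) / 2.0, (yp + yq) / 2.0
    # (3.3.6) width: endpoint distance plus half the two end widths
    wb = math.hypot(xq - xp, yq - yp) + (wp + wq) / 2.0
    # (3.3.7) height: average segment height
    hb = sum(s["h"] for s in segments) / n
    # (3.3.8)-(3.3.9) output the combined box
    return (xb, yb, wb, hb, theta_b)
```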

Claims (7)

1. A method for detecting multi-oriented text in natural images based on linked text fields, characterized in that the method comprises the following steps:
(1) Train the text field connection detection network model, comprising the sub-steps:
(1.1) Annotate the text content of all text images in the training image set at the word (term) level; each label is the four corner coordinates of the word's rectangular initial bounding box, yielding the training data set;
(1.2) Define a text field connection detection network model that predicts output text fields and connections from the word labels; the model consists of a cascaded convolutional neural network and convolutional predictors; compute the text field and connection labels from the training data set, design a loss function, and, combining online augmentation with online hard negative mining, train the network by back-propagation to obtain the text field connection detection network model;
(2) Use the trained text field connection detection network model to detect text fields and connections on the text image to be detected, comprising the sub-steps:
(2.1) Detect text fields on the image: feature maps output by different convolutional layers predict text fields of different scales, and feature maps output by the same convolutional layer predict text fields of the same scale;
(2.2) Detect connections among the text fields found on all feature layers, the connections comprising within-layer connections and cross-layer connections;
(2.3) Combine the detected text field confidence scores and connection confidence scores, where a text field confidence score comprises the field's positive/negative class score and offset score, and output softmax-normalized scores with the convolutional predictor;
(3) Combine text fields and connections to obtain the output bounding boxes, comprising the sub-steps:
(3.1) Filter the text fields and connections output by the convolutional predictor according to the normalized scores obtained in (2.3), and construct a connection graph with the filtered text fields as nodes and the filtered connections as edges;
(3.2) Perform a depth-first search on the graph to find connected components; each component is denoted as a set S containing the text fields joined by connections;
(3.3) Combine the text fields in a set into a complete word, compute the complete word's bounding box, and output it.
2. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1, characterized in that step (1.2) is specifically:
(1.2.1) Construct the text field detection convolutional neural network model: the feature-extracting convolutional units of the first layers come from a pre-trained VGG-16 network (convolutional layer 1 through pooling layer 5); fully connected layers 6 and 7 are converted into convolutional layers 6 and 7; after them, additional convolutional layers 8, 9, and 10 and a final convolutional layer 11 are appended to extract deeper features for detection; the last 6 of these convolutional layers output feature maps of different sizes, so that high-quality features at multiple scales can be extracted, and text fields and connections are detected on these six feature maps; after each of the 6 layers, a filter of size 3×3 is added as a convolutional predictor to jointly detect text fields and connections;
(1.2.2) Generate text field bounding box labels from the annotated word bounding boxes: for the original training image set Itr, denote the scaled training image set Itr', with width w_I and height h_I; taking the i-th image Itr_i' as model input, all word bounding boxes annotated on Itr_i' are written W_i = [W_i1, ..., W_ip], where W_ij is the j-th word bounding box (word level or term level) on the i-th image, j = 1, ..., p, and p is the total number of word bounding boxes on Itr_i'; the feature maps output by the last 6 convolutional layers form the set Itro_i' = [Itro_i1', ..., Itro_i6'], where Itro_il' is the feature map output by layer l of the last 6 convolutional layers, with width w_l and height h_l; a coordinate (x, y) on Itro_il' corresponds on Itr_i' to a horizontal initial bounding box B_ilq centered at (x_a, y_a), satisfying:
x_a = (w_I / w_l)(x + 0.5), y_a = (h_I / h_l)(y + 0.5)
The width and height of the initial bounding box B_ilq are both set to a constant a_l that controls the scale of the output text fields, l = 1, ..., 6; the set of initial bounding boxes corresponding to the layer-l feature map Itro_il' is B_il = [B_il1, ..., B_ilm], q = 1, ..., m, where m is the number of initial bounding boxes on that feature map; if the center of an initial bounding box B_ilq lies inside any annotated word bounding box W_ij on Itr', and the size a_l of B_ilq and the height h of that word bounding box satisfy max(a_l/h, h/a_l) ≤ 1.5, then B_ilq is marked positive with label 1 and matched to the word bounding box W_ij whose height is closest; otherwise, when B_ilq fails these two conditions for all word bounding boxes W_i, it is marked negative with label 0; text fields are generated on the initial bounding boxes and share the same label class;
(1.2.3) Generate text fields on the labeled initial bounding boxes produced in step (1.2.2) and compute the positive text field offsets: a negative text field bounding box s− is its negative initial bounding box B−; a positive text field bounding box s+ is obtained from its positive initial bounding box B+ as follows: a) let θ_s be the angle between the matched annotated word bounding box W and the horizontal direction, and rotate W clockwise by θ_s around the center of B+; b) crop W, removing the parts beyond the left and right sides of B+; c) rotate the cropped word bounding box W' counterclockwise by θ_s around the center of B+, obtaining the ground-truth geometric parameters x_s, y_s, w_s, h_s, θ_s of the text field s+; d) compute the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) of s+ relative to B+ from:
x_s = a_l·Δx_s + x_a
y_s = a_l·Δy_s + y_a
w_s = a_l·exp(Δw_s)
h_s = a_l·exp(Δh_s)
θ_s = Δθ_s
where x_s, y_s, w_s, h_s, θ_s are the center abscissa, center ordinate, width, height, and angle to the horizontal of the text field bounding box s+; x_a, y_a, w_a, h_a are the center abscissa, center ordinate, width, and height of the horizontal initial bounding box B+; and Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s are, respectively, the offsets of the center abscissa x_s and ordinate y_s of s+ relative to B+, the change in width w_s, the change in height h_s, and the offset of the angle θ_s;
(1.2.4) Compute connection labels for the text field bounding boxes produced in step (1.2.3): text fields s are generated on the initial bounding boxes B, so the connection labels between text fields equal the connection labels between their corresponding initial bounding boxes; for the feature map set Itro_i' = [Itro_i1', ..., Itro_i6'], if two initial bounding boxes in the set B_il of the same feature map Itro_il' are both labeled positive and matched to the same word, the within-layer connection between them is labeled positive, otherwise negative; if an initial bounding box in B_il (corresponding to Itro_il') and an initial bounding box in B_i(l−1) (corresponding to Itro_i(l−1)') are both labeled positive and matched to the same word bounding box W_ij, the cross-layer connection between them is labeled positive, otherwise negative;
(1.2.5) Take the scaled training image set Itr' as input to the text field detection model and predict the text field outputs s: initialize the model weights and biases; the learning rate is set to 10^−3 for the first 60,000 training iterations and then decays to 10^−4; for the last 6 convolutional layers, at coordinate (x, y) on the layer-l feature map Itro_il', where (x, y) corresponds on the input image Itr_i' to the initial bounding box B_ilq centered at (x_a, y_a) with size a_l, the 3×3 convolutional predictor predicts the scores c_s with which B_ilq is assigned to the positive and negative classes, where c_s is a two-dimensional vector of values between 0 and 1; it also predicts 5 numbers (Δx̂_s, Δŷ_s, Δŵ_s, Δĥ_s, Δθ̂_s) as the geometric offsets when B_ilq is assigned to the positive text field s+, namely the predicted offsets of the center abscissa and ordinate of s+ relative to the positive initial bounding box B+, and the changes in height, width, and angle;
(1.2.6) Predict the within-layer and cross-layer connection outputs on top of the predicted text fields: for within-layer connections, at coordinate (x, y) on a feature map Itro_il', take the neighboring points (x', y') in the range x−1 ≤ x' ≤ x+1, y−1 ≤ y' ≤ y+1; mapping these 8 points back to the input image Itr_i' gives the within-layer neighboring text fields s(x', y', l) connected to the base text field s(x, y, l); the 8 within-layer neighbors form a set, and the 3×3 convolutional predictor predicts the positive/negative scores c_l1 of the connections between s(x, y, l) and this neighbor set, where c_l1 is a 16-dimensional vector and the superscript w denotes a within-layer connection;
for cross-layer connections, a cross-layer connection joins the text fields at corresponding points on the feature maps output by two consecutive convolutional layers; since each convolutional layer halves the width and height of the feature map, the width w_l and height h_l of the layer-l feature map Itro_il' are half the width w_l−1 and height h_l−1 of the layer-(l−1) feature map Itro_i(l−1)', while the initial bounding box scale a_l of Itro_il' is twice the scale a_l−1 of Itro_i(l−1)'; for (x, y) on the layer-l output feature map Itro_il', take the 4 cross-layer neighboring points (x', y') on Itro_i(l−1)' in the range 2x ≤ x' ≤ 2x+1, 2y ≤ y' ≤ 2y+1; the initial bounding box that (x, y) maps to on Itr_i' exactly coincides spatially with the 4 initial bounding boxes that these 4 points map to, and the 4 cross-layer neighboring text fields s(x', y', l−1) form a set; the 3×3 convolutional predictor predicts the positive/negative scores c_l2 of the cross-layer connections between the layer-l base text field s(x, y, l) and this neighbor set, where c_l2 is an 8-dimensional vector containing the positive and negative scores of the connections between s(x, y, l) and all 4 of its neighbors, and the superscript c denotes a cross-layer connection; all within-layer connections and all cross-layer connections form the connection set N_s;
(1.2.7) With the text field labels, positive text field ground-truth offsets, and connection labels from steps (1.2.3) and (1.2.4) as the output reference, and the predicted text field classes and scores, predicted text field offsets from step (1.2.5), and predicted connection scores from step (1.2.6) as the predicted output, design the objective loss function between the predicted output and the output reference, and train the text field connection detection model continuously by back-propagation to minimize the losses of text field classification, text field offset regression, and connection classification; the objective loss function is the weighted sum of three losses:
L(y_s, c_s, y_l, c_l, ŝ, s) = (1/n_s)·L_conf(y_s, c_s) + λ1·(1/n_s)·L_loc(ŝ, s) + λ2·(1/n_l)·L_conf(y_l, c_l)
where y_s is the label of all text fields, c_s is the predicted text field score, y_l is the connection label, and c_l is the predicted connection score, composed of the within-layer score c_l1 and the cross-layer score c_l2; if the i-th initial bounding box is marked positive, then y_s(i) = 1, otherwise 0; L_conf(y_s, c_s) is the softmax loss of the predicted text field score c_s, L_conf(y_l, c_l) is the softmax loss of the predicted connection score c_l, and L_loc(ŝ, s) is the smooth-L1 regression loss between the predicted text field geometric parameters ŝ and the ground truth s; n_s is the number of positive initial bounding boxes, used to normalize the text field classification and regression losses; n_l is the total number of positive connections, used to normalize the connection classification loss; λ1 and λ2 are weight constants;
(1.2.8) During the training of step (1.2.7), augment the training data Itr online and use an online hard negative mining strategy to balance positive and negative samples: before the training images Itr are scaled to the same size and loaded in batches, they are randomly cropped into image patches such that each patch has a minimum Jaccard overlap coefficient o with the ground-truth bounding boxes of the text fields; for multi-oriented text, augmentation is performed on the minimum enclosing rectangle of the multi-oriented text bounding box; the overlap coefficient o of each sample is chosen at random from 0, 0.1, 0.3, 0.5, 0.7, and 0.9, and the patch size is between 0.1 and 1 times the original image size; training images are not flipped horizontally; furthermore, since negative text field and connection samples make up most of the training samples, online hard negative mining is applied separately to text fields and connections, keeping the ratio of negative to positive samples at most 3:1.
3. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (2.1) is specifically:
Scale the i-th text image Itst_i of the image set Itst to be detected to a uniform size, which may be set manually according to the images, and denote the scaled image Itst_i'; input Itst_i' into the text field connection detection model trained in step (1.2) to obtain the set Itsto_i' = [Itsto_i1', ..., Itsto_i6'] of feature maps output by the last 6 convolutional layers, where Itsto_il' is the feature map output by layer l, l = 1, ..., 6; at every coordinate (x, y) on each output feature map Itsto_il', the 3×3 convolutional predictor predicts the scores c_s with which the initial bounding box B_ilq corresponding to (x, y) is classified as a positive or negative text field, together with 5 numbers (Δx̂_s, Δŷ_s, Δŵ_s, Δĥ_s, Δθ̂_s) as the geometric offsets when it is predicted as a positive text field s+.
4. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (2.2) is specifically:
Predict within-layer and cross-layer connections on top of the text fields predicted in (2.1): for within-layer connections, at coordinate (x, y) on the same feature map Itsto_il', the 3×3 convolutional predictor predicts the positive/negative scores c_l1 of the within-layer connections between s(x, y, l) and its 8 neighboring text fields; for cross-layer connections, it predicts the positive/negative scores c_l2 of the cross-layer connections between the layer-l base text field s(x, y, l) and the 4 neighboring text fields on layer l−1; c_l1 and c_l2 form the predicted connection score c_l.
5. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (2.3) is specifically:
Following the results of steps (2.1) and (2.2), at each coordinate (x, y) on every feature map Itsto_il', concatenate the predicted text field score c_s, the text field offsets, the within-layer connection score c_l1, and the cross-layer connection score c_l2 into a 33-dimensional vector, and append an extra softmax layer after the output channels of the convolutional predictor to normalize the text field scores and connection scores separately.
6. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (3.1) is specifically:
Filter the fixed number of text fields s and connections N_s produced by feeding the text image to be detected into the text field detection model in step (2) by their scores, with separate filtering thresholds α for text fields and β for connections; take the filtered text fields s' as nodes and the filtered connections N_s' as edges, and build a graph from them.
7. The method for detecting multi-oriented text in natural images based on linked text fields according to claim 1 or 2, characterized in that step (3.3) is specifically: the text field set S obtained by the depth-first search of step (3.2) is combined into a complete word through the following steps:
(3.3.1) Input: |S| is the number of text fields in set S; s^(i) is the i-th text field, with i a superscript; x_s^(i) and y_s^(i) are the center abscissa and ordinate of the i-th text field bounding box s^(i); w_s^(i) and h_s^(i) are its width and height; θ_s^(i) is the angle between s^(i) and the horizontal direction;
(3.3.2) θ_b := (1/|S|) Σ_i θ_s^(i), where θ_b, the offset angle of the output bounding box, is the average offset angle of all text fields in S, and θ_s^(i) is the offset angle of the i-th text field bounding box in the set;
(3.3.3) Find the intercept b of the straight line y = tan(θ_b)·x + b that minimizes the sum of the distances from the center points (x_s^(i), y_s^(i)) of all text fields in S to the line;
(3.3.4) Find the two endpoints (x_p, y_p) and (x_q, y_q) of the line segment, where p denotes the first endpoint and q the second; x_p, y_p are the abscissa and ordinate of the first endpoint, and x_q, y_q those of the second;
(3.3.5) x_b := (x_p + x_q)/2, y_b := (y_p + y_q)/2, where b denotes the output bounding box and x_b, y_b are the abscissa and ordinate of its center;
(3.3.6) w_b := √((x_p − x_q)² + (y_p − y_q)²) + (w_p + w_q)/2, where w_b is the width of the output bounding box, and w_p and w_q are the widths of the bounding boxes centered at points p and q, respectively;
(3.3.7) h_b := (1/|S|) Σ_i h_s^(i), where h_b, the height of the output bounding box, is the average height of all text fields in S, and h_s^(i) is the height of the i-th text field bounding box in the set;
(3.3.8) b := (x_b, y_b, w_b, h_b, θ_b): the output bounding box b is represented by its coordinate, size, and angle parameters;
(3.3.9) Output the combined bounding box b.
CN201710010596.7A 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields Active CN106897732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710010596.7A CN106897732B (en) 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710010596.7A CN106897732B (en) 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Publications (2)

Publication Number Publication Date
CN106897732A CN106897732A (en) 2017-06-27
CN106897732B true CN106897732B (en) 2019-10-08

Family

ID=59197865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710010596.7A Active CN106897732B (en) 2017-01-06 2017-01-06 A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields

Country Status (1)

Country Link
CN (1) CN106897732B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797774B2 (en) * 2019-07-16 2023-10-24 Ancestry.Com Operations Inc. Extraction of genealogy data from obituaries

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment
CN107766860A (en) * 2017-10-31 2018-03-06 武汉大学 Natural scene image Method for text detection based on concatenated convolutional neutral net
CN107977620B (en) * 2017-11-29 2020-05-19 华中科技大学 Multi-direction scene text single detection method based on full convolution network
CN107844785B (en) * 2017-12-08 2019-09-24 浙江捷尚视觉科技股份有限公司 A kind of method for detecting human face based on size estimation
CN108304835B (en) * 2018-01-30 2019-12-06 百度在线网络技术(北京)有限公司 character detection method and device
CN108427924B (en) * 2018-03-09 2020-06-23 华中科技大学 A Text Regression Detection Method Based on Rotation Sensitive Features
CN108549893B (en) * 2018-04-04 2020-03-31 华中科技大学 An End-to-End Recognition Method for Scene Texts of Arbitrary Shapes
CN109086663B (en) * 2018-06-27 2021-11-05 大连理工大学 Scale-adaptive natural scene text detection method based on convolutional neural network
CN109583367A (en) * 2018-11-28 2019-04-05 网易(杭州)网络有限公司 Image text line detection method and device, storage medium, and electronic device
CN109685718B (en) * 2018-12-17 2020-11-10 中国科学院自动化研究所 Image squaring and scaling method, system and device
CN109886286B (en) * 2019-01-03 2021-07-23 武汉精测电子集团股份有限公司 Target detection method, target detection model and system based on cascade detectors
CN109886264A (en) * 2019-01-08 2019-06-14 深圳禾思众成科技有限公司 A text detection method, device, and computer-readable storage medium
CN109977997B (en) * 2019-02-13 2021-02-02 中国科学院自动化研究所 Fast and robust image object detection and segmentation method based on convolutional neural networks
CN110032969B (en) * 2019-04-11 2021-11-05 北京百度网讯科技有限公司 Method, apparatus, device, and medium for detecting text region in image
CN110490232B (en) * 2019-07-18 2021-08-13 北京捷通华声科技股份有限公司 Method, device, equipment and medium for training a text line direction prediction model
CN113065544B (en) * 2020-01-02 2024-05-10 阿里巴巴集团控股有限公司 Character recognition method, device, and electronic equipment
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111291759A (en) * 2020-01-17 2020-06-16 北京三快在线科技有限公司 Character detection method and device, electronic equipment and storage medium
CN111444674B (en) * 2020-03-09 2022-07-01 稿定(厦门)科技有限公司 Character deformation method, medium and computer equipment
CN113515920B (en) * 2020-04-09 2024-06-21 北京庖丁科技有限公司 Method, electronic device and computer readable medium for extracting formulas from tables
CN111967463A (en) * 2020-06-23 2020-11-20 南昌大学 Curve-fitting detection method for curved text in natural scenes
CN111914822B (en) * 2020-07-23 2023-11-17 腾讯科技(深圳)有限公司 Text image labeling method, device, computer readable storage medium and equipment
CN113888759A (en) * 2021-10-13 2022-01-04 广东金赋科技股份有限公司 Key-value pair extraction method and system based on deep learning model
CN115620081B (en) * 2022-09-27 2023-07-07 北京百度网讯科技有限公司 Training method of target detection model and target detection method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050471B (en) * 2014-05-27 2017-02-01 华中科技大学 Natural scene character detection method and system
CN105989330A (en) * 2015-02-03 2016-10-05 阿里巴巴集团控股有限公司 Picture detection method and apparatus
CN106156711B (en) * 2015-04-21 2020-06-30 华中科技大学 Text line positioning method and device
CN105184312B (en) * 2015-08-24 2018-09-25 中国科学院自动化研究所 A text detection method and device based on deep learning
CN105469047B (en) * 2015-11-23 2019-02-22 上海交通大学 Chinese text detection method and system based on unsupervised deep learning networks
CN105608456B (en) * 2015-12-22 2017-07-18 华中科技大学 A multi-oriented text detection method based on fully convolutional networks
CN105574513B (en) * 2015-12-22 2017-11-24 北京旷视科技有限公司 Text detection method and device

Also Published As

Publication number Publication date
CN106897732A (en) 2017-06-27

Similar Documents

Publication Publication Date Title
CN106897732B (en) A Multi-Oriented Text Detection Method in Natural Images Based on Linked Text Fields
Lyu et al. Multi-oriented scene text detection via corner localization and region segmentation
CN107977620B (en) Multi-oriented scene text single-shot detection method based on fully convolutional networks
Yuliang et al. Detecting curve text in the wild: New dataset and new solution
Ma et al. Rpt: Learning point set representation for siamese visual tracking
Barroso-Laguna et al. Key.Net: Keypoint detection by handcrafted and learned CNN filters revisited
CN106650725B (en) Candidate text box generation and text detection method based on fully convolutional neural network
Donati et al. Deep orientation-aware functional maps: Tackling symmetry issues in shape matching
CN108427924A (en) A text regression detection method based on rotation-sensitive features
CN111079739B (en) Multi-scale attention feature detection method
Xia et al. Loop closure detection for visual SLAM using PCANet features
Yu et al. Robust thermal infrared object tracking with continuous correlation filters and adaptive feature fusion
Zhao et al. Adversarial deep tracking
Han et al. Research on remote sensing image target recognition based on deep convolution neural network
CN113011359B (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
Wang et al. Adaptive temporal feature modeling for visual tracking via cross-channel learning
He et al. Aggregating local context for accurate scene text detection
Sun et al. Pseudo-LiDAR-based road detection
Wang et al. A robust approach for scene text detection and tracking in video
Gu et al. Linear time offline tracking and lower envelope algorithms
Dai et al. RGB-D SLAM with moving object tracking in dynamic environments
Pu et al. Learning temporal regularized correlation filter tracker with spatial reliable constraint
Geng et al. SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation
Li et al. Learning spatial self-attention information for visual tracking
Li et al. Centroid-based graph matching networks for planar object tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant