CN113065404A - Method and system for detecting train ticket content based on equal-width character segments - Google Patents
- Publication number
- CN113065404A (application number CN202110249213.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- module
- region
- output
- train ticket
- Prior art date
- Legal status: Granted
Classifications
- G06V30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/044: Recurrent networks, e.g. Hopfield networks
- G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
- G06V30/10: Character recognition
Abstract
The invention belongs to the field of bill text detection, and particularly relates to a content detection system and method based on equal-width character segments. The system comprises: a feature extraction module for reading an independent picture and outputting a feature map of the independent picture; a first prediction module for reading the feature map and outputting first candidate region information; a first output module for reading the first candidate region information and outputting a first text region; a second prediction module for reading the feature map and outputting second candidate region information; and a second output module for reading the first text region and outputting a second text region after the first text region is corrected using the second candidate region information. In the method, independent pictures with text center kernel labels form a training set used to train the deep learning neural networks called by the feature extraction module, the first prediction module and the second prediction module; in the application stage, only the first text region obtained by the first prediction module is used as the system output.
Description
Technical Field
The invention belongs to the field of bill text detection, and particularly relates to a content detection method and system based on equal-width character fragments.
Background
The entry of travel expense reimbursement certificates (Travel Expenses Reimbursement Documents) such as train tickets is an important component of financial information processing systems. A single ticket carries several items of content information in character form; taking the train ticket as an example, it contains at least the departure station, travel date, amount and other content information required by the financial information processing system. At present this information is mainly filled in manually by the applicant, which is very time-consuming and labor-intensive.
Existing deep learning technology can realize automatic extraction of this information and greatly reduce labor cost. However, when applied to a specific scenario such as train ticket reimbursement, it faces particular difficulties. Automatic train ticket information extraction requires two stages, detection and recognition. The detection stage, as the foundation of the whole process, faces the most problems. First, current train tickets are mainly printed in ink, so ink bleed-through, unclear fonts and inclined content may occur. Second, improper storage of the tickets by users may cause wrinkles that interfere with detection. Third, the imaging quality of the scanning device under real conditions and the illumination during scanning can make the uploaded image unclear, which further increases the difficulty of detection.
For the detection of text content information, existing deep-learning-based methods mainly fall into two types: segmentation-based and regression-based. Segmentation-based methods can handle arbitrary shapes, but require complex post-processing for clustering and are therefore inefficient. Regression-based methods can directly obtain the region information of the text and are fast, but they cannot obtain an accurate boundary; in the train ticket detection scenario, missed detections and incorrect boundaries greatly affect subsequent processing.
Disclosure of Invention
The invention aims to provide a method and system for detecting train ticket content based on equal-width character segments, applying deep learning technology to the paperless financial processing flow of train ticket reimbursement to realize automatic extraction of content information. It is particularly suited to text detection in scenes where the boundaries between the various items of content information on tickets such as train tickets are fuzzy while the character distribution shapes are relatively regular.
The invention provides a train ticket content detection system based on equal-width character segments, which comprises:
a feature extraction module for reading an independent picture and outputting a feature map of the independent picture;
a first prediction module for reading the feature map and outputting first candidate region information;
a first output module for reading the first candidate region information and outputting a first text region;
a second prediction module for reading the feature map and outputting second candidate region information; and,
a second output module for reading the first text region and outputting a second text region after the first text region is corrected using the second candidate region information;
wherein,
the feature map is output by the last parameterized convolutional layer of VGGNet; the first candidate region information is output through a fully connected layer after the feature map has been processed by a BiLSTM; the first candidate region information includes coordinate information and score information extracted from the feature map according to the regression idea;
the second candidate region information comprises a text center kernel extracted from the feature map by multilayer convolution based on the segmentation idea; the second prediction module is configured with a text center kernel prediction branch network, through which it detects the feature map output by the feature extraction module to obtain the text center kernel, extracts each first text region, and then corrects the boundary of the first text region to obtain the second text region.
Those skilled in the art understand that, in detecting the text content of a picture, feature extraction based on the segmentation idea means that the objective function of the network model is optimized against the predicted probability that each pixel of the picture belongs to text, so that pixels can be clustered and connected into regions; extraction based on the regression idea means that the objective function is optimized directly against the boundary of the text region used to crop the picture.
It will be readily appreciated that the various system embodiments provided by the first aspect of the present invention comprise at least three main components: the first part is a feature extraction module based on VGGNet, which obtains a feature map of an independent picture; the second part is a first prediction module employing a BiLSTM, which judges text/non-text segments and segment boundary information from the feature map based on the regression idea; and the third part is a second prediction module comprising a text center kernel detection branch network, which corrects text boundaries by segmenting the feature map based on the segmentation idea. In one embodiment, the feature extraction module uses VGG-16 as a backbone network, with the last pooling layer and the fully connected layers removed, to obtain the spatial features of the input picture; the first prediction module feeds the feature map extracted by the backbone network into a bidirectional LSTM network to obtain sequence features, which are then output through a fully connected layer so as to classify foreground and background in the feature map and to predict the initial coordinates and height of each candidate region; the multilayer convolutional layers of the second prediction module determine a central region of the text, which after expansion yields a complete text region to assist in boundary judgment. The second prediction module and the second output module participate in parameter iteration only when the system trains its deep learning network; in the detection stage they are cut out of the working data stream, and the first output module outputs directly to the external recognition system.
In a preferred embodiment, the second output module is configured with a merging unit implemented by the Vatti algorithm and a merging algorithm. The second output module expands, through the merging unit, the text center kernel region contained in the second candidate region information output by the second prediction module to obtain the complete text region of the text center kernel, and then computes the intersection of that text region with the first text region generated by the first output module as the final second text region.
In a preferred embodiment, the train ticket content detection system comprises a preprocessing module, wherein the preprocessing module is used for preprocessing an original picture so as to generate the independent picture; the preprocessing comprises binarization processing and angle correction processing.
In a preferred embodiment, the train ticket content detection system comprises a label making module, which calculates the central area of the corresponding text from the four vertices of the text region according to a configured Vatti formula, and then uses the contour information of the central area to obtain the center kernel coordinate points of the text region.
The second aspect of the invention provides a train ticket content detection method based on equal-width character segments, which comprises the following steps:
step 100, preprocessing a scanning picture for creating a sample to obtain an independent picture with a fixed size;
200, using independent pictures with text core labels to form a training set to train the deep learning neural network called by the feature extraction module, the first prediction module and the second prediction module of the train ticket text detection system in the first aspect;
step 300, obtaining a fixed-size independent picture for text detection from the scanned picture to be detected and recognized;
step 400, obtaining a first text area of each text content information in the independent picture in step 300 by using the train ticket text detection system; the first text area is used for character recognition of an external system.
In some preferred embodiments, the picture of a single train ticket is uploaded to the system by a scanning or photographing device. When the train ticket is not placed horizontally, the extraction of text segments is greatly affected, so the picture must first be corrected. These embodiments use a contour-based correction method, which can be further divided into the following steps: first, the uploaded picture is converted to grayscale, recording the brightness of each pixel in one byte; second, the image is binarized with a set fixed threshold to obtain a black-and-white image; third, the contour of the whole picture is found with an external visual recognition library such as OpenCV, its bounding rectangle is obtained, and the corresponding angle is calculated; fourth, the image is rotation-corrected with that angle; fifth, the region inside the contour of the rotated image is cropped out as an independent picture and input to the neural network for subsequent processing. Preferably, step 100 in these embodiments comprises the following steps (a code sketch follows the list):
101, performing graying processing on an uploaded original scanning picture, and recording the brightness information of each pixel point by using one byte;
102, utilizing a fixed threshold value set by a preprocessing module to carry out binarization on the image brightness information obtained in the last step to obtain a black-and-white image;
103, finding the contour of the whole picture, obtaining its bounding rectangle, and calculating the angle relative to the reference line;
104, calculating a rotation angle by using the angle of the previous step, and performing rotation correction on the image;
and 105, intercepting the area in the outline of the rotated image into an independent picture.
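As a concrete illustration, the following is a minimal sketch of steps 101 to 105, assuming OpenCV; the function name, the fixed threshold value and the crop strategy are illustrative assumptions rather than the patent's exact implementation.

```python
import cv2

def deskew_ticket(path, thresh=128):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)                 # step 101: one byte per pixel
    _, bw = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)  # step 102: fixed threshold
    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ticket = max(contours, key=cv2.contourArea)                  # step 103: ticket outline
    (cx, cy), (w, h), angle = cv2.minAreaRect(ticket)            # bounding rectangle + angle
    if w < h:                                                    # normalize to the reference line
        angle -= 90
    M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)            # step 104: rotation correction
    rotated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
    x, y, rw, rh = cv2.boundingRect(ticket)                      # step 105: approximate crop of
    return rotated[y:y + rh, x:x + rw]                           # the region inside the contour
```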
It is easily understood that these pre-processing procedures for pictures can also be used for processing the original picture to be detected, which carries text content information, when step 300 is implemented.
In some preferred embodiments, the deep learning neural network of the train ticket text detection system in the step 200 is trained by the following steps:
step 201, constructing all neural network structures and configuring optimization objective functions of the train ticket text detection system;
step 202, creating a sample set, and labeling a text center core area in a training sample through a label making module;
step 203, training a feature extraction module, a first prediction module, a first output module, a second prediction module and a second output module of the train ticket text detection system;
and 204, selecting a training result to store the deep learning neural network parameters.
In a preferred embodiment, the preprocessed train ticket picture first undergoes feature extraction through a VGG-16 convolutional neural network, from which the last fully connected layers and the last pooling layer are removed so that only the first five blocks are used. A feature map of size C × H × W is obtained, where C, H and W represent the channels, height and width of the image respectively. A 3 × 3 window is then taken over the feature map, i.e., feature extraction is performed on each point together with its surrounding 3 × 3 area to obtain a feature vector of length 3 × 3 × C, yielding a feature map of size 9C × H × W. The shape of the feature map is then adjusted to H × W × 9C.
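The following is a minimal sketch of this feature extraction stage, assuming PyTorch and torchvision; the input size is only an example.

```python
import torch
import torch.nn.functional as F
import torchvision

# VGG-16 trunk: keep the convolutional blocks, drop the final pooling layer
vgg = torchvision.models.vgg16(weights=None).features[:-1]

x = torch.randn(1, 3, 450, 680)      # one preprocessed ticket picture
fmap = vgg(x)                        # (N, C, H, W) with C = 512
n, c, h, w = fmap.shape
# gather each point together with its 3x3 neighborhood -> 9C values per position
windows = F.unfold(fmap, kernel_size=3, padding=1)          # (N, 9C, H*W)
windows = windows.permute(0, 2, 1).reshape(n, h, w, 9 * c)  # (N, H, W, 9C)
```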
The features extracted by VGG-16 are then fed into an LSTM network, where the maximum time length is the width of the feature map. LSTM networks handle serialized objects well, and to strengthen the network's ability to learn the sequence features of each row, the method adopts a bidirectional LSTM network. The feature map, after passing through the LSTM network, contains both spatial and sequence features. After a 512-dimensional fully connected layer, the output features of the bidirectional LSTM network yield the text candidate box predictions and the text/non-text predictions respectively.
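The following is a minimal sketch of this prediction stage, assuming PyTorch; the hidden size of 128 per direction (giving the 256-dimensional sequence feature mentioned in the detailed description) and the ten candidates per window are taken from that description, while the class and layer names are illustrative.

```python
import torch
import torch.nn as nn

class SegmentPredictor(nn.Module):
    def __init__(self, in_dim=9 * 512, k=10):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, 128, bidirectional=True, batch_first=True)  # 2*128 = 256
        self.fc = nn.Linear(256, 512)        # the 512-dimensional fully connected layer
        self.coord = nn.Linear(512, 2 * k)   # center-y and height for k candidate segments
        self.score = nn.Linear(512, 2 * k)   # text / non-text probability per candidate

    def forward(self, feat):                 # feat: (N, H, W, 9C); one sequence per row
        n, h, w, c = feat.shape
        seq, _ = self.rnn(feat.reshape(n * h, w, c))  # max time length = feature-map width
        out = torch.relu(self.fc(seq))
        return self.coord(out), self.score(out)
```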
The generated text candidate boxes include regions with low response values and regions that overlap each other, so the first prediction module is configured with a filtering unit that first filters the candidate boxes with the NMS algorithm at a set threshold. The text/non-text probability values produced by the BiLSTM serve as the scores of the corresponding regions, and regions scoring below the threshold are filtered out. The remaining candidate boxes are segments of complete texts and therefore need to be connected to form the detected text regions. In a further preferred embodiment, the text segments are connected by a bidirectional detection method. First, the text segments are sorted in the horizontal direction and the distance from each segment to the others is calculated from left to right; when the distance is smaller than a set threshold, the segments are considered to belong to the same text. The spacing between segments is then calculated again from right to left and compared with the threshold in the same way. In the vertical direction, the overlapping area of the text segments is calculated, and when the degree of overlap exceeds an overlap threshold they are attributed to the same text. Finally, the midpoints of the upper and lower boundaries of all text segments belonging to the same text are taken to form the complete text boundary.
In addition, to make the text segments learned by the network more accurate, a text center kernel prediction branch network is introduced during training. This branch is based on the segmentation idea and consists of three convolutional layers. After the features extracted by VGG-16 are input, the central kernel area of the text is computed and then expanded to the size of the corresponding text region to form a second text region; the text boundaries of the second and first text regions are compared in the second output module, and the union region is finally obtained as the final text detection result. It is easy to understand that the text center kernel prediction branch network effectively assists in correcting the boundaries of the generated text candidate boxes.
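A sketch of this branch follows, assuming PyTorch; the intermediate channel widths are assumptions, only the three-layer convolutional structure and the per-pixel kernel prediction are from the text.

```python
import torch.nn as nn

# text center kernel prediction branch: three convolutional layers over the VGG features
kernel_branch = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, kernel_size=1),   # probability that a pixel lies in a text center kernel
)
```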
In some preferred embodiments, the optimization objective function in step 201 is configured as

L = (1/N_s) Σ_i L_cls(s_i, s*_i) + (1/N_v) Σ_j L_re(v_j, v*_j) + (1/|S_l|) Σ_{m∈S_l} L_k(y_m, x_m)

where s_i represents the predicted probability that a pixel is text/non-text, s*_i represents the corresponding true value, N_s represents the number of pixels involved in the calculation, and L_cls represents the loss function used; v_j and v*_j represent the predicted and true coordinate values respectively, N_v represents the number of pixels involved in the calculation, and L_re represents the loss function used; y_m and x_m represent the predicted and true values of the pixels respectively, and S_l represents the sampled pixel set used to control the ratio of positive and negative samples; i, j and m respectively index the pixels traversed.
In some preferred embodiments, after the training of the neural network in the train ticket text detection system is completed, the second text region output obtained through the second prediction module and second output module branch is no longer used, and the first text region is taken directly as the final output result. During detection, computing and communication resources are no longer allocated to the second prediction module and second output module branch, and only the first text region is taken as the system output.
The technical scheme of the invention has the following beneficial effects: it detects the regular texts carrying content information in pictures of train ticket bills. The text is divided into equal-width text segments for prediction by a backbone network, which reduces training difficulty, and the sequence features of the text are efficiently extracted by a BiLSTM so that an effective detection result is formed after merging. This technical concept also makes it convenient to use the text center kernel prediction branch network to strengthen the predicted text box boundaries during training, thereby achieving more accurate detection output. The system and method provided by the invention effectively solve the problem of detecting the content in train ticket bills.
Drawings
FIG. 1 is a system block diagram of one embodiment of a train ticket text detection system of the present invention;
FIG. 2 is a system block diagram of another embodiment of the train ticket text detection system of the present invention;
FIG. 3 is a system block diagram of another embodiment of the train ticket text detection system of the present invention;
FIG. 4 is a system block diagram of another embodiment of the train ticket text detection system of the present invention;
FIG. 5 is a system block diagram of another embodiment of the train ticket text detection system of the present invention;
FIG. 6 is a schematic structural diagram of a backbone network in an embodiment of a train ticket text detection system according to the present invention;
FIG. 7 is a schematic structural diagram of a text-centric prediction branch network in an embodiment of a train ticket text detection system according to the present invention;
FIG. 8 is a schematic structural diagram of a train ticket text detection system including a labeling module according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a processing procedure for obtaining an independent picture from an original picture in an embodiment of a text detection method for a train ticket according to the present invention;
FIG. 10 is a schematic diagram of a data processing flow when step 200 is implemented in an embodiment of a method for detecting a text of a train ticket according to the present invention;
fig. 11 is a schematic diagram of a first text region and a second text region on an independent picture obtained during training according to an embodiment of the train ticket text detection method of the present invention.
Detailed Description
It should be noted that, in this document, VGGNet is a deep convolutional neural network proposed by the Visual Geometry Group of Oxford University and Google DeepMind. VGGNet uses only 3 × 3 small convolution kernels and 2 × 2 max pooling kernels, improving performance by continuously deepening the network structure; as the network deepens, the width and height of the image are reduced by a fixed rule, halving after each pooling while the number of channels doubles. The backbone network deletes the last pooling layer of VGGNet and the three fully connected layers after it, and outputs a normalized feature map after processing the feature values obtained by the last parameterized convolutional layer with Softmax.
The core concept of the invention is to train a new deep learning neural network system to detect the text content information arranged in the same direction in a bill; during detection, each text line is divided into text segments and equal-width detection is performed on each segment, which reduces the difficulty of detection. Because detection is constructed as a neural network method based on the regression idea, and to overcome the inaccurate text boundaries that conventionally result from training such methods, a text center kernel prediction branch network is added during training to correct the boundaries before the model parameters are optimized, which improves the usefulness of the detection result for subsequent recognition. The main neural network modules of the system involve three parts: the first is a feature extraction module based on the VGGNet structure, which performs preliminary feature map extraction on the picture; the second is a first prediction module adopting a BiLSTM structure, which judges text/non-text segments and boundary information; the third is a second prediction module centered on a text center kernel prediction branch network based on the segmentation idea, which corrects the text boundaries extracted by the second part. In the feature extraction module, the last pooling layer and the fully connected layers of VGGNet are removed, and the network before the last pooling of VGGNet is used as the backbone network to obtain the spatial features of the input picture. In the first prediction module, the feature map extracted by the backbone network is fed into a BiLSTM network to obtain the spatial and sequence features of the text part, and classification of foreground and background together with the initial coordinates and height of each candidate region is then output through a fully connected layer. In the second prediction module, a text center kernel detection branch network is constructed from multilayer convolutional layers; it continuously extracts features from the output of the feature extraction module through convolution operations, uses them to determine the central region of the text content, and after expansion this region yields the complete text region to assist boundary judgment during training. The embodiments of the invention are particularly suitable for detecting bill content information such as train tickets: the areas and positions of the texts in a train ticket differ greatly, but their extending directions are basically consistent, so that after one rotation process all text directions can be made essentially horizontal; in these embodiments, the texts in the same direction are divided into text segments and each segment is detected, which reduces the detection difficulty.
In this deep-learning-based detection method, the poor accuracy of information boundaries is corrected during the training stage by combining the regression and segmentation methods, reducing training difficulty while improving accuracy; in the application stage only the regression-based part of the network model processes data, which reduces the computing, storage and communication resources consumed. The principle is to add text center kernel prediction in the training step to correct the information boundaries obtained by plain regression and so optimize the detection result. In some specific embodiments, the whole system uses VGG-16 for feature extraction and a bidirectional LSTM to predict character candidate regions and background information, yielding regression-based text region boundaries; in training, a kernel branch predicts the central kernel region of the content to be detected, and the boundaries are corrected by adding the text center kernel prediction.
In a first embodiment of the first aspect of the present invention, as shown in fig. 1, a train ticket text detection system based on equal-width text segments is provided. The train ticket text detection system is a neural network system established in a computer system and is used for outputting the text regions containing the various items of specific content information according to an input train ticket picture. It comprises: a feature extraction module 1001 configured to read an independent picture and output a feature map of the independent picture, a first prediction module 1002 configured to read the feature map and output first candidate region information, and a first output module 1003 configured to read the first candidate region information and output a first text region.
The feature extraction module 1001 is configured with the whole neural network structure up to and including the last parameterized convolutional layer of VGGNet, referred to herein as the backbone network; it includes the input layer and each convolution section (block) of standard VGGNet, but not the pooling layer after the last convolution section nor the fully connected layers after that pooling layer. That is, the feature extraction module 1001 reads an independent picture through the input layer of the backbone network and directly uses the output of the backbone network's last parameterized convolutional layer as the overall feature map of the independent picture. In the present embodiment, the Softmax function used for normalization of the output is

y_i = a_i / Σ_i a_i

where a_i denotes the value at the i-th position. This function normalizes the output of the previous step and limits it to the range 0 to 1 so that the whole network model converges faster during training; it is spliced directly behind Block 5 of VGGNet. After Softmax is applied to the last convolutional layer, a multi-dimensional vector is obtained directly, and this vector is input directly into the BiLSTM and/or the text center kernel detection branch network.
The first prediction module 1002 is configured with a bidirectional LSTM (BiLSTM) and a fully connected layer behind the BiLSTM. The first prediction module 1002 reads the whole feature map output by the feature extraction module 1001, performs feature extraction on it with the BiLSTM, and outputs the first candidate region information through the fully connected layer; the first candidate region information includes coordinate information and score information.
It is easy to understand that at the output of the backbone network's last parameterized convolutional layer, each sliding of a convolution kernel window with a specified step size yields a feature vector of fixed size, and these feature vectors are used in the BiLSTM to predict several candidate text segments. The BiLSTM is configured to output prediction suggestions for candidate regions of consecutive fixed-width fine-grained text segments, i.e., the equal-width text segments described in the present invention.
Preferably, in some embodiments, the feature vectors output by the backbone network are processed by the BiLSTM into a large number of text segment candidate regions, among which are many redundant candidate boxes; in these embodiments the first prediction module 1002 is therefore configured with a filtering unit implemented by the NMS algorithm, which filters out candidate regions that overlap each other, and the remaining candidate regions are finally merged and adjusted into the text detection result, i.e., the first text region of this embodiment, by methods such as bidirectional text detection in the first output module 1003.
Because the features extracted by the BiLSTM include spatial features and sequence features, the first output module 1003 splices and screens the candidate regions according to the acquired first candidate region information to output the first text regions containing the items of specific content information.
Referring to fig. 2, a second embodiment of the first aspect of the present invention provides a train ticket text detection system based on equal-width text segments which, compared with the first embodiment, further includes: a second prediction module 1004 for reading the feature map output by the feature extraction module 1001 and outputting second candidate region information, and a second output module 1005 for reading the first text region and outputting a second text region corrected using the second candidate region information.
The second prediction module 1004 is configured with a text center kernel prediction branch network; based on the segmentation idea, it detects the feature map output by the feature extraction module 1001 through this branch network to obtain the text center kernel, extracts each text region, and then corrects the boundary of each text region to obtain the second candidate region information. The second candidate region information thus includes the text center kernel extracted from the feature map by multilayer convolution.
The second output module 1005 is configured with a merging unit implemented by the Vatti algorithm and a merging algorithm. For the text center kernel contained in the second candidate region information output by the second prediction module 1004, the second output module 1005 expands the kernel region with the Vatti algorithm to obtain the complete text region of the text center kernel, and then uses the merging algorithm to take the intersection of that region with the first text region generated by the first output module 1003 as the final second text region.
It is easy to understand that prediction results based on text segments alone cannot handle the boundary problem well in this embodiment, and the accuracy of the detection result greatly affects the subsequent recognition step; the method therefore introduces the center kernel prediction branch and corrects the boundary after detecting the text region based on the segmentation idea. The branch consists of three convolutional layers, and fig. 7 shows an exemplary structure of the text center kernel prediction branch according to the present invention. With the features extracted by VGG-16, this branch can predict the central kernel area of the corresponding text of the train ticket. After the text center kernel is obtained, the Vatti algorithm is used again to expand the region into a complete text region, and the intersection of this region with the detection result generated by the LSTM network is taken as the final detection result.
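The following is a minimal sketch of such a merging unit, assuming the pyclipper library (an implementation of Vatti polygon clipping) and OpenCV; the expansion-offset rule and the function names are assumptions modeled on the shrink formula given later in the description.

```python
import cv2
import numpy as np
import pyclipper

def expand_kernel(kernel_contour, ratio=1.5):
    """Dilate a text-center-kernel contour back toward full text size (Vatti offset)."""
    area = cv2.contourArea(kernel_contour)
    length = cv2.arcLength(kernel_contour, closed=True)
    offset = area * ratio / max(length, 1e-6)              # assumed offset rule
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(kernel_contour.reshape(-1, 2).tolist(),
                pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    return [np.array(p, dtype=np.int32) for p in pco.Execute(offset)]

def intersect_boxes(a, b):
    """Axis-aligned intersection of a first-branch box with an expanded-kernel box."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None
```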
In a third embodiment of the first aspect of the present invention, as shown in figs. 3 and 4, a train ticket text detection system based on equal-width text segments is provided which, compared with the first and second embodiments, further includes a preprocessing module 1000 configured to preprocess the original picture so as to generate the independent picture; the preprocessing comprises binarization processing and angle correction processing. It is easy to understand that such image processing operations are conventional technical means in the field, but a train ticket text detection system including binarization and angle correction improves model training efficiency, and the preprocessing module can if necessary be reused for data enhancement during training.
Referring to fig. 5, a fourth embodiment of the first aspect of the present invention provides a train ticket text detection system based on equal-width text segments which, compared with the first and second embodiments, further includes a label making module 1006 for labeling the text center regions of the training set in advance according to the text regions. Specifically, in this embodiment the label making module 1006 is configured with the Vatti algorithm; in use, it calculates the central area of the corresponding text from the four vertices of the text content according to the configured formula, and extracts the contour information with a library such as OpenCV to obtain the center kernel coordinate points of the text.
The following further explains the implementation and specific functions of each module in the train ticket text detection of the present invention by using the train ticket text detection method for the equal-width text segments provided in the embodiments of the second aspect of the present invention.
In one embodiment of the second aspect, the method for detecting the text of the train ticket comprises the following main steps:
and step 100, preprocessing the picture. In an exemplary embodiment, a scanning device or a photographing device is used to upload a single picture of a train ticket to a train ticket text detection system of the present invention. Because the extraction of the text segments can be greatly influenced when the train ticket is placed in a non-horizontal mode, the preprocessing module 1000 of the train ticket text detection system is used for firstly correcting the picture of the train ticket text detection system in the embodiment.
Specifically, the preprocessing module of this embodiment employs a contour-based correction method. The processing flow can be further divided into the following steps: 101, performing graying processing on the uploaded original picture and recording the brightness information of each pixel in one byte; 102, binarizing the image brightness information obtained in the previous step with the fixed threshold set by the preprocessing module to obtain a black-and-white image; 103, finding the contour of the whole picture using OpenCV, obtaining its bounding rectangle, and calculating the angle relative to the reference line; 104, calculating the rotation angle from the angle of the previous step and performing rotation correction on the image; and 105, cropping the region inside the contour of the rotated image out as an independent picture. Referring to fig. 9, after an original picture obtained by photographing or scanning is input, it is processed by steps 101 to 104 to yield an intermediate picture; the contour is then extracted with an externally called visual processing library such as OpenCV, and the picture inside the contour is cropped out along the black-white boundary. The area of the independent picture is the ticket area of the original picture, and the independent picture is the complete ticket picture containing the various items of text content information, ready to be input to the subsequent neural network modules for processing.
Step 200, obtaining the feature map of the independent picture through the backbone network. Specifically, the independent picture obtained after the processing of step 100 undergoes feature extraction through the backbone network. Exemplarily, as shown in fig. 6, this embodiment uses a VGG-16-based backbone network consisting of only the first five blocks, i.e., with the last three fully connected layers and one pooling layer of VGG-16 removed. The output of the last convolutional layer in Block 5 is a feature map of size C × H × W, where C, H and W represent the feature channels, height and width of the image respectively. A 3 × 3 window is then taken over the feature map, i.e., feature extraction is performed on each point together with its surrounding 3 × 3 area to obtain a feature vector of length 3 × 3 × C, yielding a feature map of size 9C × H × W. The shape of the feature map is then adjusted to H × W × 9C and output. Clearly, this feature map carries the spatial feature information of the independent picture.
In step 300, the features extracted by the backbone network of the VGG-16 structure are further input into the LSTM network, wherein the maximum time length is the width of the feature map. The LSTM network can well process serialized objects, and in order to enhance the ability of the network to learn the sequence characteristics of each row, the method adopts the bidirectional LSTM network. The feature map after passing through the LSTM network contains spatial features and sequence features.
After a 512-dimensional fully connected layer, the output features of the bidirectional LSTM network yield the text candidate box predictions and the text/non-text predictions respectively.
In step 400, the generated text candidate boxes include coordinate regions with low response-value scores as well as regions overlapping each other, so in this embodiment the filtering unit configured in the first prediction module 1002 first filters the candidate boxes with the NMS algorithm at a set threshold. The text/non-text probability values generated in step 300 serve as the scores of the corresponding regions, and regions scoring below the threshold are filtered out.
The remaining text candidate boxes are segments of complete texts and therefore need to be connected to form the detected text regions. Preferably, this embodiment uses a bidirectional detection method. First, the text segments are sorted in the horizontal direction and the distance from each segment to the others is calculated from left to right; when the distance is smaller than a set threshold, the segments are considered to belong to the same text. The spacing between segments is then calculated again from right to left in the same way. In the vertical direction, the overlapping area of the text segments is calculated, and when the degree of overlap exceeds an overlap threshold they are attributed to the same text. Finally, the midpoints of the upper and lower boundaries of all text segments belonging to the same text are taken to form the complete text boundary.
Preferably, in this embodiment, to make the text boundaries predicted by the network more accurate, the text center kernel prediction branch network is likewise introduced during training. The branch is based on the segmentation idea and consists of three convolutional layers. After the features extracted by VGG-16 are input, the central kernel area of the text is computed and then expanded to the size of the corresponding text region; the result is compared with the text boundary obtained in step 400, and the union region is finally obtained as the final text detection result. This branch effectively assists in correcting the boundaries of the generated text candidate boxes.
Exemplarily, in this embodiment, the Vatti-based formula for generating the text center kernel region information computes the shrink offset

d' = A' × (1 − r'²) / L'

where A' represents the area of the text region, L' represents the perimeter of the text region, and r' is a value set according to actual requirements. From this formula the central kernel area D' of the corresponding text is calculated, and OpenCV is then used to extract the contour information to obtain the center kernel coordinate points of the text.
It is easy to understand that when training sets for text detection networks are created in the prior art, their labels only provide the four vertices of each item of content in the ticket, and none provides information about the text center kernel. The method therefore adopts the Vatti algorithm to generate the text center kernel region information.
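The following is a minimal sketch of this label generation, assuming pyclipper and OpenCV; the function name and the mask-based contour extraction are illustrative assumptions.

```python
import cv2
import numpy as np
import pyclipper

def make_kernel_label(quad, img_shape, r=0.5):
    """Shrink a four-vertex text region by d' = A'(1 - r'^2)/L' and extract the kernel contour."""
    quad = np.asarray(quad, dtype=np.float32)            # four vertices of one text region
    area = cv2.contourArea(quad)
    length = cv2.arcLength(quad, closed=True)
    d = area * (1.0 - r * r) / max(length, 1e-6)         # shrink offset d'
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(quad.astype(np.int32).tolist(),
                pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    shrunk = pco.Execute(-d)                             # negative offset shrinks the polygon
    mask = np.zeros(img_shape[:2], dtype=np.uint8)
    if shrunk:
        cv2.fillPoly(mask, [np.array(shrunk[0], dtype=np.int32)], 1)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return mask, contours                                # kernel mask and its coordinate points
```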
The following further details the concept of the present invention, combining the neural network system established in a computer system provided by the first aspect with the training and detection method provided by the second aspect, and giving an application example of the method and system with specific data. Referring to figs. 6, 7, 8 and 9, the computer system serving as the operating basis of the neural network in this application example comprises a preprocessing module 1000, a feature extraction module 1001, a first prediction module 1002, a first output module 1003, a second prediction module 1004, a second output module 1005 and a label making module 1006; the backbone network structure called by the feature extraction module 1001 is shown in fig. 6, the text center kernel prediction branch network structure called by the second prediction module 1004 is shown in fig. 7, and fig. 10 illustrates the overall data processing structure of the neural network model running in the computer system in this application example. Since the second aspect of the invention provides the complete demonstration of the technical method, this application example describes the use of the computer system step by step.
Step 201, constructing all neural network structures of the train ticket text detection system and configuring an optimization objective function.
In particular, in this application example the optimization objective function is configured as

L = (1/N_s) Σ_i L_cls(s_i, s*_i) + (1/N_v) Σ_j L_re(v_j, v*_j) + (1/|S_l|) Σ_{m∈S_l} L_k(y_m, x_m)

It can be seen that the loss L of this formula consists of three parts. The first part represents the classification loss of foreground and background, where s_i represents the predicted probability that a pixel is text/non-text, s*_i the corresponding true value, N_s the number of pixels involved in the calculation, and L_cls the loss function used, preferably the Dice Loss in this embodiment. The second part represents the prediction error of the text segments, where v_j and v*_j represent the predicted and true coordinate values respectively, N_v the number of pixels involved in the calculation, and L_re the loss function used, preferably the L1 Loss in this embodiment. The third part represents the prediction error of the text center kernel, where y_m and x_m represent the predicted and true values at pixel level respectively, and S_l is the sampled pixel set used to control the ratio of positive and negative samples, preferably controlled to 1:3 in this embodiment. Here i, j and m respectively index the pixels traversed. It is easy to understand that the first and third parts are loss settings based on the segmentation idea, and the second part is a loss setting based on the regression idea.
It is easy to understand that in this application example the independent picture obtained by preprocessing the scanned picture is fed into the system's neural network, where the BiLSTM generates a detection result while the center kernel prediction branch corrects it to produce the second output; this process is optimized and learned through the loss function before the trained system is used for detection.
Step 202, a sample set is created, and a text center core area in the training sample is labeled through a label making module 1006.
In this application example the size of a train ticket is fixed, so all picture samples finally input into the detection system are fixed to 680 × 450; samples that do not satisfy this size are converted by bilinear interpolation. The data enhancement methods used on the samples during training are random brightness adjustment and saturation/hue adjustment.
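A minimal sketch of this input normalization follows, assuming OpenCV; the 680 × 450 size and bilinear interpolation are from the text, while the augmentation ranges are assumptions.

```python
import random
import cv2

def to_network_size(img):
    return cv2.resize(img, (680, 450), interpolation=cv2.INTER_LINEAR)  # bilinear

def augment(img):
    img = cv2.convertScaleAbs(img, alpha=1.0, beta=random.uniform(-32, 32))         # brightness
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hsv[..., 1] = cv2.convertScaleAbs(hsv[..., 1], alpha=random.uniform(0.8, 1.2))  # saturation
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```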
Exemplarily, the samples in this application undergo the following preprocessing. First the picture is converted to grayscale, fixing the brightness of each pixel between 0 and 255 with 8-bit data; the picture is then binarized with a fixed threshold alone, setting the pixel value to 1 when it exceeds the threshold and to 0 otherwise, converting the picture to black and white. OpenCV is then used to extract the contour, obtain the edge information, and calculate the inclination angle, from which the image can be corrected. It is easy to understand that, because train tickets must be scanned or photographed before uploading, human factors may incline the whole ticket, and this inclination affects the extraction of text segments; the method therefore preprocesses the picture with the contour-based correction method.
Step 203, training the feature extraction module 1001, the first prediction module 1002, the first output module 1003, the second prediction module 1004, and the second output module 1005.
Illustratively, during training the optimizer chosen is Adadelta, which computes the gradients and performs back propagation. The training batch size is set to 8, for a total of 1200 epochs.
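A minimal training-loop sketch follows, assuming PyTorch; Adadelta, batch size 8 and 1200 epochs are from the text, while `model`, `loader` and the target keys are hypothetical stand-ins for the assembled network and the labeled sample set.

```python
import torch

optimizer = torch.optim.Adadelta(model.parameters())    # model: the assembled detection network
for epoch in range(1200):
    for images, targets in loader:                      # loader yields batches of 8 samples
        score, coords, kernel = model(images)
        loss = detection_loss(score, targets["score"], coords, targets["coords"],
                              kernel, targets["kernel"])
        optimizer.zero_grad()
        loss.backward()                                 # gradient computation and back propagation
        optimizer.step()
    torch.save(model.state_dict(), f"ckpt_{epoch}.pt")  # keep parameters for selection in step 204
```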
During training, the scanned pictures are preprocessed and sent into the backbone network for feature extraction, with the last pooling layer and the fully connected layers of VGG-16 removed. A 3 × 3 sliding window with step size 1 is then applied to the fifth block of VGG-16, each window yielding a feature vector of length 3 × 3 × C. These feature vectors are used to predict the text segments, i.e., 10 candidate text segments are predicted at the center of each window. The feature vectors of the feature map are then sent into the BiLSTM network to obtain 256-dimensional feature vectors containing sequence features, which are output through a 512-dimensional fully connected layer. The output first candidate region information contains two parts: the predicted candidate box coordinates (center point coordinate and height) and the corresponding scores (text/non-text probability).
The first candidate region information in the output contains a large number of redundant text segment candidate boxes, so segments with excessive overlap are first filtered out by the NMS algorithm. The text segments are finally combined into the text detection result by the bidirectional text detection method. First, all filtered text segments are sorted in the horizontal direction, and the distance from each segment to the others is calculated from left to right; if the distance is less than 40, they are considered to belong to the same text. The same process is repeated from right to left. In the vertical direction, if the intersection ratio of two text segments is greater than 0.7, they are considered to belong to the same text. Then, for all segments belonging to the same text, the midpoints of the upper and lower boundaries are taken as the boundary points of the outer box; for text segments that overlap in the vertical direction, the largest upper-boundary midpoint and the smallest lower-boundary midpoint are taken. Finally, all the points are smoothed to generate the detection result.
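The following is a minimal sketch of this bidirectional merging rule, assuming simple (x1, y1, x2, y2) segment boxes; the 40-pixel gap and 0.7 intersection-ratio thresholds are from the text, the helper names are illustrative.

```python
def same_text(a, b, max_gap=40, min_overlap=0.7):
    gap = max(b[0] - a[2], a[0] - b[2])           # horizontal spacing interval
    inter = min(a[3], b[3]) - max(a[1], b[1])     # vertical intersection
    union = max(a[3], b[3]) - min(a[1], b[1])
    return gap < max_gap and union > 0 and inter / union > min_overlap

def merge_segments(segs):
    segs = sorted(segs, key=lambda s: s[0])       # left-to-right ordering
    lines = []
    for s in segs:
        if lines and same_text(lines[-1][-1], s):
            lines[-1].append(s)                   # extend the current text line
        else:
            lines.append([s])                     # start a new text line
    # one enclosing box per text line: union of its equal-width segments
    return [(l[0][0], min(t[1] for t in l), l[-1][2], max(t[3] for t in l)) for l in lines]
```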
Fig. 11 illustrates the boundaries of the respective first and second text regions obtained during training on an independent picture, where the thicker box line A marks the boundaries of the first text regions and the thinner box line B marks the boundaries of the second text regions. It can be seen that the shape of the frame line B, extracted by segmentation, is irregular and variable, whereas the frame line A, extracted by regression, fits the text font relatively closely and is relatively smooth. The detection result supervised in training is the union of the first text region and the second text region.
Step 204, selecting a training result and storing the deep learning neural network parameters.
Exemplarily, after 1200 epochs of training, multiple sets of model parameters are obtained; preferably, the optimal model parameters (those with the minimum objective function value) are selected for practical application. In the application process, the train ticket still needs to be preprocessed, and the picture is corrected and resized to 680 × 450. At this point, data enhancement of the picture is no longer required.
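A brief sketch of checkpoint selection and application-time preprocessing follows, assuming one checkpoint file per saved training result; the file pattern and the "objective"/"state_dict" keys are hypothetical, and preprocess_ticket reuses the earlier preprocessing sketch.

```python
import glob
import cv2
import torch

def load_best_checkpoint(model, pattern="ckpt_*.pth"):
    """Pick the saved parameters with the minimum objective function value."""
    checkpoints = [torch.load(p, map_location="cpu") for p in glob.glob(pattern)]
    best = min(checkpoints, key=lambda c: c["objective"])   # hypothetical key
    model.load_state_dict(best["state_dict"])               # hypothetical key
    model.eval()
    return model

def prepare_for_inference(path):
    """Application-time preprocessing: correct the ticket, then resize to 680 x 450.
    No data enhancement is applied at this stage."""
    corrected = preprocess_ticket(path)   # contour-based correction (earlier sketch)
    return cv2.resize(corrected, (680, 450))
```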
Exemplarily, in the present application, given that the network model is sufficiently well trained, and in order to increase the speed of the neural network in practical use, the second prediction module 1004 and the second output module 1005 are removed in the application process of this step. That is, the second text region output by the text-center-core prediction branch after fusion is no longer used; instead, the detection result of the bidirectional LSTM of the first prediction module 1002 and the first output module 1003, i.e. the first text region, is directly used as the final output and fed to the recognition input of the external processing stage. The other parameters of the entire network model remain unchanged in the text detection application of this step.
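One way to realize this speed-up is to skip the second branch in the forward pass. A minimal sketch is given below; the attribute names, which mirror reference numerals 1001 through 1005, are hypothetical, and only the control flow is the point.

```python
import torch

class TrainTicketDetector(torch.nn.Module):
    """Skeleton showing how the second branch is bypassed at application time."""

    def __init__(self, feature_extraction, first_prediction, first_output,
                 second_prediction, second_output):
        super().__init__()
        self.feature_extraction = feature_extraction   # module 1001
        self.first_prediction = first_prediction       # module 1002 (BiLSTM branch)
        self.first_output = first_output               # module 1003
        self.second_prediction = second_prediction     # module 1004
        self.second_output = second_output             # module 1005

    def forward(self, x, training_mode=False):
        feat = self.feature_extraction(x)
        candidates = self.first_prediction(feat)
        first_region = self.first_output(candidates)
        if not training_mode:
            return first_region                        # application: first text region only
        kernels = self.second_prediction(feat)         # used only during training
        return self.second_output(first_region, kernels)   # second text region
```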
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working processes of the system, the modules and the units described in the system embodiment may refer to corresponding processes in the above method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is only a logical division, and other divisions may be used in practice; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. On the other hand, the couplings, direct couplings, or communication connections shown or discussed between components may be implemented through interfaces, or through indirect couplings or communication connections of devices or units, such as calls to external neural network units, and may take a local, remote, or mixed resource configuration form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing device, each module may exist alone physically, or two or more modules may be integrated into one processing device. The integrated module may be implemented in the form of hardware or in the form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The train ticket text detection system and method are described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A train ticket text detection system based on equal-width character segments, comprising:
a feature extraction module, configured to read an independent picture and output a feature map of the independent picture;
a first prediction module, configured to read the feature map and output first candidate region information;
a first output module, configured to read the first candidate region information and output a first text region;
a second prediction module, configured to read the feature map and output second candidate region information; and,
a second output module, configured to read the first text region and to output a second text region obtained by correcting the first text region using the second candidate region information;
wherein,
the feature map is output by the last parameter-bearing convolutional layer of VGGNet; the first candidate region information is output through a fully connected layer after feature extraction is performed on the feature map by a BiLSTM; and the first candidate region information includes coordinate information and score information extracted from the feature map according to a regression idea.
2. The train ticket text detection system of claim 1, wherein:
the second candidate region information comprises a text center core extracted from the feature map by multilayer convolution based on a segmentation idea; the second prediction module is configured with a text-center-core prediction branch network, through which it detects the feature map output by the feature extraction module to obtain the text center core, extracts each first text region, and then corrects the boundary of the first text region to obtain the second text region;
and the train ticket text detection system performs parameter optimization using the second text region in the training stage, and outputs the first text region for recognition by an external system in the application stage.
3. The train ticket text detection system of claim 1, wherein the second output module is configured with a merging unit implemented by the Vatti algorithm and a merging algorithm; the second output module expands, through the merging unit, the text center core region contained in the second candidate region information output by the second prediction module to obtain a text region with a complete text center core, and then calculates the intersection of this text region and the first text region generated by the first output module, outputting it as the final second text region.
4. The train ticket text detection system of claim 1, comprising a preprocessing module for preprocessing an original picture to generate the independent picture, wherein the preprocessing comprises binarization processing and angle correction processing.
5. The system of claim 1, wherein the system comprises a labeling module, and the labeling module calculates the central region of the corresponding text according to the configured Vatti algorithm and the four-vertex information of the text region, and then obtains the center coordinates of the text region using the contour information.
6. A method for detecting train ticket text based on equal-width character segments, comprising the following steps:
step 100, preprocessing a scanned picture used for sample creation to obtain an independent picture of fixed size;
step 200, using independent pictures with text core labels to form a training set to train a deep learning neural network called by a feature extraction module, a first prediction module and a second prediction module in the train ticket text detection system of claim 1;
step 300, obtaining the independent picture for text detection with fixed size from the scanned picture for detection and identification;
step 400, obtaining a first text region for each item of text content information in the independent picture of step 300 using the train ticket text detection system, the first text region being used for character recognition by an external system.
7. The method of claim 6, wherein the step 100 comprises the steps of:
101, performing grayscale processing on the uploaded original scanned picture, and recording the brightness information of each pixel point in one byte;
102, binarizing the image brightness information obtained in the previous step with a fixed threshold set by the preprocessing module to obtain a black-and-white image;
103, searching the outline of the whole picture, acquiring a bounding rectangle of the outline, and calculating the angle relative to a reference line;
104, calculating a rotation angle from the angle of the previous step, and performing rotation correction on the image;
105, cropping the region within the outline of the rotated image as an independent picture.
8. The method of claim 6, wherein the deep learning neural network of the train ticket text detection system in step 200 is trained by:
step 201, constructing all neural network structures and configuring optimization objective functions of the train ticket text detection system;
step 202, creating a sample set, and labeling a text center core area in a training sample through a label making module;
step 203, training a feature extraction module, a first prediction module, a first output module, a second prediction module and a second output module of the train ticket text detection system;
step 204, selecting a training result and storing the deep learning neural network parameters.
9. The method of claim 8, wherein the optimization objective function in step 201 is configured as:
$$L = \frac{1}{N_s}\sum_{i} L_{cls}(s_i, s_i^{*}) + \frac{1}{N_v}\sum_{j} L_{re}(v_j, v_j^{*}) + \sum_{m \in S_l} L_{k}(y_m, x_m)$$

wherein $s_i$ represents the predicted probability that a pixel is text/non-text, $s_i^{*}$ represents the corresponding true value, $N_s$ represents the number of pixel points involved in the calculation, and $L_{cls}$ represents the classification loss function used; $v_j$ and $v_j^{*}$ respectively represent the predicted and true coordinate values, $N_v$ represents the number of pixels participating in the calculation, and $L_{re}$ represents the regression loss function used; $y_m$ and $x_m$ respectively represent the predicted value and the true value of a pixel point, $L_{k}$ represents the loss on the text center core, and $S_l$ represents the sampled pixel point set used to control the ratio of positive and negative samples; $i$, $j$ and $m$ respectively index the traversed pixels.
10. The train ticket text detection method of claim 8, wherein after the training of the neural network in the train ticket text detection system is completed, the second text region output obtained by the branch of the second prediction module and the second output module is no longer used, and the first text region is directly used as the final output result.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110249213.8A CN113065404B (en) | 2021-03-08 | 2021-03-08 | Method and system for detecting train ticket content based on equal-width character segments
Publications (2)
Publication Number | Publication Date |
---|---|
CN113065404A true CN113065404A (en) | 2021-07-02 |
CN113065404B CN113065404B (en) | 2023-02-24 |
Family
ID=76559919
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110249213.8A Active CN113065404B (en) | 2021-03-08 | 2021-03-08 | Method and system for detecting train ticket content based on equal-width character segments |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113065404B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304835A (en) * | 2018-01-30 | 2018-07-20 | 百度在线网络技术(北京)有限公司 | character detecting method and device |
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
CN109886264A (en) * | 2019-01-08 | 2019-06-14 | 深圳禾思众成科技有限公司 | A kind of character detecting method, equipment and computer readable storage medium |
CN110135424A (en) * | 2019-05-23 | 2019-08-16 | 阳光保险集团股份有限公司 | Tilt text detection model training method and ticket image Method for text detection |
CN111767867A (en) * | 2020-06-30 | 2020-10-13 | 创新奇智(北京)科技有限公司 | Text detection method, model training method and corresponding devices |
CN111985464A (en) * | 2020-08-13 | 2020-11-24 | 山东大学 | Multi-scale learning character recognition method and system for court judgment documents |
Non-Patent Citations (3)
Title |
---|
BAOGUANG SHI,ET AL.: "Detecting Oriented Text in Natural Images by Linking Segments", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 * |
WENHAI WANG,ET AL.: "Shape Robust Text Detection with Progressive Scale Expansion Network", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 * |
ZHANG FUSHENG: "Research on Text Line Detection Algorithms in Natural Scenes Based on Deep Learning", China Excellent Master's Theses Full-text Database, Information Science and Technology Series *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673497A (en) * | 2021-07-21 | 2021-11-19 | 浙江大华技术股份有限公司 | Text detection method, terminal and computer readable storage medium thereof |
CN113298054A (en) * | 2021-07-27 | 2021-08-24 | 国际关系学院 | Text region detection method based on embedded spatial pixel clustering |
CN113298054B (en) * | 2021-07-27 | 2021-10-08 | 国际关系学院 | Text region detection method based on embedded spatial pixel clustering |
Also Published As
Publication number | Publication date |
---|---|
CN113065404B (en) | 2023-02-24 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |