
CN114419630A - Text recognition method based on neural network search in automatic machine learning - Google Patents

Text recognition method based on neural network search in automatic machine learning

Info

Publication number
CN114419630A
CN114419630A
Authority
CN
China
Prior art keywords
search
neural network
convolution
picture
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111420015.XA
Other languages
Chinese (zh)
Inventor
王希佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202111420015.XA priority Critical patent/CN114419630A/en
Publication of CN114419630A publication Critical patent/CN114419630A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a text recognition method based on neural network search in automatic machine learning, comprising four steps: text picture preprocessing, feature extraction module space structure design, a double-layer neural network search algorithm, and feature extraction and prediction. The method uses neural network architecture search to find a data-dependent sequence feature extractor and so complete the scene text recognition task. The conception is ingenious: first, a novel search space is designed for the text recognition problem, one that effectively covers both the type and the stride of each convolution. Subsequent experiments show that a backbone network searched with this method can greatly improve the performance of a text recognition network.

Description

Text recognition method based on neural network search in automatic machine learning
Technical Field
The invention relates to the technical field of computer intelligent recognition, in particular to a text recognition method based on neural network search in automatic machine learning.
Background
With the development of technology, a natural scene text recognition algorithm in the prior art can be divided into three major parts: a preprocessing rectification module (Rectification Module), a picture feature extraction module (Backbone Module), and a feature prediction transcription module (Feature Translator Module).
Due to factors such as the camera's shooting angle and device shake, text is often distorted. The preprocessing rectification module is responsible for straightening irregular text into horizontal text, reducing the recognition difficulty for the subsequent recognition network; the picture feature extraction module extracts a feature vector from the horizontal text image; and the feature prediction transcription module decodes the encoded high-dimensional feature vector into the target recognition sequence. Existing research tends to improve the rectification module and the transcription module while overlooking the feature extraction module.
However, in practice the image feature extraction module has a large influence on the text recognition result, and it accounts for a large proportion of the computation and storage load of the whole text recognition framework. The design of the feature extraction module is therefore important both for improving recognition accuracy and for improving model efficiency.
The feature extraction module is sensitive to the input pictures, and the parameters of existing manually designed modules must be tuned through a large number of experiments to adapt to different application scenarios. Neural network architecture search in automatic machine learning has developed rapidly in recent years; its core idea is to design, through automatic search, network structures and modules suited to each task, thereby freeing manpower and saving time and computing resources.
Compared with traditional manual structure design, designing a scene character recognition network based on neural network search in automatic machine learning can comprehensively balance the requirements of accuracy and efficiency. It requires researchers to design a well-crafted network search space and choose a basic set of candidate operations. After the search space is defined, choosing a suitable search algorithm is also critical.
Disclosure of Invention
Aiming at the problems and defects of the prior art, the invention provides a text recognition method based on neural network search in automatic machine learning. It is developed around the combination of scene character recognition and neural network architecture search, and mainly comprises a search space design based on the feature extraction module and a two-stage search algorithm design combining reinforcement learning with differentiable optimization, so that a good feature extraction module can be obtained by searching in a short time and the metrics of scene text recognition are improved.
The technical scheme of the invention is as follows:
A text recognition method based on neural network search in automatic machine learning adopts four steps: text picture preprocessing, feature extraction module space structure design, a double-layer neural network search algorithm, and feature extraction and prediction.
S1, text picture preprocessing, specifically comprising character recognition detection, text region interception, picture binarization, picture noise reduction and text picture correction. Picture binarization sets the gray value of every pixel in the image to either 0 or 255, giving the whole picture an obvious black-and-white effect; this step simplifies the picture and highlights the outlines of the characters. Picture noise reduction mainly removes image noise, reducing the interference introduced by the imaging equipment and the external environment during digitization and transmission. Image correction straightens the characters in the picture to make recognition easier.
S2, feature extraction module space structure design, the core of which is to define a search space. A deep learning model extracts features from the preprocessed text picture: a feature extraction network maps the corrected text picture into a high-dimensional vector space so as to represent the character information in the original picture, yielding a feature vector that represents the original character symbol information.
S3, the double-layer neural network search algorithm, which obtains a more suitable model structure and parameters through efficient neural network search in automatic machine learning; it comprises the feature extraction module space structure design and a double-layer search algorithm combining reinforcement learning with a differentiable structure.
S4, feature extraction and prediction, in which the automatically searched feature extraction network is used to extract and predict the picture features.
In this technical scheme, the part searched for the character recognition task is the space structure of the feature extraction module: its input is an original image acquired by the equipment, or an image corrected by the correction module, and its output is a feature vector sequence of uniform size. The search space is built around the design of the convolutional layers and covers both the type of each convolution and its stride; the convolution type is drawn from a table of candidate operations, and the convolution stride is reflected in the downsampling stages of the model.
In the above technical solution, in step S3, the double-layer neural network search algorithm is decoupled, according to the setting of the search space, into two steps: a downsampling path search based on reinforcement learning, and a convolution operation mode search based on differentiable optimization.
For the downsampling path search, all convolution operations are fixed to 3×3 residual network layers. Specifically, each downsampling position can be interchanged with the convolution blocks directly above and below it, and a reinforcement-learning-based controller automatically screens the positions of the downsampling convolution blocks. To accelerate the search, a re-parameterization trick is used: the parameters of the two exchanged convolution blocks are mapped onto each other, which allows the performance of a candidate structure to be evaluated quickly.
For the convolution operation mode search, a fully differentiable optimization method is adopted. By introducing a directed acyclic graph (DAG), video memory is saved effectively, alleviating the problems of high GPU memory occupation and overlong computation time. The search space is divided into several different types of cells; the whole network structure is formed by stacking and connecting cells of these types, and each cell type has a different convolution stride.
In the above technical solution, the automatic searching step in the step S4 is specifically as follows:
s41, randomly initializing the operation weight of each edge in the undirected graph in an alpha matrix;
s42, generating a sub-network structure according to the matrix parameters to carry out limited training to obtain a feedback index;
s43, jointly optimizing parameters of an alpha matrix and weight information of a neural network through dual-objective optimization;
s44, obtaining a final network structure from the learned mixed probability information after the convergence effect is achieved;
s45, carrying out complete training again based on the network structure;
In addition, in order to effectively balance network accuracy against complexity, the invention also introduces a complexity constraint term, which is added to the loss function to constrain the search process. The complexity constraint term is as follows:
[complexity constraint formula, available only as an image in the original]
L denotes the number of network layers, C denotes the number of candidate convolution operation modes, and the remaining parameters are task-related hyper-parameters.
after a feature extraction model is obtained by a mode of searching based on an efficient neural network structure in automatic machine learning, a corresponding prediction probability result can be obtained by training extracted feature vectors through a full connection layer.
By adopting the above scheme, the invention provides a text recognition method based on neural network search in automatic machine learning. It uses neural network architecture search to find a data-dependent sequence feature extractor and so complete the scene text recognition task. The conception is ingenious: first, a novel search space is designed for the text recognition problem, one that effectively covers both the convolution type and the convolution stride; then a new two-stage search algorithm is proposed that can efficiently search both the feature downsampling path and the operations. Subsequent experiments show that a backbone network searched with this method can greatly improve the performance of the text recognition network.
Drawings
Fig. 1 is a schematic flow chart illustrating steps of a text recognition method based on neural network search in automatic machine learning according to the present invention.
Fig. 2 is a schematic diagram of a computing resource allocation structure at different stages of a text recognition method based on neural network search in automatic machine learning according to the present invention.
FIG. 3 is a schematic diagram of a weight alpha matrix of a text recognition method based on neural network search in automatic machine learning according to the present invention.
Detailed Description
In order to facilitate an understanding of the invention, the invention is described in more detail below with reference to the accompanying drawings and specific examples. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As shown in fig. 1, in the text recognition method based on neural network search in automatic machine learning of the present application, the whole text recognition system first passes through a text image preprocessing stage. Next, a deep learning model extracts features from the preprocessed text image: the feature extraction network maps the corrected text image into a high-dimensional vector space so as to represent the character information in the original image, yielding a feature vector that represents the original character symbol information. The method obtains a more suitable model structure and parameters through efficient neural network search in automatic machine learning, comprising the space structure design of the feature extraction module and a double-layer search algorithm combining reinforcement learning with a differentiable structure. Finally, the automatically searched feature extraction network is used to extract and predict the picture features.
The method comprises the following steps:
(1) preprocessing correction module
Text picture preprocessing comprises picture binarization, noise reduction, image correction and similar operations. Picture binarization sets the gray value of every pixel in the image to either 0 or 255, giving the whole picture an obvious black-and-white effect. Noise reduction mainly removes image noise, reducing the interference introduced by the imaging equipment and the external environment during digitization and transmission. Image correction straightens the characters in the picture to make recognition easier.
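The binarization and noise-reduction steps above can be sketched in plain Python (the helper names and the fixed threshold are illustrative assumptions; a real pipeline would likely use an adaptive threshold such as Otsu's method):

```python
# Sketch of the preprocessing described above (hypothetical helper names).
# A grayscale image is a list of rows of 0-255 ints; binarization maps each
# pixel to pure black/white, and a 3x3 median filter removes isolated noise.

def binarize(img, threshold=128):
    """Set every gray value to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in img]

def median_denoise(img):
    """3x3 median filter; border pixels are kept as-is."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[yy][xx]
                            for yy in (y - 1, y, y + 1)
                            for xx in (x - 1, x, x + 1))
            out[y][x] = window[4]  # median of the 9 neighborhood values
    return out
```

A correction (rectification) step would follow these two operations before the picture is handed to the feature extraction module.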
(2) Spatial structure design of feature extraction module
The core of neural network architecture search is to define a search space. The part of the invention searched for the character recognition task is the space structure of the feature extraction module: its input is an original image collected by the equipment, or an image corrected by the correction module, and its output is a feature vector sequence of uniform size. The search space is built around the design of the convolutional layers, covering both the type of each convolution and its stride; the convolution type is drawn from a table of candidate operations, and the convolution stride is reflected in the downsampling stages of the model.
According to this two-fold decomposition of the convolutional layer design, the corresponding search space is divided into a downsampling path search space and an operation mode search space. In the downsampling path search space, the candidate convolution strides fall into three classes: [(2,2), (2,1), (1,1)]. Assume the network contains L convolutional layers and the input image has height and width (32, w). Then, for a target output size of (1, w/4), the path should include two (2,2) strides, three (2,1) strides, and L-5 (1,1) strides. The essence of this search space is the arrangement of these three kinds of convolution strides. Note that in conventional networks, a convolution with stride larger than 1 is often used as a marker separating sampling stages, and a group of operations sharing the same feature dimension is commonly called a stage. Each convolution operation can therefore be understood as a basic unit of computing resource, and searching the stride arrangement is in essence searching for the optimal allocation of computing resources among the stages while the total amount of computation stays unchanged. Fig. 2 shows a schematic diagram of the resource allocation between different stages; each dashed box, together with the blue boxes it contains, represents a single stage.
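The stride accounting above can be checked with a short sketch: five layers with stride 2 in height reduce 32 to 1, and the two layers that also have stride 2 in width reduce w to w/4. The example path below is one arbitrary valid arrangement for an assumed L = 10 layer network, not the searched optimum:

```python
# Verify the downsampling-path arithmetic: starting from a feature map of
# height 32 and width w, each layer applies a per-axis stride (sh, sw).

def output_size(height, width, strides):
    for sh, sw in strides:
        height //= sh
        width //= sw
    return height, width

# One valid arrangement for L = 10: two (2,2), three (2,1), and L-5 = 5
# identity-stride (1,1) layers, in some searched order.
path = [(2, 2), (1, 1), (2, 1), (1, 1), (2, 2),
        (2, 1), (1, 1), (2, 1), (1, 1), (1, 1)]
```

The search then amounts to permuting `path` — reallocating the (1,1) layers among the stages delimited by the stride-2 layers.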
For the operation mode search space, the invention designs C different convolution operations for each convolutional layer, denoted Op_i. Unlike manually designed networks, which simply stack the same convolution operation repeatedly, C candidate convolutions are placed at every stacked layer so they can be combined freely, which significantly enlarges the candidate search space. In the 2D search space, the elements of each row that are set to 1 are connected together to constitute one structural configuration in the search space. The aim of the search is to find a valid path in this 2D space that makes the character recognition effect optimal.
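The 2D configuration described above can be sketched as an L × C 0/1 matrix with exactly one 1 per row; connecting the chosen entries yields one concrete network. The candidate operation names below are illustrative, since the patent does not enumerate its operation table:

```python
# Sketch of the 2D operation search space: each of L layers selects one of
# C candidate convolutions Op_i. The op names here are assumptions.

CANDIDATE_OPS = ["conv3x3", "conv5x5", "depthwise3x3", "dilated3x3"]  # C = 4

def decode_path(config):
    """Turn a one-hot L x C matrix into the list of chosen operations."""
    assert all(sum(row) == 1 for row in config), "exactly one op per layer"
    return [CANDIDATE_OPS[row.index(1)] for row in config]
```

A configuration for a 3-layer network is then three one-hot rows, and the search explores the 4^3 possible paths through this grid.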
(3) Double-layer neural network search algorithm
For the search algorithm part, the invention decouples the algorithm, according to the setting of the search space, into two steps: a downsampling path search based on reinforcement learning, and a convolution operation mode search based on differentiable optimization.
For the downsampling path search, all convolution operations are fixed to 3×3 residual network layers. Initially, the downsampling positions are set at layers 2, 5, 8, 11 and 14 based on prior knowledge. Starting from this relatively good path, the invention reuses the path weights and fine-tunes, via reinforcement learning, the number of convolutions contained in each stage, finally obtaining the optimal downsampling path. The specific scheme is as follows: each downsampling position can be interchanged with the convolution blocks directly above and below it, and a reinforcement-learning-based controller (typically instantiated as an LSTM) automatically screens the positions of the downsampling convolution blocks. To accelerate the search, the invention uses a re-parameterization trick: the parameters of the two exchanged convolution blocks are mapped onto each other, which allows the performance of a candidate structure to be evaluated quickly.
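The block exchange can be sketched abstractly (this is an illustrative toy, not the patented re-parameterization itself): a downsampling block trades places with an adjacent plain block, carrying its parameters with it, so the candidate path can be scored without retraining from scratch.

```python
# Toy sketch of the controller's move: blocks are (stride, weight_tag) pairs,
# and a stride-2 (downsampling) block may swap with an adjacent stride-1 block.
# The weight tags stand in for the mapped convolution parameters.

def swap_downsample(layers, i, j):
    """Exchange the positions (and parameters) of adjacent blocks i and j."""
    assert abs(i - j) == 1, "only adjacent blocks may be exchanged"
    out = list(layers)
    out[i], out[j] = layers[j], layers[i]
    return out

# The controller proposes moving the stride-2 block at index 2 down to index 3.
layers = [(1, "w0"), (1, "w1"), (2, "w2"), (1, "w3")]
```

In the real system an RL controller would emit such swap actions and receive the quickly-evaluated accuracy of the resulting path as its reward.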
For the convolution operation mode search, a fully differentiable optimization method is adopted. By introducing a directed acyclic graph (DAG), video memory is saved effectively, alleviating the problems of high GPU memory occupation and overlong computation time. Specifically, the search space is divided into several different types of cells; the whole network structure is formed by stacking and connecting cells of these types, and each cell type has a different convolution stride. For convenience of description, a single cell is taken as an example. As a data structure, a cell is represented as a directed acyclic graph containing an ordered sequence of N nodes, where each node is a latent representation of information.
For convolutional neural networks, this latent representation can be understood as the feature information extracted from the picture, while an edge connecting two nodes represents a specified transformation operation. A sequence number is assigned to each node, meaning that the input of every intermediate node comes from the outputs of nodes with smaller sequence numbers; that is, for node i, its input is x^(i) = Σ_{j<i} o^(i,j)(x^(j)). This design reduces the search space from the structure and specification of the cell to the choice of operation on each edge. To facilitate the search, an alpha matrix is introduced, which can be visualized as the cube matrix of fig. 3: the x-axis of the cube is the starting node i of edge (i, j); the y-axis is the end node j of edge (i, j); and the z-axis is the operation op assigned to edge (i, j). After a softmax, the value at each position of the matrix is understood as the probability of selecting a particular operation. During training, the information transmitted from node i to node j is obtained according to the following formula.
o̅^(i,j)(x) = Σ_{o∈O} [ exp(α_o^(i,j)) / Σ_{o'∈O} exp(α_{o'}^(i,j)) ] · o(x)
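This softmax relaxation (the standard continuous relaxation used in differentiable architecture search) can be sketched numerically for scalar features; the helper names are illustrative:

```python
import math

# Information sent from node i to node j is a softmax-weighted mixture of all
# candidate operations on edge (i, j), weighted by the alpha logits.

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixed_op(x, alphas, ops):
    """o_bar(x) = sum_k softmax(alpha)_k * op_k(x), for scalar x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))
```

With equal logits every candidate contributes equally; as training sharpens one alpha entry, the mixture approaches the single chosen operation, which is how the discrete structure is read off after convergence.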
(4) Feature extraction and prediction
The optimal feature extraction network structure can be obtained by searching this network space; the whole search method can be summarized in the following steps:
1. the operation weight of each edge in the undirected graph is randomly initialized in an alpha matrix;
2. generating a sub-network structure according to the matrix parameters to carry out limited training to obtain a feedback index;
3. jointly optimizing parameters of an alpha matrix and weight information of a neural network through dual-objective optimization;
4. obtaining a final network structure from the learned mixed probability information after the convergence effect is achieved;
5. performing complete training again based on the network structure;
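The alternating optimization in steps 1-5 can be illustrated with a toy stand-in (an assumption for exposition: real architecture search alternates gradient steps on validation loss for alpha and training loss for the weights via autograd; here both objectives are simple quadratics):

```python
import random

# Toy sketch of the bilevel loop: architecture parameter alpha and network
# weight w are optimized jointly, alternating one step on each objective.

def bilevel_search(steps=200, lr=0.1):
    random.seed(0)
    alpha, w = random.random(), random.random()
    for _ in range(steps):
        w -= lr * 2 * (w - alpha)        # inner step: fit weights given alpha
        alpha -= lr * 2 * (alpha - 1.0)  # outer step: move alpha to its optimum (1.0 here)
    return alpha, w
```

After convergence, alpha is frozen, the most probable operations are read off, and the resulting discrete network is retrained from scratch (step 5).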
in addition, in order to effectively balance the performance of the network on precision and complexity, the invention also introduces a complexity constraint term and adds the complexity constraint term into the loss function to constrain the searching process. The complexity constraint term is shown below, where L represents the number of layers of the network, C represents the number of convolution operations, and the remaining parameters represent task-related hyper-parameters.
Figure RE-GDA0003568592460000091
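One plausible instantiation of such a constraint (an assumption — the patent's exact formula is reproduced only as an image) is the expected cost of the network: over L layers and C candidate convolutions, sum each operation's cost weighted by its softmax probability, scaled by a hyper-parameter:

```python
import math

# Hypothetical expected-cost penalty: alpha is an L x C matrix of architecture
# logits, op_costs gives the cost of each of the C candidate operations, and
# lam is a task-related trade-off hyper-parameter.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def complexity_term(alpha, op_costs, lam=0.01):
    total = 0.0
    for layer_logits in alpha:          # L layers
        probs = softmax(layer_logits)   # C selection probabilities
        total += sum(p * c for p, c in zip(probs, op_costs))
    return lam * total
```

Because the term is differentiable in alpha, it can be added directly to the search loss, steering the probabilities toward cheaper operations.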
After the feature extraction model is obtained through this efficient neural network architecture search in automatic machine learning, the extracted feature vectors are trained through a fully connected layer to obtain the corresponding prediction probability result.
Example 1
The following is a specific embodiment of the present invention:
the invention provides a text recognition method based on neural network search in automatic machine learning, which comprises the following specific processes:
(1) a preprocessing correction module:
firstly, the invoice text is subjected to operation processing such as binarization, noise reduction, image correction and the like. The picture binarization method includes setting gray values of all pixels in an image to be 0-255, and enabling the whole picture to have an obvious black-white effect. And then, the picture is sent to a noise reduction module for processing, mainly removing the noise interference of the picture, and reducing the noise interference of the imaging equipment and the external environment received by the image in the digitization and transmission processes. And finally, the characters in the picture are corrected by using an image correction algorithm, so that the identification is convenient.
(2) The space structure design of the feature extraction module:
secondly, the result after the image preprocessing is sent to a feature extraction module, the part of the character recognition task for searching is the space structure design of the feature extraction module, the input of the character recognition task is the original image collected by equipment or the image corrected by a correction module, and the character recognition task is output as a feature vector sequence with uniform size. The search space is spread around the convolutional layer design, which includes the type of convolution and the step size of the convolution. According to the two-stage splitting of the convolutional layer, the corresponding search space is divided into a downsampled path search space and an operation mode search space.
In the downsampling path search space, the candidate convolution strides fall into three classes: [(2,2), (2,1), (1,1)]. The invention designs C different convolution operations for each convolutional layer, denoted Op_i. Unlike manually designed networks, which simply stack the same convolution operation repeatedly, C candidate convolutions are placed at every stacked layer so they can be combined freely, which significantly enlarges the candidate search space. In the 2D search space, the elements of each row that are set to 1 are connected together to constitute one structural configuration in the search space. The aim of the search is to find a valid path in this 2D space that makes the character recognition effect optimal.
(3) Double-layer neural network search:
aiming at the search algorithm part, the invention decouples the search algorithm into two steps according to the setting of the search space: the method comprises the steps of downsampling path search based on reinforcement learning and convolution operation mode search based on differentiability.
For the downsampling path search, all convolution operations are fixed to 3×3 residual network layers. Starting from a relatively good path, the invention reuses the path weights and fine-tunes, via reinforcement learning, the number of convolutions contained in each stage, finally obtaining the optimal downsampling path. The specific scheme is as follows: each downsampling position can be interchanged with the convolution blocks directly above and below it, and a reinforcement-learning-based controller (typically instantiated as an LSTM) automatically screens the positions of the downsampling convolution blocks. To accelerate the search, the invention uses a re-parameterization trick: the parameters of the two exchanged convolution blocks are mapped onto each other, which allows the performance of a candidate structure to be evaluated quickly.
For the convolution operation mode search, a fully differentiable optimization method is adopted; by introducing a directed acyclic graph (DAG), video memory is saved effectively, alleviating the problems of high GPU memory occupation and overlong computation time.
(4) Feature extraction and prediction:
the most characteristic extraction network structure can be obtained through the network space result search. And after the optimal structure is obtained, retraining on the complete data set to obtain a final model and predicting.
The technical features described above may be combined with one another to form further embodiments not enumerated here, all of which are regarded as falling within the scope of the invention described in this specification. Moreover, modifications and variations may be suggested to those skilled in the art in light of the above teachings, and all such modifications and variations are intended to fall within the true spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. A text recognition method based on neural network search in automatic machine learning, characterized by comprising four steps: text picture preprocessing, feature extraction module space structure design, a double-layer neural network search algorithm, and feature extraction and prediction;
S1, text picture preprocessing, specifically comprising character recognition detection, text region interception, picture binarization, picture noise reduction and text picture correction; picture binarization sets the gray value of every pixel in the image to either 0 or 255, giving the whole picture an obvious black-and-white effect, simplifying the picture and highlighting the outlines of the characters; picture noise reduction mainly removes image noise, reducing the interference introduced by the imaging equipment and the external environment during digitization and transmission; image correction straightens the characters in the picture to make recognition easier;
S2, feature extraction module space structure design, the core of which is to define a search space; a deep learning model extracts features from the preprocessed text picture, and a feature extraction network maps the corrected text picture into a high-dimensional vector space so as to represent the character information in the original picture, yielding a feature vector that represents the original character symbol information;
S3, the double-layer neural network search algorithm, which obtains a more suitable model structure and parameters through efficient neural network search in automatic machine learning, and comprises the feature extraction module space structure design and a double-layer search algorithm combining reinforcement learning with a differentiable structure;
S4, feature extraction and prediction, in which the automatically searched feature extraction network is used to extract and predict the picture features.
2. The text recognition method based on neural network search in automatic machine learning according to claim 1, characterized in that the part searched for the character recognition task is the space structure design of the feature extraction module, whose input is an original image collected by the equipment or an image corrected by the correction module, and whose output is a feature vector sequence of uniform size; the search space is built around the design of the convolutional layers and covers both the type of each convolution and its stride; the convolution type is drawn from a table of candidate operations, and the convolution stride is reflected in the downsampling stages of the model.
3. The text recognition method based on neural network search in automatic machine learning according to claim 1, wherein in step S3 the two-level neural network search algorithm is decoupled into two steps according to the design of the search space: a downsampling-path search based on reinforcement learning, and a differentiable search over convolution operation modes;
for the downsampling-path search, every convolution operation is fixed to a 3 x 3 residual layer; specifically, each downsampling position may be swapped with the convolution blocks immediately above and below it, and a reinforcement learning controller automatically selects the positions of the downsampling convolution blocks; to accelerate the search, a re-parameterization technique maps the parameters of the two swapped convolution blocks onto each other, so that the performance of a candidate structure can be evaluated quickly;
for the search over convolution operation modes, a fully differentiable optimization method is adopted; by introducing a directed acyclic graph (DAG), GPU memory is used efficiently, alleviating the problems of high memory occupation and excessive computation time; the search space is divided into several different types of cells, the whole network structure is formed by stacking and connecting these cell types, and each cell type has a different convolution stride.
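The differentiable step relies on a continuous relaxation: on each DAG edge, the discrete choice among candidate operations is replaced by a softmax-weighted mixture, so architecture parameters can be optimized by gradient descent. A minimal NumPy sketch of that mixed operation (a generic DARTS-style relaxation, assumed here rather than taken from the patent):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def mixed_op(x: np.ndarray, alphas: np.ndarray, ops) -> np.ndarray:
    """Continuous relaxation of one DAG edge: the output is the
    softmax(alpha)-weighted sum of every candidate op applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))
```

As one alpha logit grows much larger than the others, the mixture collapses toward that single operation, which is what makes the later argmax discretization reasonable.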
4. The text recognition method based on neural network search in automatic machine learning according to claim 1, wherein the automatic search in step S4 specifically comprises:
S41, randomly initializing, in an alpha matrix, the operation weight of each edge of the directed acyclic graph;
S42, generating a sub-network structure from the matrix parameters and training it briefly to obtain a feedback metric;
S43, jointly optimizing the parameters of the alpha matrix and the weights of the neural network through bi-level optimization;
S44, after convergence, deriving the final network structure from the learned mixture probabilities;
S45, fully retraining the derived network structure;
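The loop S41-S45 can be sketched end to end on a deliberately tiny problem: one DAG edge, two candidate operations (a "zero" op and a scaling op), and alternating gradient steps on the weights and on alpha. This is a toy illustration of the bi-level search, with finite-difference gradients standing in for backpropagation; none of the constants come from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# S41: randomly initialise the alpha matrix (one edge, two candidate ops).
alpha = rng.normal(size=(1, 2))
w = np.zeros(2)                     # one trainable weight per candidate op

def softmax(a: np.ndarray) -> np.ndarray:
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy setting: op 0 always outputs zero, op 1 scales the input,
# so only op 1 can fit the target and the search should prefer it.
x_train, y_train = 1.0, 3.0

def predict(x, w, probs):
    return probs[0] * 0.0 + probs[1] * (w[1] * x)

for step in range(200):
    probs = softmax(alpha[0])
    # S42/S43, weight step: gradient descent on the training loss w.r.t. w.
    residual = predict(x_train, w, probs) - y_train
    w[1] -= 0.1 * 2 * residual * probs[1] * x_train
    # S43, alpha step: finite-difference gradient of the loss w.r.t. alpha
    # (standing in for the validation loss of true bi-level optimization).
    eps = 1e-4
    base = (predict(x_train, w, softmax(alpha[0])) - y_train) ** 2
    grads = np.zeros(2)
    for k in range(2):
        bumped = alpha.copy()
        bumped[0, k] += eps
        loss_k = (predict(x_train, w, softmax(bumped[0])) - y_train) ** 2
        grads[k] = (loss_k - base) / eps
    alpha[0] -= 0.1 * grads

best_op = int(np.argmax(alpha[0]))  # S44: discretise by argmax
# S45: the discretised architecture would now be retrained from scratch.
```

After the loop, alpha has learned to favor the only operation that can fit the data, which is the behavior the argmax discretization in S44 depends on.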
in addition, to effectively balance the network's accuracy against its complexity, the invention also introduces a complexity constraint term that is added to the loss function to constrain the search process; the complexity constraint term is as follows:
[Complexity constraint term given as formula image FDA0003376927760000031 in the original filing.]
where L denotes the number of network layers, C denotes the number of convolution operation modes, and the remaining parameters are task-related hyper-parameters;
after the feature extraction model is obtained through the efficient neural-architecture-search procedure of automatic machine learning, the extracted feature vectors are passed through a fully connected layer during training to produce the corresponding prediction probabilities.
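The final prediction head is a standard fully connected layer followed by a softmax; a minimal NumPy sketch (shapes and the per-timestep layout are assumptions for illustration):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_chars(features: np.ndarray, W: np.ndarray,
                  b: np.ndarray) -> np.ndarray:
    """Fully connected layer + softmax: maps each feature vector in the
    sequence (rows of `features`) to a probability distribution over the
    character classes."""
    return np.apply_along_axis(softmax, 1, features @ W + b)
```

Each row of the result sums to one, so it can be read directly as the per-position character probabilities mentioned in the claim.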
CN202111420015.XA 2021-11-26 2021-11-26 Text recognition method based on neural network search in automatic machine learning Pending CN114419630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111420015.XA CN114419630A (en) 2021-11-26 2021-11-26 Text recognition method based on neural network search in automatic machine learning


Publications (1)

Publication Number Publication Date
CN114419630A true CN114419630A (en) 2022-04-29

Family

ID=81264812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111420015.XA Pending CN114419630A (en) 2021-11-26 2021-11-26 Text recognition method based on neural network search in automatic machine learning

Country Status (1)

Country Link
CN (1) CN114419630A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943975A (en) * 2022-05-10 2022-08-26 山东大学 Multi-modal question searching method and system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN110674326A (en) * 2019-08-06 2020-01-10 厦门大学 Neural network structure retrieval method based on polynomial distribution learning
CN113065413A (en) * 2021-03-12 2021-07-02 国网河北省电力有限公司 Scene character recognition method and device based on neural network search

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN110674326A (en) * 2019-08-06 2020-01-10 厦门大学 Neural network structure retrieval method based on polynomial distribution learning
CN113065413A (en) * 2021-03-12 2021-07-02 国网河北省电力有限公司 Scene character recognition method and device based on neural network search

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943975A (en) * 2022-05-10 2022-08-26 山东大学 Multi-modal question searching method and system based on deep learning

Similar Documents

Publication Publication Date Title
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN110598788B (en) Target detection method, target detection device, electronic equipment and storage medium
CN108491856B (en) Image scene classification method based on multi-scale feature convolutional neural network
CN111369430B (en) Mobile terminal portrait intelligent background replacement method based on mobile deep learning engine
CN118521659A (en) Image generation using subdivision scaling and depth scaling
CN110780923A (en) Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN111444365A (en) Image classification method and device, electronic equipment and storage medium
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113901928A (en) Target detection method based on dynamic super-resolution, and power transmission line component detection method and system
CN116030454B (en) Text recognition method and system based on capsule network and multi-language model
CN113469287A (en) Spacecraft multi-local component detection method based on instance segmentation network
CN114419630A (en) Text recognition method based on neural network search in automatic machine learning
CN115187456A (en) Text recognition method, device, equipment and medium based on image enhancement processing
Cong et al. CAN: Contextual aggregating network for semantic segmentation
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
CN114373110A (en) Method and device for detecting target of input image and related products
CN113469238A (en) Self-supervision learning method for solving puzzle task based on CRNN
CN117934970A (en) Image classifier optimization method based on local features and position attention
CN111738255A (en) Guideboard text detection and recognition algorithm based on deep learning
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN114219757B (en) Intelligent damage assessment method for vehicle based on improved Mask R-CNN
CN116469172A (en) Bone behavior recognition video frame extraction method and system under multiple time scales
CN113065413A (en) Scene character recognition method and device based on neural network search
CN116245765A (en) Image denoising method and system based on enhanced depth expansion convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination