WO2016197381A1 - Methods and apparatus for recognizing text in an image
- Publication number: WO2016197381A1
- Authority: WIPO (PCT)
- Prior art keywords: sequence, cnn, image, layer, last
Classifications
- G06N3/08—Computing arrangements based on biological models; neural networks; learning methods
- G06N3/044—Neural network architectures; recurrent networks, e.g. Hopfield networks
- G06N3/045—Neural network architectures; combinations of networks
- G06V30/194—Character recognition using electronic means; references adjustable by an adaptive method, e.g. learning
- G06V30/10—Character recognition
Abstract
Methods and apparatus for recognizing text in an image are disclosed. According to an embodiment, the method comprises encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
Description
This application relates to text recognition, and in particular to methods and apparatus for recognizing text in an image.
Text recognition in natural images has received increasing attention in computer vision due to its numerous practical applications. This problem includes two sub-tasks, namely text detection and text-line/word recognition. The main difficulty arises from the large diversity of text patterns (e.g. low resolution, low contrast, and blurring) and highly complicated background clutter. Consequently, individual character segmentation or separation is extremely challenging.
Most previous studies focus on developing powerful character classifiers, some of which incorporate an additional language model, leading to state-of-the-art performance. These approaches mainly follow the basic pipeline of conventional OCR techniques: a character-level segmentation first, followed by an isolated character classifier and post-processing for recognition. Several approaches adopt deep neural networks for representation learning, but recognition is still confined to character-level classification. All current successful scene text recognition systems are built mostly on isolated character classifiers. Their performance is thus severely harmed by the difficulty of character-level segmentation or separation. Importantly, recognizing each character independently discards meaningful context information of the words, significantly reducing reliability and robustness.
Summary
According to an aspect of the present application, a method for recognizing text in an image comprises: encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
According to another aspect of the present application, an apparatus for recognizing text in an image comprises: a convolutional neural network (CNN) encoding the image into a first sequence, wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; and a recurrent neural network (RNN) decoding the first sequence into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; wherein the RNN further maps the second sequence into a word string in which repeated labels and non-character labels are removed.
Drawings
Fig. 1 is a flowchart of a method for recognizing text in an image according to an embodiment of the present application.
Fig. 2 illustrates an overall pipeline of an apparatus or a system for recognizing text in an image according to an embodiment of the present application.
Fig. 3 illustrates a structure of a five-layer Maxout CNN model used in an embodiment of the present application.
Fig. 4 illustrates a structure of an RNN model used in an embodiment of the present application.
Fig. 5 illustrates a structure of the memory cells in an RNN model used in an embodiment of the present application.
Embodiments of the present application are described below in detail with reference to the accompanying drawings.
Fig. 1 illustrates a flowchart of a method 100 for recognizing text in an image according to an embodiment of the present application. As shown in Fig. 1, at step S101, an image with characters is encoded into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN. At step S102, the first sequence is decoded with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence. At step S103, the second sequence is mapped into a word string in which repeated labels and non-character labels are removed.
According to the embodiment, no character segmentation is needed. Instead, the output from the second-to-last convolutional layer of the CNN is obtained and directly used as an input to the RNN for text recognition, so that the RNN's advantage in retaining meaningful interdependencies of continuous text is exploited during the process.
In an embodiment, the CNN may perform convolution on the image as a whole. In this case, the result of the convolution obtained from the second-to-last convolutional layer of the CNN is the first sequence that is to be used as the input of the RNN.
Alternatively, the CNN may use a sliding window to scan the word image densely from left to right and divide the image into continuous segments. Note that such segments are not the same as those obtained by character segmentation: the window simply scans and divides, without identifying individual characters. The segments are convolved individually by the CNN. The results of the convolution obtained from the second-to-last convolutional layer of the CNN are components which collectively form the first sequence that is to be used as the input of the RNN.
Fig. 2 illustrates an overall pipeline of an apparatus or a system for recognizing text in an image according to an embodiment of the present application. As shown in the top box, an image with a character string “apartment” is divided into segments of the same size, e.g. 32×32. Such division is implemented by scanning with a sliding window. The division involves no character identification work. Each segment may comprise one or more complete or incomplete characters, or may comprise no character at all, as can be seen from the second row in the top box.
The middle box in Fig. 2 shows a CNN network, which performs convolution on each of the segments sequentially. As is known, a CNN may comprise several layers. In the prior art, some text recognition technologies have used a CNN for isolated character recognition, in which a character label is output at the last layer of the CNN. According to the present application, the output from the second-to-last layer of the CNN is used instead. This output has 128 feature maps, each of which includes a single neuron. For example, a segment with a size of 32×32 may generate a 128D output from the second-to-last layer of the CNN. For an image that can be divided by a sliding window into T segments, T 128D outputs may be generated, where T is a positive integer that varies with the aspect ratio of the image. The output sequence represents high-level deep features of the input image.
The bottom box in Fig. 2 shows an RNN network, which decodes the output sequence from the CNN. As can be seen, the RNN has the same number of channels as the CNN. However, unlike the CNN, in which each channel works individually, the sequential channels in the RNN are connected and interact through internal states of the RNN in the hidden layers. With this configuration, for each component in the sequence output from the CNN, estimated probabilities over all possible characters are output, taking into account the relationship with both the previous component (if any) and the following component (if any) in the sequence. The estimated probabilities for each component in the sequence output from the CNN (and thus for each segment of the image) are then considered together and mapped into the word string “apartment”, in which repeated labels and non-character labels are removed.
According to an embodiment, the image may be resized to fit the CNN or the sliding window so that it can be properly processed and recognized. For example, for a sliding window with a size of 32×32, the image may be resized to a height of 32 while keeping its original aspect ratio unchanged.
Although Fig. 2 shows the case in which a sliding window is used, the sliding window is not necessary in other embodiments. In that case, the image is convolved as a whole, and the output from the second-to-last layer of the CNN is a 128×T matrix, which is equivalent to the sequence obtained by connecting T 128D outputs together.
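For illustration, a minimal sketch of this pre-processing, assuming OpenCV and NumPy; the 4-pixel window stride is an assumption, since the description only states that the window scans densely from left to right:

```python
import cv2
import numpy as np

def make_segments(image: np.ndarray, win: int = 32, stride: int = 4) -> list:
    # Resize to height 32 while keeping the original aspect ratio.
    h, w = image.shape[:2]
    new_w = max(win, round(w * win / h))
    resized = cv2.resize(image, (new_w, win))   # dsize is (width, height)
    # Scan a win x win window from left to right; segments may cut through
    # characters or contain none, since no character segmentation is done.
    return [resized[:, x:x + win] for x in range(0, new_w - win + 1, stride)]
```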
Before describing the CNN and RNN in detail, the process of word image recognition is formulated as a sequence labelling problem. The probability of the correct word string (S_w) is maximized given an input image (I) as follows:

θ* = arg max_θ Σ_{(I, S_w) ∈ Ω} log P(S_w | I; θ)   (1)

where θ are the parameters of the recurrent system, and (I, S_w) ∈ Ω is a sample pair from a training set Ω, where S_w is the ground truth word string (containing K characters) of the image I. Commonly, the chain rule is applied to model the joint probability over S_w:

log P(S_w | I; θ) = Σ_{k=1}^{K} log P(S_w^k | I, S_w^1, ..., S_w^{k-1}; θ)   (2)

Thus the sum of the log probabilities over all sample pairs in the training set Ω is optimized to learn the model parameters. An RNN is developed to model the sequential probabilities P(S_w^k | I, S_w^1, ..., S_w^{k-1}), where the variable number of sequentially conditioned characters is expressed by an internal state of the RNN in its hidden layer, h_t. This internal state is updated when the next sequential input x_t is presented, by computing a non-linear function H:

h_{t+1} = H(h_t, x_t)   (3)

where the non-linear function H defines the exact form of the proposed recurrent system, and X = {x_1, x_2, ..., x_T} is the sequence of CNN features computed from the word image. The designs of x_t and H play crucial roles in the proposed system. A CNN model is developed to generate the sequential x_t, and H is defined with a long short-term memory (LSTM) architecture.
Both the CNN and the RNN are trained in advance, as described in detail below.
To better understand and implement embodiments of the application, a five-layer Maxout CNN and a bidirectional long short-term memory (LSTM) based RNN are used in the following illustrative example. The LSTM-based RNN may further comprise a connectionist temporal classification (CTC) layer. It is noted that other kinds of CNN and/or RNN may also be used to implement the application.
Fig. 3 illustrates a structure of a five-layer Maxout CNN model used in an embodiment of the present application. As shown, the basic pipeline is to compute a point-wise maximum through a number of grouped feature maps or channels. For example, an input image/segment has a size of 32×32, corresponding to the size of the sliding window. The Maxout CNN network has five convolutional layers, each of which is followed by a two-group or four-group Maxout operation, with different numbers of feature maps, i.e. 48, 64, 128, 128 and 36 respectively. During the convolution, no pooling operation is involved, and the output maps of the last two convolutional layers are just one pixel each. This allows the CNN to convolve whole word images at once, leading to significant computational efficiency. Each word image may be resized to a height of 32 while keeping its original aspect ratio unchanged. By applying the learned CNN to the resized image, a 128D CNN sequence may be obtained directly from the output of the second-to-last convolutional layer. This operation is similar to computing deep features independently with a sliding window moved densely through the image, but is far more computationally efficient. The Maxout CNN may be trained on 36 classes of case-insensitive character sample images, comprising 26 letters and 10 digits.
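For illustration, a minimal sketch of the Maxout operation described above, assuming PyTorch; reading "two-group" as a point-wise maximum over groups of two consecutive feature maps is an assumption, since the exact grouping convention is not spelled out here:

```python
import torch

def maxout(x: torch.Tensor, pieces: int) -> torch.Tensor:
    # Point-wise maximum over groups of `pieces` consecutive feature maps,
    # reducing (N, C, H, W) to (N, C // pieces, H, W).
    n, c, h, w = x.shape
    return x.view(n, c // pieces, pieces, h, w).max(dim=2).values
```

Under this reading, a convolutional layer intended to yield, say, 48 Maxout feature maps with two-piece groups would output 96 raw channels before the maximum is taken.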
As mentioned above, for an image divided by a sliding window into T segments, the output from the CNN to the RNN is a matrix X = {x_1, x_2, ..., x_T}, in which each x_t is a 128D vector.
Fig. 4 illustrates a structure of an RNN model used in an embodiment of the present application. The RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence from the input CNN sequence, and the CTC layer generates the word string from the second sequence.
As shown, the bidirectional LSTM has two separate LSTM hidden layers that process the input sequence forward and backward, respectively. Both hidden layers are connected to the same input and output layers.
The main shortcoming of the standard RNN is the vanishing gradient problem, which makes it hard to transmit gradient information consistently over long time spans. This is a crucial issue in designing an RNN model, and the long short-term memory (LSTM) was proposed specifically to address this problem. The LSTM defines a new neuron or cell structure in the hidden layer with three additional multiplicative gates: the input gate, the forget gate and the output gate. These new cells are referred to as memory cells, and they allow the LSTM to learn meaningful long-range interdependencies. The structure of the memory cells is depicted in Fig. 5, where σ is the logistic sigmoid function, realizing the non-linearity of the LSTM. The cell activation is a summation of the previous cell activation and the input modulation, which are controlled by the forget gate and the input gate respectively. These two gates trade off the influence of the previous memory cell against the current input information. The output gate controls how much of the cell activation is transferred to the final hidden state. Each LSTM hidden layer includes 128 LSTM memory cells, each of which has the structure shown in Fig. 5.
Sequence labelling of varying lengths is processed by applying the LSTM memory recurrently to each sequential input x_t (t is an integer from 1 to T), such that all LSTMs share the same parameters. The hidden state of the LSTM, h_t, is fed to the LSTM at the next input x_{t+1}. It is also used to compute the current output, which is transformed into the estimated probabilities over all possible characters. The LSTM finally generates a sequence of estimations with the same length as the input sequence, p = {p_1, p_2, p_3, ..., p_T}.
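For illustration, a minimal sketch of such a bidirectional LSTM decoder, assuming PyTorch; the sizes (128D inputs, 128 memory cells per direction, 37 output classes) follow the example given later in this description, while the class name and layout are otherwise hypothetical:

```python
import torch
import torch.nn as nn

class BLSTMDecoder(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 128, classes: int = 37):
        super().__init__()
        # Two separate hidden layers process the sequence forward and
        # backward; their states are concatenated per time step.
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True)
        # Concatenated 256D states -> 36 characters + 1 non-character label.
        self.fc = nn.Linear(2 * hidden, classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: CNN sequence of shape (T, N, 128); returns (T, N, 37)
        # log-probabilities over the possible labels at each time step.
        h, _ = self.blstm(x)
        return self.fc(h).log_softmax(dim=-1)
```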
Due to the unsegmented nature of the word image at the character level, the length of the LSTM outputs (T) is not consistent with the length of a target word string, |S_w| = K. This makes it difficult to train the RNN directly with the target strings. To this end, connectionist temporal classification (CTC) is applied to approximately map the LSTM sequential output (p) into its target string as follows:

S_w ≈ B(arg max_π P(π | p))   (5)
where the projection B removes the repeated labels and the non-character labels. For example, B("-gg-o-oo-dd-") = "good". The CTC looks for an approximately optimized path (π) with maximum probability through the LSTM output sequence, which aligns the differing lengths of the LSTM sequence and the word string.
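The projection B itself is straightforward to express in code; a minimal sketch in plain Python, with "-" standing for the non-character label:

```python
def ctc_projection(path: str, blank: str = "-") -> str:
    # Collapse runs of repeated labels, then drop the non-character label.
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

assert ctc_projection("-gg-o-oo-dd-") == "good"
```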
The CTC is specifically designed for sequence labelling tasks in which it is hard to pre-segment the input sequence into segments that exactly match a target sequence. In our RNN model, the CTC layer is directly connected to the outputs of the LSTMs and works as the output layer of the whole RNN. It not only allows the model to avoid complicated post-processing (e.g. transforming the LSTM output sequence into a word string), but also makes it possible to train the model in an end-to-end fashion by minimizing an overall loss function over (X, S_w) ∈ Ω. The overall loss is computed as the sum of the negative log likelihoods of the true word strings:

O = − Σ_{(X, S_w) ∈ Ω} log P(S_w | X)   (6)
Finally, the RNN model according to the present application follows a bidirectional LSTM architecture, as shown in Fig. 4. It has two separate LSTM hidden layers that process the input sequence forward and backward, respectively. Both hidden layers are connected to the same output layers, allowing the model to access both past and future information in the sequence.
In an example, the CNN model according to the present application is trained on about 1.8×10^5 character images, and the CNN sequence is generated by applying the trained CNN with a sliding window on the word images, followed by a column-wise normalization. The RNN model contains a bidirectional LSTM architecture. Each LSTM layer has 128 LSTM memory cell blocks. The input layer of our RNN model has 128 neurons (corresponding to the dimensions of the CNN sequence, x_t ∈ R^128), which are fully connected to both hidden layers. The outputs of the two hidden layers are concatenated and then fully connected to the output layer of the LSTM with 37 output classes (including an additional non-character class), using a softmax function. In total, our RNN model has 273,445 parameters, which are initialized with a Gaussian distribution of mean 0 and standard deviation 0.01 in the training process.
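For illustration, the reported total of 273,445 parameters is consistent with an LSTM variant that includes peephole connections; attributing the count to peepholes is an inference, not something stated here:

```python
# Per direction: 4 gates, each with input weights (128x128), recurrent
# weights (128x128) and a bias (128), plus 3 diagonal peephole vectors.
per_direction = 4 * (128 * 128 + 128 * 128 + 128) + 3 * 128   # 131,968
# Concatenated forward/backward states (256D) -> 37 softmax classes.
output_layer = (2 * 128) * 37 + 37                            # 9,509
print(2 * per_direction + output_layer)                       # 273445
# Without peepholes (as in PyTorch's nn.LSTM) the total would be 272,677.
```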
The recurrent model is trained with steepest descent. The parameters are updated per training sequence, using a learning rate of 10^-4 and a momentum of 0.9. Each input sequence is randomly selected from the training set. A forward-backward algorithm is performed to jointly optimize the bidirectional LSTM and CTC parameters: a forward propagation is implemented through the whole network, followed by a forward-backward algorithm that aligns the ground truth word strings to the LSTM output maps, π ∈ B^{-1}(S_w), with p ∈ R^{37×T}. The loss function of Eq. (6) is then computed approximately over the aligned path:

P(S_w | X) ≈ P(π | X) = Π_{t=1}^{T} p_{π_t}^t   (7)

where p_{π_t}^t is the estimated probability of label π_t at time step t.
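For illustration, a minimal sketch of one such training step, assuming PyTorch; nn.CTCLoss performs the forward-backward alignment internally, BLSTMDecoder is the sketch given earlier, and the data here is synthetic:

```python
import torch
import torch.nn as nn

model = BLSTMDecoder()                       # sketch defined earlier
ctc = nn.CTCLoss(blank=36)                   # assume index 36 is the non-character label
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

x = torch.randn(20, 1, 128)                  # one CNN sequence with T = 20
target = torch.randint(0, 36, (1, 9))        # e.g. the 9 character labels of "apartment"

opt.zero_grad()
log_probs = model(x)                         # (T, N, 37) log-probabilities
loss = ctc(log_probs, target,
           torch.tensor([20]),               # input (LSTM output) lengths
           torch.tensor([9]))                # target string lengths
loss.backward()                              # error from the aligned paths
opt.step()
```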
Finally, the approximated error is propagated backward to update the parameters. The RNN is trained on about 3,000 word images, taken from the training sets of the three benchmarks mentioned in the next section.
The performance of the proposed text recognition solution is compared against state-of-the-art methods on three standard benchmarks for cropped word image recognition. The experimental results demonstrate that the method and apparatus of the present application perform well in recognizing words in images when trained with a relatively small number of samples.
Although the preferred embodiments of the present invention have been described, those skilled in the art may make many modifications and changes once they learn of the basic inventive concepts. The appended claims are intended to be construed as comprising these preferred embodiments and all changes and modifications falling within the scope of the present invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit and scope of the present invention. Thus, if such modifications and variations lie within the spirit and principle of the present application, the present invention is intended to include them.
Claims (18)
- A method for recognizing text in an image, comprising: encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
- The method of claim 1, wherein encoding the image into a first sequence with a CNN comprises: convolving the image as a whole with the CNN, wherein a result of the convolution obtained from the second-to-last convolutional layer of the CNN is the first sequence.
- The method of claim 1, wherein encoding the image into a first sequence with a CNN comprises: applying a sliding window to the image to divide the image into segments having a same size; and convolving the segments, individually and sequentially, with the CNN, wherein results of the convolution obtained from the second-to-last convolutional layer of the CNN are components forming the first sequence.
- The method of claim 1, further comprising, prior to the step of encoding: resizing the image to have a predetermined size.
- The method of claim 4, wherein the CNN has been trained with image samples having the predetermined size and outputs 36 classes of different characters at the last layer of the CNN.
- The method of claim 1, wherein the output from the second-to-last convolutional layer of the CNN is just one neuron.
- The method of claim 1, wherein the RNN has been trained with a set of convolutional sequences and corresponding word strings.
- The method of claim 1, wherein the CNN comprises a Maxout CNN.
- The method of claim 1, wherein the RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence, and the CTC layer generates the word string.
- An apparatus for recognizing text in an image, comprising: a convolutional neural network (CNN) encoding the image into a first sequence, wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; and a recurrent neural network (RNN) decoding the first sequence into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; wherein the RNN further maps the second sequence into a word string in which repeated labels and non-character labels are removed.
- The apparatus of claim 10, wherein the CNN encodes the image into a first sequence by: convolving the image as a whole with the CNN, wherein a result of the convolution obtained from the second-to-last convolutional layer of the CNN is the first sequence.
- The apparatus of claim 10, wherein the CNN encodes the image into a first sequence by: applying a sliding window to the image to divide the image into segments having a same size; and convolving the segments, individually and sequentially, with the CNN, wherein results of the convolution obtained from the second-to-last convolutional layer of the CNN are components forming the first sequence.
- The apparatus of claim 10, wherein the image has been resized to have a predetermined size before being input to the CNN.
- The apparatus of claim 13, wherein the CNN has been trained with image samples having the predetermined size and outputs 36 classes of different characters at the last layer of the CNN.
- The apparatus of claim 10, wherein the output from the second-to-last convolutional layer of the CNN is just one neuron.
- The apparatus of claim 10, wherein the RNN has been trained with a set of convolutional sequences and corresponding word strings.
- The apparatus of claim 10, wherein the CNN comprises a Maxout CNN.
- The apparatus of claim 10, wherein the RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence, and the CTC layer generates the word string.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201580080720.6A CN107636691A (en) | 2015-06-12 | 2015-06-12 | Method and apparatus for identifying the text in image |
PCT/CN2015/081308 WO2016197381A1 (en) | 2015-06-12 | 2015-06-12 | Methods and apparatus for recognizing text in an image |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/081308 WO2016197381A1 (en) | 2015-06-12 | 2015-06-12 | Methods and apparatus for recognizing text in an image |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016197381A1 true WO2016197381A1 (en) | 2016-12-15 |
Family
ID=57502873
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/081308 WO2016197381A1 (en) | 2015-06-12 | 2015-06-12 | Methods and apparatus for recognizing text in an image |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107636691A (en) |
WO (1) | WO2016197381A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388896B (en) * | 2018-02-09 | 2021-06-22 | 杭州雄迈集成电路技术股份有限公司 | License plate identification method based on dynamic time sequence convolution neural network |
CN108682418B (en) * | 2018-06-26 | 2022-03-04 | 北京理工大学 | Speech recognition method based on pre-training and bidirectional LSTM |
CN109214378A (en) * | 2018-08-16 | 2019-01-15 | 新智数字科技有限公司 | A kind of method and system integrally identifying metering meter reading based on neural network |
TWI677826B (en) * | 2018-09-19 | 2019-11-21 | 國家中山科學研究院 | License plate recognition system and method |
CN109784340A (en) * | 2018-12-14 | 2019-05-21 | 北京市首都公路发展集团有限公司 | A kind of licence plate recognition method and device |
CN109726657B (en) * | 2018-12-21 | 2023-06-09 | 万达信息股份有限公司 | Deep learning scene text sequence recognition method |
CN109919150A (en) * | 2019-01-23 | 2019-06-21 | 浙江理工大学 | A kind of non-division recognition sequence method and system of 3D pressed characters |
CN110188761A (en) * | 2019-04-22 | 2019-08-30 | 平安科技(深圳)有限公司 | Recognition methods, device, computer equipment and the storage medium of identifying code |
CN113450433B (en) * | 2020-03-26 | 2024-08-16 | 阿里巴巴集团控股有限公司 | Picture generation method, device, computer equipment and medium |
CN112232195B (en) * | 2020-10-15 | 2024-02-20 | 北京临近空间飞行器系统工程研究所 | Handwritten Chinese character recognition method, device and storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060045341A1 (en) * | 2004-08-31 | 2006-03-02 | Samsung Electronics Co., Ltd. | Apparatus and method for high-speed character recognition |
CN1694130A (en) * | 2005-03-24 | 2005-11-09 | 上海大学 | Identification method of mobile number plate based on three-channel parallel artificial nerve network |
CN101957920A (en) * | 2010-09-08 | 2011-01-26 | 中国人民解放军国防科学技术大学 | Vehicle license plate searching method based on digital videos |
KR20130122842A (en) * | 2012-05-01 | 2013-11-11 | 한국생산기술연구원 | Encoding and decoding method of ls cord |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10817741B2 (en) | 2016-02-29 | 2020-10-27 | Alibaba Group Holding Limited | Word segmentation system, method and device |
WO2018170671A1 (en) * | 2017-03-20 | 2018-09-27 | Intel Corporation | Topic-guided model for image captioning system |
CN107195295A (en) * | 2017-05-04 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on Chinese and English mixing dictionary |
CN107301860A (en) * | 2017-05-04 | 2017-10-27 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on Chinese and English mixing dictionary |
CN107194341A (en) * | 2017-05-16 | 2017-09-22 | 西安电子科技大学 | The many convolution neural network fusion face identification methods of Maxout and system |
CN107194341B (en) * | 2017-05-16 | 2020-04-21 | 西安电子科技大学 | Face recognition method and system based on fusion of Maxout multi-convolution neural network |
CN108228686A (en) * | 2017-06-15 | 2018-06-29 | 北京市商汤科技开发有限公司 | It is used to implement the matched method, apparatus of picture and text and electronic equipment |
CN108228686B (en) * | 2017-06-15 | 2021-03-23 | 北京市商汤科技开发有限公司 | Method and device for realizing image-text matching and electronic equipment |
US11049018B2 (en) | 2017-06-23 | 2021-06-29 | Nvidia Corporation | Transforming convolutional neural networks for visual sequence learning |
US11645530B2 (en) | 2017-06-23 | 2023-05-09 | Nvidia Corporation | Transforming convolutional neural networks for visual sequence learning |
CN107480682A (en) * | 2017-08-25 | 2017-12-15 | 重庆慧都科技有限公司 | A kind of commodity packaging date of manufacture detection method |
CN108230413B (en) * | 2018-01-23 | 2021-07-06 | 北京市商汤科技开发有限公司 | Image description method and device, electronic equipment and computer storage medium |
CN108230413A (en) * | 2018-01-23 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image Description Methods and device, electronic equipment, computer storage media, program |
CN108427953A (en) * | 2018-02-26 | 2018-08-21 | 北京易达图灵科技有限公司 | A kind of character recognition method and device |
WO2019194356A1 (en) * | 2018-04-02 | 2019-10-10 | 삼성전자주식회사 | Electronic device and control method thereof |
US11482025B2 (en) | 2018-04-02 | 2022-10-25 | Samsung Electronics Co., Ltd. | Electronic device and control method thereof |
CN108776779A (en) * | 2018-05-25 | 2018-11-09 | 西安电子科技大学 | SAR Target Recognition of Sequential Images methods based on convolution loop network |
CN108776779B (en) * | 2018-05-25 | 2022-09-23 | 西安电子科技大学 | Convolutional-circulation-network-based SAR sequence image target identification method |
CN109242796A (en) * | 2018-09-05 | 2019-01-18 | 北京旷视科技有限公司 | Character image processing method, device, electronic equipment and computer storage medium |
CN109753966A (en) * | 2018-12-16 | 2019-05-14 | 初速度(苏州)科技有限公司 | A kind of Text region training system and method |
CN109840524B (en) * | 2019-01-04 | 2023-07-11 | 平安科技(深圳)有限公司 | Text type recognition method, device, equipment and storage medium |
CN109840524A (en) * | 2019-01-04 | 2019-06-04 | 平安科技(深圳)有限公司 | Kind identification method, device, equipment and the storage medium of text |
CN111461105A (en) * | 2019-01-18 | 2020-07-28 | 顺丰科技有限公司 | Text recognition method and device |
CN111461105B (en) * | 2019-01-18 | 2023-11-28 | 顺丰科技有限公司 | Text recognition method and device |
CN110210581B (en) * | 2019-04-28 | 2023-11-24 | 平安科技(深圳)有限公司 | Handwriting text recognition method and device and electronic equipment |
CN110210581A (en) * | 2019-04-28 | 2019-09-06 | 平安科技(深圳)有限公司 | A kind of handwritten text recognition methods and device, electronic equipment |
CN110175610A (en) * | 2019-05-23 | 2019-08-27 | 上海交通大学 | A kind of bill images text recognition method for supporting secret protection |
CN110175610B (en) * | 2019-05-23 | 2023-09-05 | 上海交通大学 | Bill image text recognition method supporting privacy protection |
CN110766017A (en) * | 2019-10-22 | 2020-02-07 | 国网新疆电力有限公司信息通信公司 | Mobile terminal character recognition method and system based on deep learning |
CN110766017B (en) * | 2019-10-22 | 2023-08-04 | 国网新疆电力有限公司信息通信公司 | Mobile terminal text recognition method and system based on deep learning |
EP4049167A4 (en) * | 2019-10-25 | 2022-12-21 | Servicenow Canada Inc. | 2d document extractor |
US11481605B2 (en) | 2019-10-25 | 2022-10-25 | Servicenow Canada Inc. | 2D document extractor |
WO2021079347A1 (en) * | 2019-10-25 | 2021-04-29 | Element Ai Inc. | 2d document extractor |
CN111160348A (en) * | 2019-11-20 | 2020-05-15 | 中国科学院深圳先进技术研究院 | Text recognition method for natural scene, storage device and computer equipment |
CN112990208B (en) * | 2019-12-12 | 2024-04-30 | 北京搜狗科技发展有限公司 | Text recognition method and device |
CN112990208A (en) * | 2019-12-12 | 2021-06-18 | 搜狗(杭州)智能科技有限公司 | Text recognition method and device |
CN111325203B (en) * | 2020-01-21 | 2022-07-05 | 福州大学 | American license plate recognition method and system based on image correction |
CN111325203A (en) * | 2020-01-21 | 2020-06-23 | 福州大学 | American license plate recognition method and system based on image correction |
CN111461116B (en) * | 2020-03-25 | 2024-02-02 | 深圳市云恩科技有限公司 | Ship board text recognition model structure, modeling method and training method thereof |
CN111461116A (en) * | 2020-03-25 | 2020-07-28 | 深圳市云恩科技有限公司 | Ship board text recognition model, modeling method and training method thereof |
CN111428727B (en) * | 2020-03-27 | 2023-04-07 | 华南理工大学 | Natural scene text recognition method based on sequence transformation correction and attention mechanism |
CN111428727A (en) * | 2020-03-27 | 2020-07-17 | 华南理工大学 | Natural scene text recognition method based on sequence transformation correction and attention mechanism |
CN111651980A (en) * | 2020-05-27 | 2020-09-11 | 河南科技学院 | Wheat cold resistance identification method with hybrid neural network fused with Attention mechanism |
EP4191471A4 (en) * | 2020-07-30 | 2024-01-17 | Shanghai Goldway Intelligent Transportation System Co., Ltd. | Sequence recognition method and apparatus, image processing device, and storage medium |
CN111860460A (en) * | 2020-08-05 | 2020-10-30 | 江苏新安电器股份有限公司 | Application method of improved LSTM model in human behavior recognition |
CN111985484A (en) * | 2020-08-11 | 2020-11-24 | 云南电网有限责任公司电力科学研究院 | CNN-LSTM-based temperature instrument digital identification method and device |
CN112052852A (en) * | 2020-09-09 | 2020-12-08 | 国家气象信息中心 | Character recognition method of handwritten meteorological archive data based on deep learning |
CN112052852B (en) * | 2020-09-09 | 2023-12-29 | 国家气象信息中心 | Character recognition method of handwriting meteorological archive data based on deep learning |
CN112508023A (en) * | 2020-10-27 | 2021-03-16 | 重庆大学 | Deep learning-based end-to-end identification method for code-spraying characters of parts |
CN113128490B (en) * | 2021-04-28 | 2023-12-05 | 湖南荣冠智能科技有限公司 | Prescription information scanning and automatic identification method |
CN113128490A (en) * | 2021-04-28 | 2021-07-16 | 湖南荣冠智能科技有限公司 | Prescription information scanning and automatic identification method |
CN113837282A (en) * | 2021-09-24 | 2021-12-24 | 上海脉衍人工智能科技有限公司 | Natural scene text recognition method and computing device |
CN113837282B (en) * | 2021-09-24 | 2024-02-02 | 上海脉衍人工智能科技有限公司 | Natural scene text recognition method and computing device |
Also Published As
Publication number | Publication date |
---|---|
CN107636691A (en) | 2018-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016197381A1 (en) | Methods and apparatus for recognizing text in an image | |
US11823050B2 (en) | Semi-supervised person re-identification using multi-view clustering | |
US11138441B2 (en) | Video action segmentation by mixed temporal domain adaption | |
US20210027098A1 (en) | Weakly Supervised Image Segmentation Via Curriculum Learning | |
US11776236B2 (en) | Unsupervised representation learning with contrastive prototypes | |
EP3690740B1 (en) | Method for optimizing hyperparameters of auto-labeling device which auto-labels training images for use in deep learning network to analyze images with high precision, and optimizing device using the same | |
WO2021135254A1 (en) | License plate number recognition method and apparatus, electronic device, and storage medium | |
US20190303535A1 (en) | Interpretable bio-medical link prediction using deep neural representation | |
US11797845B2 (en) | Model learning device, model learning method, and program | |
CN110321967B (en) | Image classification improvement method based on convolutional neural network | |
CN112733768B (en) | Natural scene text recognition method and device based on bidirectional characteristic language model | |
Zuo et al. | Challenging tough samples in unsupervised domain adaptation | |
WO2015192263A1 (en) | A method and a system for face verification | |
US20240242487A1 (en) | Transfer learning in image recognition systems | |
CN107945210B (en) | Target tracking method based on deep learning and environment self-adaption | |
CN112149526B (en) | Lane line detection method and system based on long-distance information fusion | |
Rajani et al. | A convolutional vision transformer for semantic segmentation of side-scan sonar data | |
CN116630369A (en) | Unmanned aerial vehicle target tracking method based on space-time memory network | |
CN115908806A (en) | Small sample image segmentation method based on lightweight multi-scale feature enhancement network | |
CN114913339A (en) | Training method and device of feature map extraction model | |
Liu et al. | Multi-digit recognition with convolutional neural network and long short-term memory | |
CN113792822B (en) | Efficient dynamic image classification method | |
Kumar et al. | Analysis and fast feature selection technique for real-time face detection materials using modified region optimized convolutional neural network | |
Passalis et al. | Multilayer probabilistic knowledge transfer for learning image representations | |
Rajput et al. | Handwritten Digit Recognition using Convolution Neural Networks |
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15894651; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29/03/2018)
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15894651; Country of ref document: EP; Kind code of ref document: A1