WO2016197381A1 - Methods and apparatus for recognizing text in an image
- Publication number: WO2016197381A1
- Authority: WIPO (PCT)
- Prior art keywords: sequence, cnn, image, layer, last
Classifications
- G06N3/08—Computing arrangements based on biological models; neural networks; learning methods
- G06N3/044—Neural network architectures; recurrent networks, e.g. Hopfield networks
- G06N3/045—Neural network architectures; combinations of networks
- G06V30/194—Character recognition using electronic means; references adjustable by an adaptive method, e.g. learning
- G06V30/10—Character recognition
Abstract
Methods and apparatus for recognizing text in an image are disclosed. According to an embodiment, the method comprises encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
Description
This application relates to text recognition, and in particular to methods and apparatus for recognizing text in an image.
Text recognition in natural images has received increasing attention in computer vision due to its numerous practical applications. This problem includes two sub-tasks, namely text detection and text-line/word recognition. The main difficulty arises from the large diversity of text patterns (e.g. low resolution, low contrast, and blurring) and highly complicated background clutter. Consequently, individual character segmentation or separation is extremely challenging.
Most previous studies focus on developing powerful character classifiers, some of which incorporate an additional language model, leading to state-of-the-art performance. These approaches mainly follow the basic pipeline of conventional OCR techniques: a character-level segmentation first, followed by an isolated character classifier and post-processing for recognition. Several approaches adopt deep neural networks for representation learning, but recognition is still confined to character-level classification. All current successful scene text recognition systems are built mostly on isolated character classifiers. Their performance is thus severely harmed by the difficulty of character-level segmentation or separation. Importantly, recognizing each character independently discards meaningful context information of the words, significantly reducing reliability and robustness.
Summary
According to an aspect of the present application, a method for recognizing text in an image comprises: encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
According to another aspect of the present application, an apparatus for recognizing text in an image comprises: a convolutional neural network (CNN) encoding the image into a first sequence, wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; and a recurrent neural network (RNN) decoding the first sequence into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; wherein the RNN further maps the second sequence into a word string in which repeated labels and non-character labels are removed.
Drawings
Fig. 1 is a flowchart of a method for recognizing text in an image according to an embodiment of the present application.
Fig. 2 illustrates an overall pipeline of an apparatus or a system for recognizing text in an image according to an embodiment of the present application.
Fig. 3 illustrates a structure of a five-layer Maxout CNN model used in an embodiment of the present application.
Fig. 4 illustrates a structure of an RNN model used in an embodiment of the present application.
Fig. 5 illustrates a structure of the memory cells in an RNN model used in an embodiment of the present application.
Embodiments of the present application are described below in detail with reference to the accompanying drawings.
Fig. 1 illustrates a flowchart of a method 100 for recognizing text in an image according to an embodiment of the present application. As shown in Fig. 1, at step S101, an image with characters is encoded into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN. At step S102, the first sequence is decoded with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence. At step S103, the second sequence is mapped into a word string in which repeated labels and non-character labels are removed.
According to the embodiment, no character segmentation is needed. Instead, the output from the second-to-last convolutional layer of the CNN is obtained and directly used as an input to the RNN for text recognition, so that the RNN's advantage in retaining meaningful interdependencies of continuous text is exploited during the process.
In an embodiment, the CNN may perform convolution on the image as a whole. In this case, the result of the convolution obtained from the second-to-last convolutional layer of the CNN is the first sequence that is to be used as the input of the RNN.
Alternatively, the CNN may use a sliding window to scan the word image densely from left to right and divide the image into continuous segments. Note that such segments are not the same as those obtained by character segmentation: the window simply scans and divides, without identifying individual characters. The segments are convolved individually by the CNN. The results of the convolution obtained from the second-to-last convolutional layer of the CNN are components which collectively form the first sequence that is to be used as the input of the RNN.
Fig. 2 illustrates an overall pipeline of an apparatus or a system for recognizing text in an image according to an embodiment of the present application. As shown in the top box, an image with a character string “apartment” is divided into segments of the same size, e.g. 32×32. Such division is implemented by scanning with a sliding window. The division involves no character identification work. Each segment may comprise one or more complete or incomplete characters, or may comprise no character at all, as can be seen from the second row in the top box.
The middle box in Fig. 2 shows a CNN network, which performs convolution on each of the segments sequentially. As is known, a CNN may comprise several layers. In the prior art, some text recognition technologies have used a CNN for isolated character recognition, in which a character label is output at the last layer of the CNN. According to the present application, the output from the second-to-last layer of the CNN is used instead. This output has 128 feature maps, each of which includes a single neuron. For example, a segment with a size of 32×32 may generate a 128D output from the second-to-last layer of the CNN. For an image that can be divided by a sliding window into T segments, T 128D outputs may be generated, where T is a positive integer that varies with the aspect ratio of the image. The output sequence represents high-level deep features of the input image.
The bottom box in Fig. 2 shows an RNN network, which decodes the output sequence from the CNN. As can be seen, the RNN has the same number of channels as the CNN. However, unlike the CNN, in which each channel works individually, the sequential channels in the RNN are connected and interact through internal states of the RNN in the hidden layers. With this configuration, for each component in the sequence output from the CNN, estimated probabilities over all possible characters are output, taking into account the relationship with both the previous component (if any) and the following component (if any) in the sequence. The estimated probabilities for each component in the sequence output from the CNN (and thus for each segment of the image) are then considered together and mapped into the word string “apartment”, in which repeated labels and non-character labels are removed.
According to an embodiment, the image may be resized to fit the CNN or the sliding window so that it can be properly processed and recognized. For example, for a sliding window with a size of 32×32, the image may be resized to a height of 32 while keeping its original aspect ratio unchanged.
Although Fig. 2 shows the case in which a sliding window is used, the sliding window is not necessary in other embodiments. In that case, the image is convolved as a whole, and the output from the second-to-last layer of the CNN is a 128×T matrix, which is equivalent to the sequence obtained by connecting T 128D outputs together.
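For illustration, a minimal sketch of this pre-processing, assuming OpenCV and NumPy; the 4-pixel window stride is an assumption, since the description only states that the window scans densely from left to right:

```python
import cv2
import numpy as np

def make_segments(image: np.ndarray, win: int = 32, stride: int = 4) -> list:
    # Resize to height 32 while keeping the original aspect ratio.
    h, w = image.shape[:2]
    new_w = max(win, round(w * win / h))
    resized = cv2.resize(image, (new_w, win))   # dsize is (width, height)
    # Scan a win x win window from left to right; segments may cut through
    # characters or contain none, since no character segmentation is done.
    return [resized[:, x:x + win] for x in range(0, new_w - win + 1, stride)]
```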
Before describing the CNN and RNN in detail, the process of word image recognition is formulated as a sequence labelling problem. The probability of the correct word string (S_w) is maximized given an input image (I) as follows:

θ* = arg max_θ Σ_{(I, S_w) ∈ Ω} log P(S_w | I; θ)   (1)

where θ are the parameters of the recurrent system, and (I, S_w) ∈ Ω is a sample pair from a training set Ω, where S_w is the ground truth word string (containing K characters) of the image I. Commonly, the chain rule is applied to model the joint probability over S_w:

log P(S_w | I; θ) = Σ_{k=1}^{K} log P(S_w^k | I, S_w^1, ..., S_w^{k-1}; θ)   (2)

Thus the sum of the log probabilities over all sample pairs in the training set Ω is optimized to learn the model parameters. An RNN is developed to model the sequential probabilities P(S_w^k | I, S_w^1, ..., S_w^{k-1}), where the variable number of sequentially conditioned characters is expressed by an internal state of the RNN in its hidden layer, h_t. This internal state is updated when the next sequential input x_t is presented, by computing a non-linear function H:

h_{t+1} = H(h_t, x_t)   (3)

where the non-linear function H defines the exact form of the proposed recurrent system, and X = {x_1, x_2, ..., x_T} is the sequence of CNN features computed from the word image. The designs of x_t and H play crucial roles in the proposed system. A CNN model is developed to generate the sequential x_t, and H is defined with a long short-term memory (LSTM) architecture.
Both the CNN and the RNN are trained in advance, as described in detail below.
To better understand and implement embodiments of the application, a five-layer Maxout CNN and a bidirectional long short-term memory (LSTM) based RNN are used in the following illustrative example. The LSTM-based RNN may further comprise a connectionist temporal classification (CTC) layer. It is noted that other kinds of CNN and/or RNN may also be used to implement the application.
Fig. 3 illustrates a structure of a five-layer Maxout CNN model used in an embodiment of the present application. As shown, the basic pipeline is to compute a point-wise maximum through a number of grouped feature maps or channels. For example, an input image/segment has a size of 32×32, corresponding to the size of the sliding window. The Maxout CNN network has five convolutional layers, each of which is followed by a two-group or four-group Maxout operation, with different numbers of feature maps, i.e. 48, 64, 128, 128 and 36 respectively. During the convolution, no pooling operation is involved, and the output maps of the last two convolutional layers are just one pixel each. This allows the CNN to convolve whole word images at once, leading to significant computational efficiency. Each word image may be resized to a height of 32 while keeping its original aspect ratio unchanged. By applying the learned CNN to the resized image, a 128D CNN sequence may be obtained directly from the output of the second-to-last convolutional layer. This operation is similar to computing deep features independently with a sliding window moved densely through the image, but is far more computationally efficient. The Maxout CNN may be trained on 36 classes of case-insensitive character sample images, comprising 26 letters and 10 digits.
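For illustration, a minimal sketch of the Maxout operation described above, assuming PyTorch; reading "two-group" as a point-wise maximum over groups of two consecutive feature maps is an assumption, since the exact grouping convention is not spelled out here:

```python
import torch

def maxout(x: torch.Tensor, pieces: int) -> torch.Tensor:
    # Point-wise maximum over groups of `pieces` consecutive feature maps,
    # reducing (N, C, H, W) to (N, C // pieces, H, W).
    n, c, h, w = x.shape
    return x.view(n, c // pieces, pieces, h, w).max(dim=2).values
```

Under this reading, a convolutional layer intended to yield, say, 48 Maxout feature maps with two-piece groups would output 96 raw channels before the maximum is taken.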
As mentioned above, for an image divided by a sliding window into T segments, the output from the CNN to the RNN is a matrix X = {x_1, x_2, ..., x_T}, in which each x_t is a 128D vector.
Fig. 4 illustrates a structure of an RNN model used in an embodiment of the present application. The RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence from the input CNN sequence, and the CTC layer generates the word string from the second sequence.
As shown, the bidirectional LSTM has two separate LSTM hidden layers that process the input sequence forward and backward, respectively. Both hidden layers are connected to the same input and output layers.
The main shortcoming of the standard RNN is the vanishing gradient problem, which makes it hard to transmit gradient information consistently over long time spans. This is a crucial issue in designing an RNN model, and the long short-term memory (LSTM) was proposed specifically to address this problem. The LSTM defines a new neuron or cell structure in the hidden layer with three additional multiplicative gates: the input gate, the forget gate and the output gate. These new cells are referred to as memory cells, and they allow the LSTM to learn meaningful long-range interdependencies. The structure of the memory cells is depicted in Fig. 5, where σ is the logistic sigmoid function, realizing the non-linearity of the LSTM. The cell activation is a summation of the previous cell activation and the input modulation, which are controlled by the forget gate and the input gate respectively. These two gates trade off the influence of the previous memory cell against the current input information. The output gate controls how much of the cell activation is transferred to the final hidden state. Each LSTM hidden layer includes 128 LSTM memory cells, each of which has the structure shown in Fig. 5.
Sequence labelling of varying lengths is processed by applying the LSTM memory recurrently to each sequential input x_t (t is an integer from 1 to T), such that all LSTMs share the same parameters. The hidden state of the LSTM, h_t, is fed to the LSTM at the next input x_{t+1}. It is also used to compute the current output, which is transformed into the estimated probabilities over all possible characters. The LSTM finally generates a sequence of estimations with the same length as the input sequence, p = {p_1, p_2, p_3, ..., p_T}.
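For illustration, a minimal sketch of such a bidirectional LSTM decoder, assuming PyTorch; the sizes (128D inputs, 128 memory cells per direction, 37 output classes) follow the example given later in this description, while the class name and layout are otherwise hypothetical:

```python
import torch
import torch.nn as nn

class BLSTMDecoder(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 128, classes: int = 37):
        super().__init__()
        # Two separate hidden layers process the sequence forward and
        # backward; their states are concatenated per time step.
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True)
        # Concatenated 256D states -> 36 characters + 1 non-character label.
        self.fc = nn.Linear(2 * hidden, classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: CNN sequence of shape (T, N, 128); returns (T, N, 37)
        # log-probabilities over the possible labels at each time step.
        h, _ = self.blstm(x)
        return self.fc(h).log_softmax(dim=-1)
```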
Due to the unsegmented nature of the word image at the character level, the length of the LSTM outputs (T) is not consistent with the length of a target word string, |S_w| = K. This makes it difficult to train the RNN directly with the target strings. To this end, connectionist temporal classification (CTC) is applied to approximately map the LSTM sequential output (p) into its target string as follows:

S_w ≈ B(arg max_π P(π | p))   (5)
where the projection B removes the repeated labels and the non-character labels. For example, B("-gg-o-oo-dd-") = "good". The CTC looks for an approximately optimized path (π) with maximum probability through the LSTM output sequence, which aligns the differing lengths of the LSTM sequence and the word string.
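The projection B itself is straightforward to express in code; a minimal sketch in plain Python, with "-" standing for the non-character label:

```python
def ctc_projection(path: str, blank: str = "-") -> str:
    # Collapse runs of repeated labels, then drop the non-character label.
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

assert ctc_projection("-gg-o-oo-dd-") == "good"
```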
The CTC is specifically designed for sequence labelling tasks in which it is hard to pre-segment the input sequence into segments that exactly match a target sequence. In our RNN model, the CTC layer is directly connected to the outputs of the LSTMs and works as the output layer of the whole RNN. It not only allows the model to avoid complicated post-processing (e.g. transforming the LSTM output sequence into a word string), but also makes it possible to train the model in an end-to-end fashion by minimizing an overall loss function over (X, S_w) ∈ Ω. The overall loss is computed as the sum of the negative log likelihoods of the true word strings:

O = − Σ_{(X, S_w) ∈ Ω} log P(S_w | X)   (6)
Finally, the RNN model according to the present application follows a bidirectional LSTM architecture, as shown in Fig. 4. It has two separate LSTM hidden layers that process the input sequence forward and backward, respectively. Both hidden layers are connected to the same output layers, allowing the model to access both past and future information in the sequence.
In an example, the CNN model according to the present application is trained on about 1.8×10^5 character images, and the CNN sequence is generated by applying the trained CNN with a sliding window on the word images, followed by a column-wise normalization. The RNN model contains a bidirectional LSTM architecture. Each LSTM layer has 128 LSTM memory cell blocks. The input layer of our RNN model has 128 neurons (corresponding to the dimensions of the CNN sequence, x_t ∈ R^128), which are fully connected to both hidden layers. The outputs of the two hidden layers are concatenated and then fully connected to the output layer of the LSTM with 37 output classes (including an additional non-character class), using a softmax function. In total, our RNN model has 273,445 parameters, which are initialized with a Gaussian distribution of mean 0 and standard deviation 0.01 in the training process.
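For illustration, the reported total of 273,445 parameters is consistent with an LSTM variant that includes peephole connections; attributing the count to peepholes is an inference, not something stated here:

```python
# Per direction: 4 gates, each with input weights (128x128), recurrent
# weights (128x128) and a bias (128), plus 3 diagonal peephole vectors.
per_direction = 4 * (128 * 128 + 128 * 128 + 128) + 3 * 128   # 131,968
# Concatenated forward/backward states (256D) -> 37 softmax classes.
output_layer = (2 * 128) * 37 + 37                            # 9,509
print(2 * per_direction + output_layer)                       # 273445
# Without peepholes (as in PyTorch's nn.LSTM) the total would be 272,677.
```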
The recurrent model is trained with steepest descent. The parameters are updated per training sequence, using a learning rate of 10^-4 and a momentum of 0.9. Each input sequence is randomly selected from the training set. A forward-backward algorithm is performed to jointly optimize the bidirectional LSTM and CTC parameters: a forward propagation is implemented through the whole network, followed by a forward-backward algorithm that aligns the ground truth word strings to the LSTM output maps, π ∈ B^{-1}(S_w), with p ∈ R^{37×T}. The loss function of Eq. (6) is then computed approximately over the aligned path:

P(S_w | X) ≈ P(π | X) = Π_{t=1}^{T} p_{π_t}^t   (7)

where p_{π_t}^t is the estimated probability of label π_t at time step t.
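For illustration, a minimal sketch of one such training step, assuming PyTorch; nn.CTCLoss performs the forward-backward alignment internally, BLSTMDecoder is the sketch given earlier, and the data here is synthetic:

```python
import torch
import torch.nn as nn

model = BLSTMDecoder()                       # sketch defined earlier
ctc = nn.CTCLoss(blank=36)                   # assume index 36 is the non-character label
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

x = torch.randn(20, 1, 128)                  # one CNN sequence with T = 20
target = torch.randint(0, 36, (1, 9))        # e.g. the 9 character labels of "apartment"

opt.zero_grad()
log_probs = model(x)                         # (T, N, 37) log-probabilities
loss = ctc(log_probs, target,
           torch.tensor([20]),               # input (LSTM output) lengths
           torch.tensor([9]))                # target string lengths
loss.backward()                              # error from the aligned paths
opt.step()
```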
Finally, the approximated error is propagated backward to update the parameters. The RNN is trained on about 3,000 word images, taken from the training sets of the three benchmarks mentioned in the next section.
The performance of the proposed text recognition solution is compared against state-of-the-art methods on three standard benchmarks for cropped word image recognition. The experimental results demonstrate that the method and apparatus of the present application perform well in recognizing words in images when trained with a relatively small number of samples.
Although the preferred embodiments of the present invention have been described, those skilled in the art may make many modifications and changes once they learn of the basic inventive concepts. The appended claims are intended to be construed as comprising these preferred embodiments and all changes and modifications falling within the scope of the present invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit and scope of the present invention. Thus, if such modifications and variations lie within the spirit and principle of the present application, the present invention is intended to include them.
Claims (18)
- A method for recognizing text in an image, comprising: encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
- The method of claim 1, wherein encoding the image into a first sequence with a CNN comprises: convolving the image as a whole with the CNN, wherein a result of the convolution obtained from the second-to-last convolutional layer of the CNN is the first sequence.
- The method of claim 1, wherein encoding the image into a first sequence with a CNN comprises: applying a sliding window to the image to divide the image into segments having a same size; and convolving the segments, individually and sequentially, with the CNN, wherein results of the convolution obtained from the second-to-last convolutional layer of the CNN are components forming the first sequence.
- The method of claim 1, further comprising, prior to the step of encoding: resizing the image to have a predetermined size.
- The method of claim 4, wherein the CNN has been trained with image samples having the predetermined size and outputs 36 classes of different characters at the last layer of the CNN.
- The method of claim 1, wherein the output from the second-to-last convolutional layer of the CNN is just one neuron.
- The method of claim 1, wherein the RNN has been trained with a set of convolutional sequences and corresponding word strings.
- The method of claim 1, wherein the CNN comprises a Maxout CNN.
- The method of claim 1, wherein the RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence, and the CTC layer generates the word string.
- An apparatus for recognizing text in an image, comprising: a convolutional neural network (CNN) encoding the image into a first sequence, wherein the first sequence is an output from a second-to-last convolutional layer of the CNN; and a recurrent neural network (RNN) decoding the first sequence into a second sequence, which has the same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; wherein the RNN further maps the second sequence into a word string in which repeated labels and non-character labels are removed.
- The apparatus of claim 10, wherein the CNN encodes the image into a first sequence by: convolving the image as a whole with the CNN, wherein a result of the convolution obtained from the second-to-last convolutional layer of the CNN is the first sequence.
- The apparatus of claim 10, wherein the CNN encodes the image into a first sequence by: applying a sliding window to the image to divide the image into segments having a same size; and convolving the segments, individually and sequentially, with the CNN, wherein results of the convolution obtained from the second-to-last convolutional layer of the CNN are components forming the first sequence.
- The apparatus of claim 10, wherein the image has been resized to have a predetermined size before being input to the CNN.
- The apparatus of claim 13, wherein the CNN has been trained with image samples having the predetermined size and outputs 36 classes of different characters at the last layer of the CNN.
- The apparatus of claim 10, wherein the output from the second-to-last convolutional layer of the CNN is just one neuron.
- The apparatus of claim 10, wherein the RNN has been trained with a set of convolutional sequences and corresponding word strings.
- The apparatus of claim 10, wherein the CNN comprises a Maxout CNN.
- The apparatus of claim 10, wherein the RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence, and the CTC layer generates the word string.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201580080720.6A CN107636691A (en) | 2015-06-12 | 2015-06-12 | Method and apparatus for identifying the text in image |
PCT/CN2015/081308 WO2016197381A1 (en) | 2015-06-12 | 2015-06-12 | Methods and apparatus for recognizing text in an image |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/081308 WO2016197381A1 (en) | 2015-06-12 | 2015-06-12 | Methods and apparatus for recognizing text in an image |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016197381A1 true WO2016197381A1 (en) | 2016-12-15 |
Family
ID=57502873
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/081308 WO2016197381A1 (en) | 2015-06-12 | 2015-06-12 | Methods and apparatus for recognizing text in an image |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107636691A (en) |
WO (1) | WO2016197381A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388896B (en) * | 2018-02-09 | 2021-06-22 | 杭州雄迈集成电路技术股份有限公司 | License plate identification method based on dynamic time sequence convolution neural network |
CN108682418B (en) * | 2018-06-26 | 2022-03-04 | 北京理工大学 | Speech recognition method based on pre-training and bidirectional LSTM |
CN109214378A (en) * | 2018-08-16 | 2019-01-15 | 新智数字科技有限公司 | A kind of method and system integrally identifying metering meter reading based on neural network |
TWI677826B (en) * | 2018-09-19 | 2019-11-21 | 國家中山科學研究院 | License plate recognition system and method |
CN109784340A (en) * | 2018-12-14 | 2019-05-21 | 北京市首都公路发展集团有限公司 | A kind of licence plate recognition method and device |
CN109726657B (en) * | 2018-12-21 | 2023-06-09 | 万达信息股份有限公司 | Deep learning scene text sequence recognition method |
CN109919150A (en) * | 2019-01-23 | 2019-06-21 | 浙江理工大学 | A kind of non-division recognition sequence method and system of 3D pressed characters |
CN110188761A (en) * | 2019-04-22 | 2019-08-30 | 平安科技(深圳)有限公司 | Recognition methods, device, computer equipment and the storage medium of identifying code |
CN113450433B (en) * | 2020-03-26 | 2024-08-16 | 阿里巴巴集团控股有限公司 | Picture generation method, device, computer equipment and medium |
CN112232195B (en) * | 2020-10-15 | 2024-02-20 | 北京临近空间飞行器系统工程研究所 | Handwritten Chinese character recognition method, device and storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060045341A1 (en) * | 2004-08-31 | 2006-03-02 | Samsung Electronics Co., Ltd. | Apparatus and method for high-speed character recognition |
CN1694130A (en) * | 2005-03-24 | 2005-11-09 | 上海大学 | Identification method of mobile number plate based on three-channel parallel artificial nerve network |
CN101957920A (en) * | 2010-09-08 | 2011-01-26 | 中国人民解放军国防科学技术大学 | Vehicle license plate searching method based on digital videos |
KR20130122842A (en) * | 2012-05-01 | 2013-11-11 | 한국생산기술연구원 | Encoding and decoding method of ls cord |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10817741B2 (en) | 2016-02-29 | 2020-10-27 | Alibaba Group Holding Limited | Word segmentation system, method and device |
WO2018170671A1 (en) * | 2017-03-20 | 2018-09-27 | Intel Corporation | Topic-guided model for image captioning system |
CN107195295A (en) * | 2017-05-04 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on Chinese and English mixing dictionary |
CN107301860A (en) * | 2017-05-04 | 2017-10-27 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on Chinese and English mixing dictionary |
CN107194341A (en) * | 2017-05-16 | 2017-09-22 | 西安电子科技大学 | The many convolution neural network fusion face identification methods of Maxout and system |
CN107194341B (en) * | 2017-05-16 | 2020-04-21 | 西安电子科技大学 | Face recognition method and system based on fusion of Maxout multi-convolution neural network |
CN108228686A (en) * | 2017-06-15 | 2018-06-29 | 北京市商汤科技开发有限公司 | It is used to implement the matched method, apparatus of picture and text and electronic equipment |
CN108228686B (en) * | 2017-06-15 | 2021-03-23 | 北京市商汤科技开发有限公司 | Method and device for realizing image-text matching and electronic equipment |
US11049018B2 (en) | 2017-06-23 | 2021-06-29 | Nvidia Corporation | Transforming convolutional neural networks for visual sequence learning |
US11645530B2 (en) | 2017-06-23 | 2023-05-09 | Nvidia Corporation | Transforming convolutional neural networks for visual sequence learning |
CN107480682A (en) * | 2017-08-25 | 2017-12-15 | 重庆慧都科技有限公司 | A kind of commodity packaging date of manufacture detection method |
CN108230413B (en) * | 2018-01-23 | 2021-07-06 | 北京市商汤科技开发有限公司 | Image description method and device, electronic equipment and computer storage medium |
CN108230413A (en) * | 2018-01-23 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image Description Methods and device, electronic equipment, computer storage media, program |
CN108427953A (en) * | 2018-02-26 | 2018-08-21 | 北京易达图灵科技有限公司 | A kind of character recognition method and device |
WO2019194356A1 (en) * | 2018-04-02 | 2019-10-10 | 삼성전자주식회사 | Electronic device and control method thereof |
US11482025B2 (en) | 2018-04-02 | 2022-10-25 | Samsung Electronics Co., Ltd. | Electronic device and control method thereof |
CN108776779A (en) * | 2018-05-25 | 2018-11-09 | 西安电子科技大学 | SAR Target Recognition of Sequential Images methods based on convolution loop network |
CN108776779B (en) * | 2018-05-25 | 2022-09-23 | 西安电子科技大学 | Convolutional-circulation-network-based SAR sequence image target identification method |
CN109242796A (en) * | 2018-09-05 | 2019-01-18 | 北京旷视科技有限公司 | Character image processing method, device, electronic equipment and computer storage medium |
CN109753966A (en) * | 2018-12-16 | 2019-05-14 | 初速度(苏州)科技有限公司 | A kind of Text region training system and method |
CN109840524B (en) * | 2019-01-04 | 2023-07-11 | 平安科技(深圳)有限公司 | Text type recognition method, device, equipment and storage medium |
CN109840524A (en) * | 2019-01-04 | 2019-06-04 | 平安科技(深圳)有限公司 | Kind identification method, device, equipment and the storage medium of text |
CN111461105A (en) * | 2019-01-18 | 2020-07-28 | 顺丰科技有限公司 | Text recognition method and device |
CN111461105B (en) * | 2019-01-18 | 2023-11-28 | 顺丰科技有限公司 | Text recognition method and device |
CN110210581B (en) * | 2019-04-28 | 2023-11-24 | 平安科技(深圳)有限公司 | Handwriting text recognition method and device and electronic equipment |
CN110210581A (en) * | 2019-04-28 | 2019-09-06 | 平安科技(深圳)有限公司 | A kind of handwritten text recognition methods and device, electronic equipment |
CN110175610A (en) * | 2019-05-23 | 2019-08-27 | 上海交通大学 | A kind of bill images text recognition method for supporting secret protection |
CN110175610B (en) * | 2019-05-23 | 2023-09-05 | 上海交通大学 | Bill image text recognition method supporting privacy protection |
CN110766017A (en) * | 2019-10-22 | 2020-02-07 | 国网新疆电力有限公司信息通信公司 | Mobile terminal character recognition method and system based on deep learning |
CN110766017B (en) * | 2019-10-22 | 2023-08-04 | 国网新疆电力有限公司信息通信公司 | Mobile terminal text recognition method and system based on deep learning |
EP4049167A4 (en) * | 2019-10-25 | 2022-12-21 | Servicenow Canada Inc. | 2d document extractor |
US11481605B2 (en) | 2019-10-25 | 2022-10-25 | Servicenow Canada Inc. | 2D document extractor |
WO2021079347A1 (en) * | 2019-10-25 | 2021-04-29 | Element Ai Inc. | 2d document extractor |
CN111160348A (en) * | 2019-11-20 | 2020-05-15 | 中国科学院深圳先进技术研究院 | Text recognition method for natural scene, storage device and computer equipment |
CN112990208B (en) * | 2019-12-12 | 2024-04-30 | 北京搜狗科技发展有限公司 | Text recognition method and device |
CN112990208A (en) * | 2019-12-12 | 2021-06-18 | 搜狗(杭州)智能科技有限公司 | Text recognition method and device |
CN111325203B (en) * | 2020-01-21 | 2022-07-05 | 福州大学 | American license plate recognition method and system based on image correction |
CN111325203A (en) * | 2020-01-21 | 2020-06-23 | 福州大学 | American license plate recognition method and system based on image correction |
CN111461116B (en) * | 2020-03-25 | 2024-02-02 | 深圳市云恩科技有限公司 | Ship board text recognition model structure, modeling method and training method thereof |
CN111461116A (en) * | 2020-03-25 | 2020-07-28 | 深圳市云恩科技有限公司 | Ship board text recognition model, modeling method and training method thereof |
CN111428727B (en) * | 2020-03-27 | 2023-04-07 | 华南理工大学 | Natural scene text recognition method based on sequence transformation correction and attention mechanism |
CN111428727A (en) * | 2020-03-27 | 2020-07-17 | 华南理工大学 | Natural scene text recognition method based on sequence transformation correction and attention mechanism |
CN111651980A (en) * | 2020-05-27 | 2020-09-11 | 河南科技学院 | Wheat cold resistance identification method with hybrid neural network fused with Attention mechanism |
EP4191471A4 (en) * | 2020-07-30 | 2024-01-17 | Shanghai Goldway Intelligent Transportation System Co., Ltd. | Sequence recognition method and apparatus, image processing device, and storage medium |
CN111860460A (en) * | 2020-08-05 | 2020-10-30 | 江苏新安电器股份有限公司 | Application method of improved LSTM model in human behavior recognition |
CN111985484A (en) * | 2020-08-11 | 2020-11-24 | 云南电网有限责任公司电力科学研究院 | CNN-LSTM-based temperature instrument digital identification method and device |
CN112052852A (en) * | 2020-09-09 | 2020-12-08 | 国家气象信息中心 | Character recognition method of handwritten meteorological archive data based on deep learning |
CN112052852B (en) * | 2020-09-09 | 2023-12-29 | 国家气象信息中心 | Character recognition method of handwriting meteorological archive data based on deep learning |
CN112508023A (en) * | 2020-10-27 | 2021-03-16 | 重庆大学 | Deep learning-based end-to-end identification method for code-spraying characters of parts |
CN113128490B (en) * | 2021-04-28 | 2023-12-05 | 湖南荣冠智能科技有限公司 | Prescription information scanning and automatic identification method |
CN113128490A (en) * | 2021-04-28 | 2021-07-16 | 湖南荣冠智能科技有限公司 | Prescription information scanning and automatic identification method |
CN113837282A (en) * | 2021-09-24 | 2021-12-24 | 上海脉衍人工智能科技有限公司 | Natural scene text recognition method and computing device |
CN113837282B (en) * | 2021-09-24 | 2024-02-02 | 上海脉衍人工智能科技有限公司 | Natural scene text recognition method and computing device |
Also Published As
Publication number | Publication date |
---|---|
CN107636691A (en) | 2018-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016197381A1 (en) | Methods and apparatus for recognizing text in an image | |
US11823050B2 (en) | Semi-supervised person re-identification using multi-view clustering | |
US11138441B2 (en) | Video action segmentation by mixed temporal domain adaption | |
US20210027098A1 (en) | Weakly Supervised Image Segmentation Via Curriculum Learning | |
US11776236B2 (en) | Unsupervised representation learning with contrastive prototypes | |
EP3690740B1 (en) | Method for optimizing hyperparameters of auto-labeling device which auto-labels training images for use in deep learning network to analyze images with high precision, and optimizing device using the same | |
WO2021135254A1 (en) | License plate number recognition method and apparatus, electronic device, and storage medium | |
US20190303535A1 (en) | Interpretable bio-medical link prediction using deep neural representation | |
US11797845B2 (en) | Model learning device, model learning method, and program | |
CN110321967B (en) | Image classification improvement method based on convolutional neural network | |
CN112733768B (en) | Natural scene text recognition method and device based on bidirectional characteristic language model | |
Zuo et al. | Challenging tough samples in unsupervised domain adaptation | |
WO2015192263A1 (en) | A method and a system for face verification | |
US20240242487A1 (en) | Transfer learning in image recognition systems | |
CN107945210B (en) | Target tracking method based on deep learning and environment self-adaption | |
CN112149526B (en) | Lane line detection method and system based on long-distance information fusion | |
Rajani et al. | A convolutional vision transformer for semantic segmentation of side-scan sonar data | |
CN116630369A (en) | Unmanned aerial vehicle target tracking method based on space-time memory network | |
CN115908806A (en) | Small sample image segmentation method based on lightweight multi-scale feature enhancement network | |
CN114913339A (en) | Training method and device of feature map extraction model | |
Liu et al. | Multi-digit recognition with convolutional neural network and long short-term memory | |
CN113792822B (en) | Efficient dynamic image classification method | |
Kumar et al. | Analysis and fast feature selection technique for real-time face detection materials using modified region optimized convolutional neural network | |
Passalis et al. | Multilayer probabilistic knowledge transfer for learning image representations | |
Rajput et al. | Handwritten Digit Recognition using Convolution Neural Networks |
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15894651; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29/03/2018)
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15894651; Country of ref document: EP; Kind code of ref document: A1