
WO2016197381A1 - Methods and apparatus for recognizing text in an image - Google Patents

Methods and apparatus for recognizing text in an image

Info

Publication number
WO2016197381A1
WO2016197381A1 (PCT/CN2015/081308)
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
cnn
image
layer
last
Prior art date
Application number
PCT/CN2015/081308
Other languages
French (fr)
Inventor
Xiaoou Tang
Weilin Huang
Yu Qiao
Chen Change Loy
Pan HE
Original Assignee
Sensetime Group Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensetime Group Limited
Priority to CN201580080720.6A (CN107636691A)
Priority to PCT/CN2015/081308 (WO2016197381A1)
Publication of WO2016197381A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/192Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194References adjustable by an adaptive method, e.g. learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

Methods and apparatus for recognizing text in an image are disclosed. According to an embodiment, the method comprises encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a last second convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has a same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.

Description

Methods and Apparatus for Recognizing Text in an Image
Technical Field
This application relates to text recognition; in particular, to methods and apparatus for recognizing text in an image.
Background
Text recognition in natural images has received increasing attention in computer vision, due to its numerous practical applications. This problem includes two subtasks, namely text detection and text-line/word recognition. The main difficulty arises from the large diversity of text patterns (e.g. low resolution, low contrast, and blurring) and highly complicated background clutter. Consequently, individual character segmentation or separation is extremely challenging.
Most previous studies focus on developing powerful character classifiers, some of which are incorporated with an additional language model, leading to state-of-the-art performance. These approaches mainly follow the basic pipeline of conventional OCR techniques: a character-level segmentation, followed by an isolated character classifier and post-processing for recognition. Several approaches adopt deep neural networks for representation learning, but the recognition is still confined to character-level classification. Current successful scene text recognition systems are mostly built on isolated character classifiers. Their performance is thus severely harmed by the difficulty of character-level segmentation or separation. Importantly, recognizing each character independently discards meaningful context information of the words, significantly reducing reliability and robustness.
Summary
According to an aspect of the present application, a method for recognizing text in an image comprises: encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a last second convolutional layer of the CNN; decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has a same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
According to another aspect of the present application, an apparatus for recognizing text in an image comprises: a convolutional neural network (CNN) encoding the image into a first sequence, wherein the first sequence is an output from a last second convolutional layer of the CNN; and a recurrent neural network (RNN) decoding the first sequence into a second sequence, which has a same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; wherein the RNN further maps the second sequence into a word string in which repeated labels and non-character labels are removed.
Drawings
Fig. 1 is a flowchart of a method for recognizing text in an image according to an embodiment of the present application.
Fig. 2 illustrates an overall pipeline of an apparatus or a system for recognizing text in an image according to an embodiment of the present application.
Fig. 3 illustrates the structure of a five-layer Maxout CNN model used in an embodiment of the present application.
Fig. 4 illustrates the structure of an RNN model used in an embodiment of the present application.
Fig. 5 illustrates the structure of the memory cells in an RNN model used in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a flowchart of a method 100 for recognizing text in an image according to an embodiment of the present application. As shown in Fig. 1, at step S101, an image with characters is encoded into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a last second convolutional layer (i.e., the second-to-last convolutional layer) of the CNN. At step S102, the first sequence is decoded with a recurrent neural network (RNN) into a second sequence, which has a same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence. At step S103, the second sequence is mapped into a word string in which repeated labels and non-character labels are removed.
According to the embodiment, no character segmentation is needed. Instead, the output from the last second convolutional layer of the CNN is obtained and directly used as the input of the RNN for text recognition, so that the RNN's strength in retaining meaningful interdependencies of continuous text is exploited during the process.
In an embodiment, the CNN may perform convolution on the image as a whole. In this case, the result of the convolution obtained from the last second convolutional layer of the CNN is the first sequence that is to be used as the input of the RNN.
Alternatively, the CNN may use a sliding window to scan the word image densely from left to right and divide the image into continuous segments, as sketched below. Note that such segments are not equal to those obtained by character segmentation, since the window simply scans and divides without identifying individual characters. The segments are convoluted individually by the CNN. The results of the convolution obtained from the last second convolutional layer of the CNN are components which collectively form the first sequence that is to be used as the input of the RNN.
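For illustration, a minimal Python/NumPy sketch of this dense scanning follows, assuming a 32*32 window; the scanning stride is not specified in the text, so it is left as a hypothetical parameter.

```python
import numpy as np

def sliding_segments(img, win=32, stride=1):
    """Yield 32*32 segments of a grayscale word image of shape (32, W).

    The stride is an assumption; the text only says the window scans densely
    from left to right, without identifying individual characters.
    """
    h, w = img.shape
    assert h == win, "the image is resized to the window height beforehand"
    for x in range(0, w - win + 1, stride):
        yield img[:, x:x + win]

# A 32*100 image yields T = 69 segments at stride 1; T varies with the
# aspect ratio of the image.
segments = list(sliding_segments(np.zeros((32, 100))))
print(len(segments))  # 69
```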
Fig. 2 illustrates an overall pipeline of an apparatus or a system for recognizing text in an image according to an embodiment of the present application. As shown in the top box, an image with the character string “apartment” is divided into segments of the same size, e.g., 32*32. Such division is implemented by scanning with a sliding window and involves no character identification. Each segment may comprise one or more complete or incomplete characters, or no character at all, as can be seen from the second row in the top box.
The middle box in Fig. 2 shows a CNN network, which performs convolution on each of the segments sequentially. As is known, a CNN may comprise several layers. In the prior art, some text recognition technologies have used a CNN for isolated character recognition, in which a character label is output at the last layer of the CNN. According to the present application, the output from the last second layer of the CNN is used instead. The output has 128 feature maps, each of which includes a single neuron. For example, a segment with a size of 32*32 may generate a 128D output from the last second layer of the CNN. For an image which may be divided by a sliding window into T segments, T 128D outputs may be generated, wherein T is a positive integer that varies with the aspect ratio of the image. The output sequence represents high-level deep features of the input image.
The bottom box in Fig. 2 shows an RNN network, which decodes the output sequence from the CNN. As can be seen, the RNN has the same number of channels as the CNN. However, different from the CNN, in which each channel works individually, the sequential channels in the RNN are connected and interact through internal states of the RNN in the hidden layers. Based on such a configuration, for each component in the sequence output from the CNN, estimated probabilities over all possible characters are output, taking into account the relationship with both the previous component (if any) and the following component (if any) in the sequence. The estimated probabilities for each component in the sequence output from the CNN (thus for each segment of the image) are then considered together and mapped into the word string “apartment”, in which repeated labels and non-character labels are removed.
According to an embodiment, the image may be resized to be adaptive to the CNN or the sliding window so that it can be properly processed and recognized. For example, for a sliding window having a size of 32*32, the image may be resized to a height of 32 while keeping its original aspect ratio unchanged.
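As a concrete sketch of this resizing step (using Pillow here as an implementation choice; the patent does not prescribe a library):

```python
from PIL import Image

def resize_word_image(img, target_h=32):
    """Scale a word image to height 32, keeping the original aspect ratio."""
    w, h = img.size
    new_w = max(1, round(w * target_h / h))  # width follows the aspect ratio
    return img.resize((new_w, target_h), Image.BILINEAR)
```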
Although Fig. 2 shows a situation in which a sliding window is used, in another embodiment the sliding window is not necessary. Under such circumstances, the image is convoluted as a whole, and the output from the last second layer of the CNN is a 128*T matrix, which is equivalent to the sequence obtained by connecting T 128D outputs together.
Before describing the CNN and RNN in detail, the process of word image recognition is formulated as a sequence labeling problem as follows. The probability of the correct word string S_w is maximized given an input image I:

θ* = arg max_θ Σ_{(I, S_w) ∈ Ω} log P (S_w | I; θ),  (1)

where θ denotes the parameters of the recurrent system, and (I, S_w) ∈ Ω is a sample pair from a training set Ω, with S_w = {S_1, S_2, ..., S_K} the ground truth word string (containing K characters) of the image I. Commonly, the chain rule is applied to model the joint probability over S_w:

log P (S_w | I) = Σ_{k=1}^{K} log P (S_k | S_1, ..., S_{k-1}, I).  (2)
Thus the sum of the log probabilities over all sample pairs in the training set Ω is optimized to learn the model parameters. An RNN is developed to model the sequential probabilities P (S_k | S_1, ..., S_{k-1}, I), where the variable number of sequentially conditioned characters can be expressed by an internal state of the RNN in its hidden layer, h_t. This internal state is updated when the next sequential input x_t is presented, by computing a non-linear function H:

h_{t+1} = H (h_t, x_t),  (3)

where the non-linear function H defines the exact form of the proposed recurrent system, and X = {x_1, x_2, ..., x_T} is the sequence of CNN features computed from the word image, with x_t ∈ R^128. The designs of X and H play crucial roles in the proposed system. A CNN model is developed to generate the sequential x_t, and H is defined with a long short-term memory (LSTM) architecture.
Both the CNN and the RNN are trained in advance, as described in detail below.
To better understand and implement embodiments of the application, a five-layer Maxout CNN and a bidirectional long short-term memory (LSTM) based RNN are used in the following illustrative example. The LSTM-based RNN may further comprise a connectionist temporal classification (CTC) layer. It is noted that other kinds of CNN and/or RNN may also be used to implement the application.
Fig. 3 illustrates the structure of a five-layer Maxout CNN model used in an embodiment of the present application. As shown, the basic pipeline is to compute a point-wise maximum over a number of grouped feature maps or channels. For example, an input image/segment has a size of 32*32, corresponding to the size of the sliding window. The Maxout CNN network has five convolutional layers, each of which is followed by a two-group or four-group Maxout operation, with different numbers of feature maps, i.e. 48, 64, 128, 128 and 36, respectively. During the convolution, no pooling operation is involved, and the output maps of the last two convolutional layers are just one pixel. This allows the CNN to convolute whole word images at once, leading to significant computational efficiency. Each word image may be resized to the same height of 32, keeping its original aspect ratio unchanged. By applying the learned CNN to the resized image, a 128D CNN sequence may be obtained directly from the output of the last second convolutional layer. This operation is similar to computing deep features independently from the sliding window by moving it densely through the image, but is much more computationally efficient. The Maxout CNN may be trained on 36 classes of case-insensitive character sample images, comprising 26 letters and 10 digits.
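A hedged PyTorch sketch of such a five-layer Maxout CNN is given below. The feature-map counts (48, 64, 128, 128, 36) and the maxout group sizes follow the text; the kernel sizes are illustrative assumptions, chosen so that a 32*32 input collapses to a single pixel at the last two layers.

```python
import torch
import torch.nn as nn

class Maxout2d(nn.Module):
    """Point-wise maximum over groups of feature maps."""
    def __init__(self, groups):
        super().__init__()
        self.groups = groups
    def forward(self, x):
        n, c, h, w = x.shape
        return x.view(n, c // self.groups, self.groups, h, w).max(dim=2).values

class MaxoutCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature-map counts after maxout: 48, 64, 128, 128, 36 (from the text).
        # Shape comments are for a 32x32 input.
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, 9), Maxout2d(2),    # 32x32 -> 24x24, 48 maps
            nn.Conv2d(48, 128, 9), Maxout2d(2),  # 24x24 -> 16x16, 64 maps
            nn.Conv2d(64, 256, 9), Maxout2d(2),  # 16x16 -> 8x8, 128 maps
            nn.Conv2d(128, 512, 8), Maxout2d(4), # 8x8 -> 1x1, 128 maps (penultimate)
        )
        self.classifier = nn.Sequential(
            nn.Conv2d(128, 144, 1), Maxout2d(4), # 1x1, 36 character classes (last layer)
        )
    def forward(self, x):
        return self.classifier(self.features(x))

cnn = MaxoutCNN()
img = torch.randn(1, 1, 32, 100)   # whole word image: height 32, arbitrary width
seq = cnn.features(img)            # penultimate output: 1 x 128 x 1 x T
print(seq.shape)                   # the 128D CNN sequence; T depends on the width
```

Convolving the whole word image in one pass, as above, yields the same 128D sequence that dense sliding-window evaluation would produce, which is the efficiency gain the text describes.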
As mentioned above, for an image divided by a sliding window into T segments, the output from the CNN to the RNN is a matrix X = {x_1, x_2, ..., x_T}, in which each x_t is a 128D vector.
Fig. 4 illustrates the structure of an RNN model used in an embodiment of the present application. The RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence from the input CNN sequence, and the CTC layer generates the word string from the second sequence.
As shown, the bidirectional LSTM has two separate LSTM hidden layers that process the input sequence forward and backward, respectively. Both hidden layers are connected to the same input and output layers.
The main shortcoming of the standard RNN is the vanishing gradient problem, which makes it hard to transmit gradient information consistently over a long time. This is a crucial issue in designing an RNN model, and the long short-term memory (LSTM) was proposed specifically to address this problem. The LSTM defines a new neuron or cell structure in the hidden layer with three additional multiplicative gates: the input gate, the forget gate and the output gate. These new cells are referred to as memory cells, which allow the LSTM to learn meaningful long-range interdependencies. The structure of the memory cells is shown in Fig. 5, where σ is the logistic sigmoid function, realising the non-linearity of the LSTM. The cell activation is a summation of the previous cell activation and the input modulation, which are controlled by the forget gate and the input gate, respectively. These two gates trade off the influences of the previous memory cell and the current input information. The output gate controls how much of the cell activation to transfer to the final hidden state. Each LSTM hidden layer includes 128 LSTM memory cells, each of which has the structure shown in Fig. 5.
Sequence labelling of varying lengths is processed by recurrently applying the LSTM memory to each sequential input x_t (t is an integer from 1 to T), such that all LSTMs share the same parameters. The output of the LSTM, h_t, is fed to the LSTM at the next input x_{t+1}. It is also used to compute the current output, which is transformed into the estimated probabilities over all possible characters. This finally generates a sequence of estimations with the same length as the input sequence, p = {p_1, p_2, p_3, ..., p_T}.
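The following NumPy sketch illustrates one LSTM memory-cell step and its recurrent unrolling over the CNN sequence. The gate wiring is the standard LSTM formulation (peephole connections omitted for brevity), and the random weights are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H = 128, 128                            # input dim (CNN feature) and hidden size
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (4 * H, D + H))    # weights for the 4 gate blocks
b = np.zeros(4 * H)

def lstm_step(x_t, h_prev, c_prev):
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])                   # input modulation
    c = f * c_prev + i * g                 # forget-gate / input-gate trade-off
    h = o * np.tanh(c)                     # output gate controls the hidden state
    return h, c

T = 20
X = rng.normal(size=(T, D))                # stand-in for the 128D CNN sequence
h = c = np.zeros(H)
outputs = []
for x_t in X:                              # the same parameters are shared over all t
    h, c = lstm_step(x_t, h, c)
    outputs.append(h)
print(np.stack(outputs).shape)             # (T, 128): one hidden state per input
```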
Due to the unsegmented nature of the word image at the character level, the length of the LSTM outputs (T) is not consistent with the length of a target word string, |S_w| = K. This makes it difficult to train the RNN directly with the target strings. To this end, connectionist temporal classification (CTC) is applied to approximately map the LSTM sequential output p into its target string as follows:
S_w ≈ B (arg max_π P (π | p)),  (5)
where the projection B removes the repeated labels and the non-character labels. For example, B ("-gg-o-oo-dd-") = "good". The CTC looks for an approximately optimized path π with maximum probability through the LSTM output sequence, which aligns the different lengths of the LSTM sequence and the word string.
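A tiny pure-Python sketch of the projection B, which collapses repeated labels and then drops the non-character (blank) label, written here as '-':

```python
def ctc_collapse(path: str, blank: str = "-") -> str:
    """Projection B: merge repeated labels, then remove the blank label."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

assert ctc_collapse("-gg-o-oo-dd-") == "good"   # the example from the text
```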
The CTC is specifically designed for sequence labelling tasks where it is hard to pre-segment the input sequence into segments that exactly match a target sequence. In our RNN model, the CTC layer is directly connected to the outputs of the LSTMs and works as the output layer of the whole RNN. It not only allows the model to avoid complicated post-processing (e.g. transforming the LSTM output sequence into a word string), but also makes it possible to train the model in an end-to-end fashion by minimizing an overall loss function over (X, S_w) ∈ Ω. The overall loss is computed as the sum, over sample pairs, of the negative log likelihood of the true word string:
O = -Σ_{(X, S_w) ∈ Ω} log P (S_w | X).  (6)
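In a modern framework this loss is available off the shelf; below is a hedged PyTorch sketch, with class 0 used as the blank/non-character label (a convention assumed here, not stated in the text).

```python
import torch
import torch.nn as nn

T, N, C = 25, 1, 37                               # sequence length, batch, 36 chars + blank
log_probs = torch.randn(T, N, C).log_softmax(2)   # stand-in for the LSTM outputs p
target = torch.tensor([[7, 15, 15, 4]])           # hypothetical label indices of a word
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, target,
           torch.tensor([T]),                     # input (LSTM) sequence length
           torch.tensor([4]))                     # target word-string length K
print(loss.item())                                # negative log likelihood of the word
```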
Finally, the RNN model according to the present application follows a bidirectional LSTM architecture, as shown in Fig. 4. It has two separate LSTM hidden layers that process the input sequence forward and backward, respectively. Both hidden layers are connected to the same output layers, allowing access to both past and future information in the sequence.
In an example, the CNN model according to the present application is trained on about 1.8×10^5 character images, and the CNN sequence is generated by applying the trained CNN with a sliding window on the word images, followed by a column-wise normalization. The RNN model contains a bidirectional LSTM architecture. Each LSTM layer has 128 LSTM cell memory blocks. The input layer of our RNN model has 128 neurons (corresponding to the dimensions of the CNN sequence, x_t ∈ R^128), which are fully connected to both hidden layers. The outputs of the two hidden layers are concatenated and then fully connected to the output layer of the LSTM with 37 output classes (including an additional non-character class), by using a softmax function. Thus our RNN model has 273445 parameters in total, which are initialized with a Gaussian distribution of mean 0 and standard deviation 0.01 in the training process.
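The 273445 figure can be checked with a short calculation. It comes out exactly if each LSTM direction uses peephole connections (three extra diagonal weight vectors per direction), which is an assumption here, since the text does not state the cell variant.

```python
D, H, C = 128, 128, 37                    # input dim, hidden size, output classes

per_direction = 4 * (H * D + H * H + H)   # input + recurrent weights + biases, 4 gates
peepholes = 3 * H                         # input/forget/output gate peepholes (assumed)
output_layer = (2 * H) * C + C            # concatenated hidden states -> 37-way softmax

total = 2 * (per_direction + peepholes) + output_layer
print(total)                              # 273445
```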
The recurrent model is trained with steepest descent. The parameters are updated per training sequence by using a learning rate of 10^-4 and a momentum of 0.9. Each input sequence is randomly selected from the training set. A forward-backward algorithm is performed to jointly optimize the bidirectional LSTM and CTC parameters, where a forward propagation is implemented through the whole network, followed by a forward-backward algorithm that aligns the ground truth word strings to the LSTM output maps, π ∈ B^{-1}(S_w), p ∈ R^{37×T}. The loss function of Eq. (6) is computed approximately as:
O ≈ -Σ_{(X, S_w) ∈ Ω} log Σ_{π ∈ B^{-1}(S_w)} Π_{t=1}^{T} p_t (π_t).  (7)
Finally, the approximated error is propagated backward to update the parameters. The RNN is trained on about 3000 word images, taken from the training sets of the three benchmarks mentioned in the next section.
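A hedged sketch of this training configuration in PyTorch (SGD standing in for steepest descent; the model and loss are placeholders, and only the learning rate, momentum, and Gaussian initialization come from the text):

```python
import torch

model = torch.nn.LSTM(128, 128, bidirectional=True)   # stand-in for the full RNN
for p in model.parameters():                           # Gaussian init, mean 0, std 0.01
    torch.nn.init.normal_(p, mean=0.0, std=0.01)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

x = torch.randn(20, 1, 128)     # one randomly selected 128D CNN sequence, T = 20
optimizer.zero_grad()
y, _ = model(x)
loss = y.pow(2).mean()          # placeholder; the real objective is the CTC loss above
loss.backward()                 # error propagated backward through the whole network
optimizer.step()                # one parameter update per training sequence
```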
The performance of the proposed text recognition solution is compared against state-of-the-art methods on three standard benchmarks for cropped word image recognition. The experimental results demonstrate that the method and apparatus of the present application achieve good performance in recognizing words in images when trained with a relatively small number of samples.
Although the preferred embodiments of the present invention have been described, many modifications and changes may be made once those skilled in the art learn of the basic inventive concepts. The appended claims are intended to be construed as comprising these preferred embodiments and all changes and modifications falling within the scope of the present invention.
It will be apparent to those skilled in the art that various modifications and variations could be made to the present application without departing from the spirit and scope of the present invention. Thus, if any modifications and variations lie within the spirit and principle of the present application, the present invention is intended to include these modifications and variations.

Claims (18)

  1. A method for recognizing text in an image, comprising:
    encoding the image into a first sequence with a convolutional neural network (CNN), wherein the first sequence is an output from a last second convolutional layer of the CNN;
    decoding the first sequence with a recurrent neural network (RNN) into a second sequence, which has a same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence; and
    mapping the second sequence into a word string in which repeated labels and non-character labels are removed.
  2. The method of claim 1, wherein encoding the image into a first sequence with a CNN comprises:
    convoluting the image as a whole with the CNN, wherein a result of the convolution obtained from the last second convolutional layer of the CNN is the first sequence.
  3. The method of claim 1, wherein encoding the image into a first sequence with a CNN comprises:
    applying a sliding window to the image to divide the image into segments having a same size; and
    convoluting the segments, individually and sequentially, with the CNN, wherein results of the convolution obtained from the last second convolutional layer of the CNN are components forming the first sequence.
  4. The method of claim 1, prior to the step of encoding, further comprising,
    resizing the image to have a predetermined size.
  5. The method of claim 4, wherein the CNN has been trained with image samples having the predetermined size and outputs 36 classes of different characters at a last layer of the CNN.
  6. The method of claim 1, wherein the output from the last second convolutional layer of the CNN is just one neuron.
  7. The method of claim 1, wherein the RNN has been trained with a set of convolutional sequences and corresponding word strings.
  8. The method of claim 1, wherein the CNN comprises a maxout CNN.
  9. The method of claim 1, wherein the RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence, and the CTC layer generates the word string.
  10. An apparatus for recognizing text in an image, comprising:
    a convolutional neural network (CNN) encoding the image into a first sequence, wherein the first sequence is an output from a last second convolutional layer of the CNN; and
    a recurrent neural network (RNN) decoding the first sequence into a second sequence, which has a same length as the first sequence and indicates estimated probabilities over all possible characters corresponding to each component in the first sequence;
    wherein the RNN further maps the second sequence into a word string in which repeated labels and non-character labels are removed.
  11. The apparatus of claim 10, wherein the CNN encodes the image into a first sequence by:
    convoluting the image as a whole with the CNN, wherein a result of the convolution obtained from the last second convolutional layer of the CNN is the first sequence.
  12. The apparatus of claim 10, wherein the CNN encodes the image into a first sequence by:
    applying a sliding window to the image to divide the image into segments having a same size; and
    convoluting the segments, individually and sequentially, with the CNN, wherein results of the convolution obtained from the last second convolutional layer of the CNN are components forming the first sequence.
  13. The apparatus of claim 10, wherein the image has been resized to have a predetermined size before being input to the CNN.
  14. The apparatus of claim 13, wherein the CNN has been trained with image samples having the predetermined size and outputs 36 classes of different characters at a last layer of the CNN.
  15. The apparatus of claim 10, wherein the output from the last second convolutional layer of the CNN is just one neuron.
  16. The apparatus of claim 10, wherein the RNN has been trained with a set of convolutional sequences and corresponding word strings.
  17. The apparatus of claim 10, wherein the CNN comprises a maxout CNN.
  18. The apparatus of claim 10, wherein the RNN comprises a bidirectional long short-term memory (LSTM) layer and a connectionist temporal classification (CTC) layer, wherein the LSTM layer generates the second sequence, and the CTC layer generates the word string.
PCT/CN2015/081308 2015-06-12 2015-06-12 Methods and apparatus for recognizing text in an image WO2016197381A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580080720.6A CN107636691A (en) 2015-06-12 2015-06-12 Method and apparatus for identifying the text in image
PCT/CN2015/081308 WO2016197381A1 (en) 2015-06-12 2015-06-12 Methods and apparatus for recognizing text in an image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/081308 WO2016197381A1 (en) 2015-06-12 2015-06-12 Methods and apparatus for recognizing text in an image

Publications (1)

Publication Number Publication Date
WO2016197381A1 (en) 2016-12-15

Family

ID=57502873

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/081308 WO2016197381A1 (en) 2015-06-12 2015-06-12 Methods and apparatus for recognizing text in an image

Country Status (2)

Country Link
CN (1) CN107636691A (en)
WO (1) WO2016197381A1 (en)

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194341A (en) * 2017-05-16 2017-09-22 西安电子科技大学 The many convolution neural network fusion face identification methods of Maxout and system
CN107195295A (en) * 2017-05-04 2017-09-22 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN107301860A (en) * 2017-05-04 2017-10-27 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN107480682A (en) * 2017-08-25 2017-12-15 重庆慧都科技有限公司 A kind of commodity packaging date of manufacture detection method
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108230413A (en) * 2018-01-23 2018-06-29 北京市商汤科技开发有限公司 Image Description Methods and device, electronic equipment, computer storage media, program
CN108427953A (en) * 2018-02-26 2018-08-21 北京易达图灵科技有限公司 A kind of character recognition method and device
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN108776779A (en) * 2018-05-25 2018-11-09 西安电子科技大学 SAR Target Recognition of Sequential Images methods based on convolution loop network
CN109242796A (en) * 2018-09-05 2019-01-18 北京旷视科技有限公司 Character image processing method, device, electronic equipment and computer storage medium
CN109753966A (en) * 2018-12-16 2019-05-14 初速度(苏州)科技有限公司 A kind of Text region training system and method
CN109840524A (en) * 2019-01-04 2019-06-04 平安科技(深圳)有限公司 Kind identification method, device, equipment and the storage medium of text
CN110175610A (en) * 2019-05-23 2019-08-27 上海交通大学 A kind of bill images text recognition method for supporting secret protection
CN110210581A (en) * 2019-04-28 2019-09-06 平安科技(深圳)有限公司 A kind of handwritten text recognition methods and device, electronic equipment
WO2019194356A1 (en) * 2018-04-02 2019-10-10 삼성전자주식회사 Electronic device and control method thereof
CN110766017A (en) * 2019-10-22 2020-02-07 国网新疆电力有限公司信息通信公司 Mobile terminal character recognition method and system based on deep learning
CN111160348A (en) * 2019-11-20 2020-05-15 中国科学院深圳先进技术研究院 Text recognition method for natural scene, storage device and computer equipment
CN111325203A (en) * 2020-01-21 2020-06-23 福州大学 American license plate recognition method and system based on image correction
CN111428727A (en) * 2020-03-27 2020-07-17 华南理工大学 Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN111461116A (en) * 2020-03-25 2020-07-28 深圳市云恩科技有限公司 Ship board text recognition model, modeling method and training method thereof
CN111461105A (en) * 2019-01-18 2020-07-28 顺丰科技有限公司 Text recognition method and device
CN111651980A (en) * 2020-05-27 2020-09-11 河南科技学院 Wheat cold resistance identification method with hybrid neural network fused with Attention mechanism
US10817741B2 (en) 2016-02-29 2020-10-27 Alibaba Group Holding Limited Word segmentation system, method and device
CN111860460A (en) * 2020-08-05 2020-10-30 江苏新安电器股份有限公司 Application method of improved LSTM model in human behavior recognition
CN111985484A (en) * 2020-08-11 2020-11-24 云南电网有限责任公司电力科学研究院 CNN-LSTM-based temperature instrument digital identification method and device
CN112052852A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Character recognition method of handwritten meteorological archive data based on deep learning
CN112508023A (en) * 2020-10-27 2021-03-16 重庆大学 Deep learning-based end-to-end identification method for code-spraying characters of parts
WO2021079347A1 (en) * 2019-10-25 2021-04-29 Element Ai Inc. 2d document extractor
CN112990208A (en) * 2019-12-12 2021-06-18 搜狗(杭州)智能科技有限公司 Text recognition method and device
US11049018B2 (en) 2017-06-23 2021-06-29 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
CN113128490A (en) * 2021-04-28 2021-07-16 湖南荣冠智能科技有限公司 Prescription information scanning and automatic identification method
CN113837282A (en) * 2021-09-24 2021-12-24 上海脉衍人工智能科技有限公司 Natural scene text recognition method and computing device
US11481605B2 (en) 2019-10-25 2022-10-25 Servicenow Canada Inc. 2D document extractor
EP4191471A4 (en) * 2020-07-30 2024-01-17 Shanghai Goldway Intelligent Transportation System Co., Ltd. Sequence recognition method and apparatus, image processing device, and storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388896B (en) * 2018-02-09 2021-06-22 杭州雄迈集成电路技术股份有限公司 License plate identification method based on dynamic time sequence convolution neural network
CN108682418B (en) * 2018-06-26 2022-03-04 北京理工大学 Speech recognition method based on pre-training and bidirectional LSTM
CN109214378A (en) * 2018-08-16 2019-01-15 新智数字科技有限公司 A kind of method and system integrally identifying metering meter reading based on neural network
TWI677826B (en) * 2018-09-19 2019-11-21 國家中山科學研究院 License plate recognition system and method
CN109784340A (en) * 2018-12-14 2019-05-21 北京市首都公路发展集团有限公司 A kind of licence plate recognition method and device
CN109726657B (en) * 2018-12-21 2023-06-09 万达信息股份有限公司 Deep learning scene text sequence recognition method
CN109919150A (en) * 2019-01-23 2019-06-21 浙江理工大学 A kind of non-division recognition sequence method and system of 3D pressed characters
CN110188761A (en) * 2019-04-22 2019-08-30 平安科技(深圳)有限公司 Recognition methods, device, computer equipment and the storage medium of identifying code
CN113450433B (en) * 2020-03-26 2024-08-16 阿里巴巴集团控股有限公司 Picture generation method, device, computer equipment and medium
CN112232195B (en) * 2020-10-15 2024-02-20 北京临近空间飞行器系统工程研究所 Handwritten Chinese character recognition method, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1694130A (en) * 2005-03-24 2005-11-09 上海大学 Identification method of mobile number plate based on three-channel parallel artificial nerve network
US20060045341A1 (en) * 2004-08-31 2006-03-02 Samsung Electronics Co., Ltd. Apparatus and method for high-speed character recognition
CN101957920A (en) * 2010-09-08 2011-01-26 中国人民解放军国防科学技术大学 Vehicle license plate searching method based on digital videos
KR20130122842A (en) * 2012-05-01 2013-11-11 한국생산기술연구원 Encoding and decoding method of ls cord

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060045341A1 (en) * 2004-08-31 2006-03-02 Samsung Electronics Co., Ltd. Apparatus and method for high-speed character recognition
CN1694130A (en) * 2005-03-24 2005-11-09 上海大学 Identification method of mobile number plate based on three-channel parallel artificial nerve network
CN101957920A (en) * 2010-09-08 2011-01-26 中国人民解放军国防科学技术大学 Vehicle license plate searching method based on digital videos
KR20130122842A (en) * 2012-05-01 2013-11-11 한국생산기술연구원 Encoding and decoding method of ls cord

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10817741B2 (en) 2016-02-29 2020-10-27 Alibaba Group Holding Limited Word segmentation system, method and device
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107195295A (en) * 2017-05-04 2017-09-22 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN107301860A (en) * 2017-05-04 2017-10-27 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN107194341A (en) * 2017-05-16 2017-09-22 西安电子科技大学 The many convolution neural network fusion face identification methods of Maxout and system
CN107194341B (en) * 2017-05-16 2020-04-21 西安电子科技大学 Face recognition method and system based on fusion of Maxout multi-convolution neural network
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108228686B (en) * 2017-06-15 2021-03-23 北京市商汤科技开发有限公司 Method and device for realizing image-text matching and electronic equipment
US11049018B2 (en) 2017-06-23 2021-06-29 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
US11645530B2 (en) 2017-06-23 2023-05-09 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
CN107480682A (en) * 2017-08-25 2017-12-15 重庆慧都科技有限公司 A kind of commodity packaging date of manufacture detection method
CN108230413B (en) * 2018-01-23 2021-07-06 北京市商汤科技开发有限公司 Image description method and device, electronic equipment and computer storage medium
CN108230413A (en) * 2018-01-23 2018-06-29 北京市商汤科技开发有限公司 Image Description Methods and device, electronic equipment, computer storage media, program
CN108427953A (en) * 2018-02-26 2018-08-21 北京易达图灵科技有限公司 A kind of character recognition method and device
WO2019194356A1 (en) * 2018-04-02 2019-10-10 삼성전자주식회사 Electronic device and control method thereof
US11482025B2 (en) 2018-04-02 2022-10-25 Samsung Electronics Co., Ltd. Electronic device and control method thereof
CN108776779A (en) * 2018-05-25 2018-11-09 西安电子科技大学 SAR Target Recognition of Sequential Images methods based on convolution loop network
CN108776779B (en) * 2018-05-25 2022-09-23 西安电子科技大学 Convolutional-circulation-network-based SAR sequence image target identification method
CN109242796A (en) * 2018-09-05 2019-01-18 北京旷视科技有限公司 Character image processing method, device, electronic equipment and computer storage medium
CN109753966A (en) * 2018-12-16 2019-05-14 初速度(苏州)科技有限公司 A kind of Text region training system and method
CN109840524B (en) * 2019-01-04 2023-07-11 平安科技(深圳)有限公司 Text type recognition method, device, equipment and storage medium
CN109840524A (en) * 2019-01-04 2019-06-04 平安科技(深圳)有限公司 Kind identification method, device, equipment and the storage medium of text
CN111461105A (en) * 2019-01-18 2020-07-28 顺丰科技有限公司 Text recognition method and device
CN111461105B (en) * 2019-01-18 2023-11-28 顺丰科技有限公司 Text recognition method and device
CN110210581B (en) * 2019-04-28 2023-11-24 平安科技(深圳)有限公司 Handwriting text recognition method and device and electronic equipment
CN110210581A (en) * 2019-04-28 2019-09-06 平安科技(深圳)有限公司 A kind of handwritten text recognition methods and device, electronic equipment
CN110175610A (en) * 2019-05-23 2019-08-27 上海交通大学 A kind of bill images text recognition method for supporting secret protection
CN110175610B (en) * 2019-05-23 2023-09-05 上海交通大学 Bill image text recognition method supporting privacy protection
CN110766017A (en) * 2019-10-22 2020-02-07 国网新疆电力有限公司信息通信公司 Mobile terminal character recognition method and system based on deep learning
CN110766017B (en) * 2019-10-22 2023-08-04 国网新疆电力有限公司信息通信公司 Mobile terminal text recognition method and system based on deep learning
EP4049167A4 (en) * 2019-10-25 2022-12-21 Servicenow Canada Inc. 2d document extractor
US11481605B2 (en) 2019-10-25 2022-10-25 Servicenow Canada Inc. 2D document extractor
WO2021079347A1 (en) * 2019-10-25 2021-04-29 Element Ai Inc. 2d document extractor
CN111160348A (en) * 2019-11-20 2020-05-15 中国科学院深圳先进技术研究院 Text recognition method for natural scene, storage device and computer equipment
CN112990208B (en) * 2019-12-12 2024-04-30 北京搜狗科技发展有限公司 Text recognition method and device
CN112990208A (en) * 2019-12-12 2021-06-18 搜狗(杭州)智能科技有限公司 Text recognition method and device
CN111325203B (en) * 2020-01-21 2022-07-05 福州大学 American license plate recognition method and system based on image correction
CN111325203A (en) * 2020-01-21 2020-06-23 福州大学 American license plate recognition method and system based on image correction
CN111461116B (en) * 2020-03-25 2024-02-02 深圳市云恩科技有限公司 Ship board text recognition model structure, modeling method and training method thereof
CN111461116A (en) * 2020-03-25 2020-07-28 深圳市云恩科技有限公司 Ship board text recognition model, modeling method and training method thereof
CN111428727B (en) * 2020-03-27 2023-04-07 华南理工大学 Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN111428727A (en) * 2020-03-27 2020-07-17 华南理工大学 Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN111651980A (en) * 2020-05-27 2020-09-11 河南科技学院 Wheat cold resistance identification method with hybrid neural network fused with Attention mechanism
EP4191471A4 (en) * 2020-07-30 2024-01-17 Shanghai Goldway Intelligent Transportation System Co., Ltd. Sequence recognition method and apparatus, image processing device, and storage medium
CN111860460A (en) * 2020-08-05 2020-10-30 江苏新安电器股份有限公司 Application method of improved LSTM model in human behavior recognition
CN111985484A (en) * 2020-08-11 2020-11-24 云南电网有限责任公司电力科学研究院 CNN-LSTM-based temperature instrument digital identification method and device
CN112052852A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Character recognition method of handwritten meteorological archive data based on deep learning
CN112052852B (en) * 2020-09-09 2023-12-29 国家气象信息中心 Character recognition method of handwriting meteorological archive data based on deep learning
CN112508023A (en) * 2020-10-27 2021-03-16 重庆大学 Deep learning-based end-to-end identification method for code-spraying characters of parts
CN113128490B (en) * 2021-04-28 2023-12-05 湖南荣冠智能科技有限公司 Prescription information scanning and automatic identification method
CN113128490A (en) * 2021-04-28 2021-07-16 湖南荣冠智能科技有限公司 Prescription information scanning and automatic identification method
CN113837282A (en) * 2021-09-24 2021-12-24 上海脉衍人工智能科技有限公司 Natural scene text recognition method and computing device
CN113837282B (en) * 2021-09-24 2024-02-02 上海脉衍人工智能科技有限公司 Natural scene text recognition method and computing device

Also Published As

Publication number Publication date
CN107636691A (en) 2018-01-26

Similar Documents

Publication Publication Date Title
WO2016197381A1 (en) Methods and apparatus for recognizing text in an image
US11823050B2 (en) Semi-supervised person re-identification using multi-view clustering
US11138441B2 (en) Video action segmentation by mixed temporal domain adaption
US20210027098A1 (en) Weakly Supervised Image Segmentation Via Curriculum Learning
US11776236B2 (en) Unsupervised representation learning with contrastive prototypes
EP3690740B1 (en) Method for optimizing hyperparameters of auto-labeling device which auto-labels training images for use in deep learning network to analyze images with high precision, and optimizing device using the same
WO2021135254A1 (en) License plate number recognition method and apparatus, electronic device, and storage medium
US20190303535A1 (en) Interpretable bio-medical link prediction using deep neural representation
US11797845B2 (en) Model learning device, model learning method, and program
CN110321967B (en) Image classification improvement method based on convolutional neural network
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
Zuo et al. Challenging tough samples in unsupervised domain adaptation
WO2015192263A1 (en) A method and a system for face verification
US20240242487A1 (en) Transfer learning in image recognition systems
CN107945210B (en) Target tracking method based on deep learning and environment self-adaption
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
Rajani et al. A convolutional vision transformer for semantic segmentation of side-scan sonar data
CN116630369A (en) Unmanned aerial vehicle target tracking method based on space-time memory network
CN115908806A (en) Small sample image segmentation method based on lightweight multi-scale feature enhancement network
CN114913339A (en) Training method and device of feature map extraction model
Liu et al. Multi-digit recognition with convolutional neural network and long short-term memory
CN113792822B (en) Efficient dynamic image classification method
Kumar et al. Analysis and fast feature selection technique for real-time face detection materials using modified region optimized convolutional neural network
Passalis et al. Multilayer probabilistic knowledge transfer for learning image representations
Rajput et al. Handwritten Digit Recognition using Convolution Neural Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15894651

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29/03/2018)

122 Ep: pct application non-entry in european phase

Ref document number: 15894651

Country of ref document: EP

Kind code of ref document: A1